CartoDB / cartoframes

CARTO Python package for data scientists
BSD 3-Clause "New" or "Revised" License
251 stars 63 forks source link

fail to upload a large data table using cc.write #1012

Closed djfan closed 4 years ago

djfan commented 5 years ago

The dataset is around 1.6 gb including geometry column. Failed to upload it using cc.write. Here's the error message.

---------------------------------------------------------------------------
SysCallError                              Traceback (most recent call last)
/opt/conda/anaconda/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in _send_until_done(self, data)
    319             try:
--> 320                 return self.connection.send(data)
    321             except OpenSSL.SSL.WantWriteError:

/opt/conda/anaconda/lib/python3.6/site-packages/OpenSSL/SSL.py in send(self, buf, flags)
   1736         result = _lib.SSL_write(self._ssl, buf, len(buf))
-> 1737         self._raise_ssl_error(self._ssl, result)
   1738         return result

/opt/conda/anaconda/lib/python3.6/site-packages/OpenSSL/SSL.py in _raise_ssl_error(self, ssl, result)
   1638                     if errno != 0:
-> 1639                         raise SysCallError(errno, errorcode.get(errno))
   1640                 raise SysCallError(-1, "Unexpected EOF")

SysCallError: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
/opt/conda/anaconda/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    469                     for i in request.body:
--> 470                         low_conn.send(hex(len(i))[2:].encode('utf-8'))
    471                         low_conn.send(b'\r\n')

/opt/conda/anaconda/lib/python3.6/http/client.py in send(self, data)
    995         try:
--> 996             self.sock.sendall(data)
    997         except TypeError:

/opt/conda/anaconda/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in sendall(self, data)
    330         while total_sent < len(data):
--> 331             sent = self._send_until_done(data[total_sent:total_sent + SSL_WRITE_BLOCKSIZE])
    332             total_sent += sent

/opt/conda/anaconda/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in _send_until_done(self, data)
    325             except OpenSSL.SSL.SysCallError as e:
--> 326                 raise SocketError(str(e))
    327 

OSError: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
/opt/conda/anaconda/lib/python3.6/site-packages/carto/auth.py in send(self, relative_path, http_method, **requests_args)
    146 
--> 147             response = super(APIKeyAuthClient, self).send(relative_path, http_method, **requests_args)
    148         except Exception as e:

/opt/conda/anaconda/lib/python3.6/site-packages/pyrestcli/auth.py in send(self, relative_path, http_method, **requests_args)
     35 
---> 36         return self.session.request(http_method, url, **requests_args)
     37 

/opt/conda/anaconda/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    532         send_kwargs.update(settings)
--> 533         resp = self.send(prep, **send_kwargs)
    534 

/opt/conda/anaconda/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    645         # Send the request
--> 646         r = adapter.send(request, **kwargs)
    647 

/opt/conda/anaconda/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    497         except (ProtocolError, socket.error) as err:
--> 498             raise ConnectionError(err, request=request)
    499 

ConnectionError: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

CartoException                            Traceback (most recent call last)
/opt/conda/anaconda/lib/python3.6/site-packages/carto/sql.py in copyfrom(self, query, iterable_data, compress, compression_level)
    378                                         headers=headers,
--> 379                                         stream=True)
    380             response_json = self.client.get_response_data(response)

/opt/conda/anaconda/lib/python3.6/site-packages/carto/auth.py in send(self, relative_path, http_method, **requests_args)
    148         except Exception as e:
--> 149             raise CartoException(e)
    150 

CartoException: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

CartoException                            Traceback (most recent call last)
<ipython-input-18-96c4b3430bd5> in <module>
----> 1 ap.write(data, 'appearhere_uk_2019aug_grid_18', overwrite=True)

/opt/conda/anaconda/lib/python3.6/site-packages/cartoframes/context.py in write(self, df, table_name, temp_dir, overwrite, lnglat, encode_geom, geom_col, **kwargs)
    349             if_exists = Dataset.REPLACE
    350 
--> 351         dataset.upload(with_lnglat=lnglat, if_exists=if_exists, table_name=table_name, context=self)
    352 
    353         tqdm.write('Table successfully written to CARTO: {table_url}'.format(

/opt/conda/anaconda/lib/python3.6/site-packages/cartoframes/data/dataset.py in upload(self, with_lnglat, if_exists, table_name, schema, context)
    419                 raise already_exists_error
    420 
--> 421             self._copyfrom(with_lnglat)
    422 
    423         elif self._query is not None:

/opt/conda/anaconda/lib/python3.6/site-packages/cartoframes/data/dataset.py in _copyfrom(self, with_lnglat)
    553             """COPY {table_name}({columns},the_geom)
    554                FROM stdin WITH (FORMAT csv, DELIMITER '|');""".format(table_name=self._table_name, columns=columns),
--> 555             self._rows(self._df, [c for c in self._df.columns if c != 'cartodb_id'], with_lnglat, geom_col)
    556         )
    557 

/opt/conda/anaconda/lib/python3.6/site-packages/carto/sql.py in copyfrom(self, query, iterable_data, compress, compression_level)
    382             raise e
    383         except Exception as e:
--> 384             raise CartoException(e)
    385 
    386         return response_json

CartoException: (32, 'EPIPE')
alrocar commented 4 years ago

A fix in the COPY API has been deployed, see related issue.

@djfan could you help us to test a Dataset.upload with a more than 1GB csv?

Thanks!

Jesus89 commented 4 years ago

@andy-esch did a test with a ~1GB dataset without issues (https://github.com/CartoDB/cartoframes/issues/936#issuecomment-560530516). So I think we can close this one.

Open it if the bug appears again.

djfan commented 4 years ago

sorry it happens again using b6

if you want to try, here's the data https://drive.google.com/file/d/1EMLiPWHT61eJpR_FLbCjW35E1XRMNEoZ/view?usp=sharing

Screen Shot 2019-12-04 at 4 29 35 PM

cc @andy-esch

andy-esch commented 4 years ago

@djfan can you include the full stacktrace?

Jesus89 commented 4 years ago

Is it because the size is >1GB? Could you test without 50% of the rows?

djfan commented 4 years ago

@andy-esch sure

full stacktrace

``` Debug: creating table "ph_grid_data" --------------------------------------------------------------------------- SysCallError Traceback (most recent call last) ~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in _send_until_done(self, data) 316 try: --> 317 return self.connection.send(data) 318 except OpenSSL.SSL.WantWriteError: ~/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py in send(self, buf, flags) 1728 result = _lib.SSL_write(self._ssl, buf, len(buf)) -> 1729 self._raise_ssl_error(self._ssl, result) 1730 return result ~/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py in _raise_ssl_error(self, ssl, result) 1630 if errno != 0: -> 1631 raise SysCallError(errno, errorcode.get(errno)) 1632 raise SysCallError(-1, "Unexpected EOF") SysCallError: (32, 'EPIPE') During handling of the above exception, another exception occurred: OSError Traceback (most recent call last) ~/anaconda3/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies) 469 for i in request.body: --> 470 low_conn.send(hex(len(i))[2:].encode('utf-8')) 471 low_conn.send(b'\r\n') ~/anaconda3/lib/python3.6/http/client.py in send(self, data) 985 try: --> 986 self.sock.sendall(data) 987 except TypeError: ~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in sendall(self, data) 327 while total_sent < len(data): --> 328 sent = self._send_until_done(data[total_sent:total_sent + SSL_WRITE_BLOCKSIZE]) 329 total_sent += sent ~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in _send_until_done(self, data) 322 except OpenSSL.SSL.SysCallError as e: --> 323 raise SocketError(str(e)) 324 OSError: (32, 'EPIPE') During handling of the above exception, another exception occurred: ConnectionError Traceback (most recent call last) ~/anaconda3/lib/python3.6/site-packages/carto/auth.py in send(self, relative_path, http_method, **requests_args) 146 --> 147 response = super(APIKeyAuthClient, self).send(relative_path, http_method, **requests_args) 148 except Exception as e: ~/anaconda3/lib/python3.6/site-packages/pyrestcli/auth.py in send(self, relative_path, http_method, **requests_args) 35 ---> 36 return self.session.request(http_method, url, **requests_args) 37 ~/anaconda3/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json) 532 send_kwargs.update(settings) --> 533 resp = self.send(prep, **send_kwargs) 534 ~/anaconda3/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs) 645 # Send the request --> 646 r = adapter.send(request, **kwargs) 647 ~/anaconda3/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies) 497 except (ProtocolError, socket.error) as err: --> 498 raise ConnectionError(err, request=request) 499 ConnectionError: (32, 'EPIPE') During handling of the above exception, another exception occurred: CartoException Traceback (most recent call last) ~/anaconda3/lib/python3.6/site-packages/carto/sql.py in copyfrom(self, query, iterable_data, compress, compression_level) 375 headers=headers, --> 376 stream=True) 377 response_json = self.client.get_response_data(response) ~/anaconda3/lib/python3.6/site-packages/carto/auth.py in send(self, relative_path, http_method, **requests_args) 148 except Exception as e: --> 149 raise CartoException(e) 150 CartoException: (32, 'EPIPE') During handling of the above exception, another exception occurred: CartoException Traceback (most recent call last) in ----> 1 data.to_carto('ph_grid_data', if_exists='replace') ~/anaconda3/lib/python3.6/site-packages/cartoframes/core/cartodataframe.py in to_carto(self, *args, **kwargs) 117 """ 118 from ..io.carto import to_carto --> 119 return to_carto(self, *args, **kwargs) 120 121 def viz(self, *args, **kwargs): ~/anaconda3/lib/python3.6/site-packages/cartoframes/io/carto.py in to_carto(dataframe, table_name, credentials, if_exists, geom_col, index, index_label, log_enabled, force_cartodbfy) 105 cartodbfy = force_cartodbfy or has_geometry 106 --> 107 context_manager.copy_from(cdf, table_name, if_exists, cartodbfy, log_enabled) 108 109 if log_enabled: ~/anaconda3/lib/python3.6/site-packages/cartoframes/core/managers/context_manager.py in copy_from(self, cdf, table_name, if_exists, cartodbfy, log_enabled) 60 table_name=table_name, schema=schema)) 61 ---> 62 return self._copy_from(cdf, table_name, dataframe_columns_info) 63 64 def create_table_from_query(self, table_name, query, if_exists, cartodbfy=True, log_enabled=True): ~/anaconda3/lib/python3.6/site-packages/cartoframes/core/managers/context_manager.py in _copy_from(self, dataframe, table_name, dataframe_columns_info) 261 data = _compute_copy_data(dataframe, dataframe_columns_info) 262 --> 263 self.copy_client.copyfrom(query, data) 264 265 def normalize_table_name(self, table_name): ~/anaconda3/lib/python3.6/site-packages/carto/sql.py in copyfrom(self, query, iterable_data, compress, compression_level) 379 raise e 380 except Exception as e: --> 381 raise CartoException(e) 382 383 return response_json CartoException: (32, 'EPIPE') ```

djfan commented 4 years ago

@Jesus89 it's a ~5G shapefile. and the account is a trial account. Does that matter?

djfan commented 4 years ago

background & update:

it's the demographic data for philippine (using a special grid system).

Screen Shot 2019-12-05 at 1 02 11 PM

I was able to upload a small sample of it but failed for the whole dataset. In the end, matt used superadmin to log in to client account and uploaded it through builder dashboard.

alrocar commented 4 years ago

AFAIK there's a hard limit of 2GB in the COPY API, so that might be the culprit. Also the disk quota of the account matters of course, but it does not seem the problem in this case since you were able to do it from the Import API.

@djfan Could you tell us the name of the user?

@Jesus89 we should check if uploading a file bigger than 2GB raises this error from the COPY API: COPY FROM maximum POST size of 2GB exceeded and make sure we print that instead of the standard EPIPE thing.

djfan commented 4 years ago

@alrocar sure. yum-admin

alrocar commented 4 years ago

For the records -> https://carto.com/developers/sql-api/guides/copy-queries/#limits

The limits in superadmin apply to the Import API not to the COPY API, so the error makes sense.

After talking to @Jesus89 we might want to improve the error message in this specific case. We should be actually receiving this error COPY FROM maximum POST size of 2GB exceeded from the API. I'd try to capture it and add more meaningful information like: split the source dataset and upload in batches using if_exists="append" or even better a link to a guide or help center resource explaining how to do that.

andrewbt commented 4 years ago

Just for my own clarification and education, does the 2gb hard limit to COPY mean there is no way to upload datasets larger than 2 GB to CARTO? Or just that larger datasets need to be "chunked" into a series of smaller POST requests none of which exceeding 2 GB as part of the Transfer-Encoding: chunked header upload process typically used with COPY?

Maybe the Python SDK / cartoframes should add that transfer encoding header on all calls that use COPY by default to avoid the error.

oriolbx commented 4 years ago

@andrewbt With SQL API COPY FROM command itself you can compress the CSV files that are bigger than 2GB into a .gz file and add the header -H "Content-Encoding: gzip".

curl -X POST  \
    -H "Content-Encoding: gzip" \
    -H "Transfer-Encoding: chunked" \
    -H "Content-Type: application/octet-stream" \
    --data-binary @upload_example.csv.gz \
    "https://{username}.carto.com/api/v2/sql/copyfrom?api_key={api_key}&q=COPY+upload_example+(the_geom,+name,+age)+FROM+STDIN+WITH+(FORMAT+csv,+HEADER+true)"

If the compressed .gz file is lower than 2 GB, it can be used to upload data with the SQL API COPY FROM command, otherwise it needs to be chunked.

Within CARTOframes itself, I'm not sure how the method to_carto() handles this 2GB limit for the dataframe (if it compresses or not the data before uploading it into CARTO).

alrocar commented 4 years ago

does the 2gb hard limit to COPY mean there is no way to upload datasets larger than 2 GB to CARTO

COPY appends by default, so you can upload larger datasets as long as you split them properly.

Or just that larger datasets need to be "chunked" into a series of smaller POST requests none of which exceeding 2 GB as part of the Transfer-Encoding: chunked header upload process typically used with COPY?

Datasets travel compressed and with the Transfer-Encoding header. That allows to stream large files without the need to load them in memory (both in the client and the server).

The problem with this issue is there's a hard limit in the API, and it closes the connection after 2GB of data have been received.

The plan to fix this is:

  1. Properly catch the error and report to the user. They should split their file into smaller files of 2GB max.
  2. Make CF smarter by doing that automatically. For now this is a nice to have, since we don't have metrics on how this feature is going to be used, but at some point it could be feasible.
Jesus89 commented 4 years ago

There is also another approach using COPY BINARY, but this step requires more research for the generation of the binary query.

andrewbt commented 4 years ago

Just saw where the 2 GB limit is documented: https://carto.com/developers/sql-api/guides/copy-queries/#limits

Thanks! I feel like I've known the COPY endpoint a lot and yet never knew about this limit :)

Jesus89 commented 4 years ago

https://app.clubhouse.io/cartoteam/story/93723/optimize-upload-to-carto

Jesus89 commented 4 years ago

Done. It will be available in the next stable release 1.0.5