Closed by djfan 4 years ago.
A fix in the COPY API has been deployed, see related issue.
@djfan could you help us test a `Dataset.upload` with a CSV larger than 1GB?
Thanks!
@andy-esch did a test with a ~1GB dataset without issues (https://github.com/CartoDB/cartoframes/issues/936#issuecomment-560530516), so I think we can close this one.
Reopen it if the bug appears again.
Sorry, it happens again using b6.
If you want to try, here's the data: https://drive.google.com/file/d/1EMLiPWHT61eJpR_FLbCjW35E1XRMNEoZ/view?usp=sharing
cc @andy-esch
@djfan can you include the full stacktrace?
Is it because the size is >1GB? Could you test with 50% of the rows removed?
@andy-esch sure
```
Debug: creating table "ph_grid_data"
---------------------------------------------------------------------------
SysCallError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in _send_until_done(self, data)
316 try:
--> 317 return self.connection.send(data)
318 except OpenSSL.SSL.WantWriteError:
~/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py in send(self, buf, flags)
1728 result = _lib.SSL_write(self._ssl, buf, len(buf))
-> 1729 self._raise_ssl_error(self._ssl, result)
1730 return result
~/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py in _raise_ssl_error(self, ssl, result)
1630 if errno != 0:
-> 1631 raise SysCallError(errno, errorcode.get(errno))
1632 raise SysCallError(-1, "Unexpected EOF")
SysCallError: (32, 'EPIPE')
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
469 for i in request.body:
--> 470 low_conn.send(hex(len(i))[2:].encode('utf-8'))
471 low_conn.send(b'\r\n')
~/anaconda3/lib/python3.6/http/client.py in send(self, data)
985 try:
--> 986 self.sock.sendall(data)
987 except TypeError:
~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in sendall(self, data)
327 while total_sent < len(data):
--> 328 sent = self._send_until_done(data[total_sent:total_sent + SSL_WRITE_BLOCKSIZE])
329 total_sent += sent
~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in _send_until_done(self, data)
322 except OpenSSL.SSL.SysCallError as e:
--> 323 raise SocketError(str(e))
324
OSError: (32, 'EPIPE')
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/carto/auth.py in send(self, relative_path, http_method, **requests_args)
146
--> 147 response = super(APIKeyAuthClient, self).send(relative_path, http_method, **requests_args)
148 except Exception as e:
~/anaconda3/lib/python3.6/site-packages/pyrestcli/auth.py in send(self, relative_path, http_method, **requests_args)
35
---> 36 return self.session.request(http_method, url, **requests_args)
37
~/anaconda3/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
532 send_kwargs.update(settings)
--> 533 resp = self.send(prep, **send_kwargs)
534
~/anaconda3/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
645 # Send the request
--> 646 r = adapter.send(request, **kwargs)
647
~/anaconda3/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
497 except (ProtocolError, socket.error) as err:
--> 498 raise ConnectionError(err, request=request)
499
ConnectionError: (32, 'EPIPE')
During handling of the above exception, another exception occurred:
CartoException Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/carto/sql.py in copyfrom(self, query, iterable_data, compress, compression_level)
375 headers=headers,
--> 376 stream=True)
377 response_json = self.client.get_response_data(response)
~/anaconda3/lib/python3.6/site-packages/carto/auth.py in send(self, relative_path, http_method, **requests_args)
148 except Exception as e:
--> 149 raise CartoException(e)
150
CartoException: (32, 'EPIPE')
During handling of the above exception, another exception occurred:
CartoException Traceback (most recent call last)
```
@Jesus89 it's a ~5GB shapefile, and the account is a trial account. Does that matter?
Background & update:
It's the demographic data for the Philippines (using a special grid system).
I was able to upload a small sample of it but failed with the whole dataset. In the end, Matt used superadmin to log in to the client account and uploaded it through the Builder dashboard.
AFAIK there's a hard limit of 2GB in the COPY API, so that might be the culprit. The disk quota of the account also matters, of course, but it does not seem to be the problem in this case since you were able to do it from the Import API.
@djfan Could you tell us the name of the user?
@Jesus89 we should check whether uploading a file bigger than 2GB raises this error from the COPY API: `COPY FROM maximum POST size of 2GB exceeded`, and make sure we print that instead of the standard EPIPE error.
@alrocar sure. yum-admin
For the record -> https://carto.com/developers/sql-api/guides/copy-queries/#limits
The limits in superadmin apply to the Import API, not to the COPY API, so the error makes sense.
After talking to @Jesus89, we might want to improve the error message in this specific case. We should actually be receiving the error `COPY FROM maximum POST size of 2GB exceeded` from the API. I'd try to capture it and add more meaningful information, like suggesting to split the source dataset and upload it in batches using `if_exists="append"`, or even better a link to a guide or Help Center resource explaining how to do that.
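The batching workaround can be sketched as follows. This is a generic splitter, not cartoframes code: `upload` is a hypothetical callable standing in for `to_carto(df, table_name, if_exists=...)`:

```python
# Sketch of the suggested workaround: split the rows into batches and
# upload each one. The first batch replaces the table; subsequent ones
# append, mirroring to_carto's if_exists semantics.

def upload_in_batches(rows, upload, batch_size=100_000):
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        mode = 'replace' if start == 0 else 'append'
        upload(batch, if_exists=mode)
```

With a pandas DataFrame you would slice with `df.iloc[start:start + batch_size]` instead, keeping each batch's serialized size under the 2GB limit.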
Just for my own clarification and education, does the 2GB hard limit on COPY mean there is no way to upload datasets larger than 2GB to CARTO? Or just that larger datasets need to be "chunked" into a series of smaller POST requests, none of which exceeds 2GB, as part of the `Transfer-Encoding: chunked` header upload process typically used with COPY?
Maybe the Python SDK / cartoframes should add that transfer encoding header on all calls that use COPY by default to avoid the error.
@andrewbt With the SQL API COPY FROM command itself, you can compress CSV files that are bigger than 2GB into a .gz file and add the header `-H "Content-Encoding: gzip"`:
```shell
curl -X POST \
  -H "Content-Encoding: gzip" \
  -H "Transfer-Encoding: chunked" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @upload_example.csv.gz \
  "https://{username}.carto.com/api/v2/sql/copyfrom?api_key={api_key}&q=COPY+upload_example+(the_geom,+name,+age)+FROM+STDIN+WITH+(FORMAT+csv,+HEADER+true)"
```
If the compressed .gz file is smaller than 2GB, it can be used to upload data with the SQL API COPY FROM command; otherwise it needs to be chunked.
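As a sketch of that pre-compression step, the helper below gzips a CSV block by block (so the whole file never sits in memory) and reports whether the result fits under the limit. The function name and return shape are made up for illustration:

```python
import gzip
import os

def gzip_csv(path, limit_bytes=2 * 1024 ** 3):
    """Compress `path` to `path + '.gz'` in 1MB blocks and report whether
    the compressed file fits under the COPY API's 2GB POST limit."""
    gz_path = path + '.gz'
    with open(path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        while True:
            block = src.read(1 << 20)  # 1MB blocks: no full-file load
            if not block:
                break
            dst.write(block)
    return gz_path, os.path.getsize(gz_path) <= limit_bytes
```

If the second return value is `False`, the dataset still needs to be split before uploading.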
Within CARTOframes itself, I'm not sure how the `to_carto()` method handles this 2GB limit for the dataframe (whether or not it compresses the data before uploading it to CARTO).
> does the 2GB hard limit on COPY mean there is no way to upload datasets larger than 2GB to CARTO?
COPY appends by default, so you can upload larger datasets as long as you split them properly.
> Or just that larger datasets need to be "chunked" into a series of smaller POST requests, none of which exceeds 2GB, as part of the `Transfer-Encoding: chunked` header upload process typically used with COPY?
Datasets travel compressed and with the `Transfer-Encoding` header. That allows streaming large files without needing to load them in memory (both in the client and the server).
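The streaming idea can be sketched as a generator that reads the source in fixed-size blocks and gzip-compresses on the fly, so neither side holds the whole dataset in memory. (Passing such a generator as `data=` to `requests` is what triggers `Transfer-Encoding: chunked`; the function below is an illustrative stand-in, not the SDK's actual implementation.)

```python
import gzip
import io

def gzip_stream(fileobj, block_size=1 << 20):
    """Yield gzip-compressed chunks of `fileobj`, reading one block at a
    time so the full file is never held in memory."""
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode='wb')
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        gz.write(block)
        if buf.tell():  # flush whatever gzip has emitted so far
            yield buf.getvalue()
            buf.seek(0)
            buf.truncate()
    gz.close()  # writes the gzip trailer
    if buf.tell():
        yield buf.getvalue()
```

Concatenating the yielded chunks reproduces a valid gzip stream, which is what the server decompresses on its side.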
The problem with this issue is there's a hard limit in the API, and it closes the connection after 2GB of data have been received.
The plan to fix this is:
There is also another approach using COPY BINARY, but that requires more research into how to generate the binary query.
Just saw where the 2 GB limit is documented: https://carto.com/developers/sql-api/guides/copy-queries/#limits
Thanks! I feel like I've known the COPY endpoint for a long time and yet never knew about this limit :)
Done. It will be available in the next stable release, 1.0.5.
The dataset is around 1.6GB including the geometry column. It failed to upload using `cc.write`. Here's the error message.