Closed rafatower closed 6 years ago
PS: what I'm looking forward is to receiving some early feedback, before going any further, if possible. Thanks!
@rafatower thanks for taking this one! ❤️
Looks good so far, adding here just some suggestions to simplify the usage:
copyfrom
and copyto
methods? I'm thinking in something like this:result = copyClient.copyfrom(query, 'copy_from.csv')
copyClient.copyto(query, 'export.csv')
Where the read and write happens inside the methods.
I don't want you to spend more time than necessary on this, so we can add this as a FR for next version:
We could make the query parameter optional for the copyfrom
endpoint and derive it from the csv itself. So anyone looking to upload a CSV shouldn't learn about the COPY SQL syntax.
You could just write something like:
result = copyClient.copyfrom('copy_from.csv')
# this will read the header of the csv, build the COPY SQL, open the file and call the `copyfrom` endpoint.
CopySqlClient.copyto
API inside the Dataset
resource.So to export a table you would write something like this:
DATASET_ID = "world_borders"
dataset_manager = DatasetManager(auth_client)
dataset = dataset_manager.get(DATASET_ID)
dataset.copyto("world_borders.csv") # This would export the dataset to "world_borders.csv". We have a name attribute in the `dataset` so the file name could be optional as well and just write `dataset.copyto()`
Two quick comments on your feedback:
But of course your comments re. usability are perfectly valid and I'll give a try at accommodating it.
Thanks for the prompt answer!
El vie., 27 jul. 2018 19:30, Alberto Romeu notifications@github.com escribió:
@rafatower https://github.com/rafatower thanks for taking this one! ❤️
Looks good so far, adding here just some suggestions to simplify the usage:
- Could not be cleaner to include the file handling inside the copyfrom and copyto methods? I'm thinking in something like this:
result = copyClient.copyfrom(query, 'copy_from.csv')
copyClient.copyto(query, 'export.csv')
Where the read and write happens inside the methods.
- For the shake of simplicity I would suggest two changes in the way both APIs are used.
I don't want you to spend more time than necessary on this, so we can add this as a FR for next version:
We could make the query parameter optional for the copyfrom endpoint and derive it from the csv itself. So anyone looking to upload a CSV shouldn't learn about the COPY SQL syntax.
You could just write something like:
result = copyClient.copyfrom('copy_from.csv')
this will read the header of the csv, build the COPY SQL, open the file and call the
copyfrom
endpoint.
- We could also wrap the CopySqlClient.copyto API inside the Dataset resource.
So to export a table you would write something like this:
DATASET_ID = "world_borders" dataset_manager = DatasetManager(auth_client) dataset = dataset_manager.get(DATASET_ID) dataset.copyto("world_borders.csv") # This would export the dataset to "world_borders.csv". We have a name attribute in the
dataset
so the file name could be optional as well and just writedataset.copyto()
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CartoDB/carto-python/pull/87#issuecomment-408486724, or mute the thread https://github.com/notifications/unsubscribe-auth/AB01pM0xfujNsSklsM-ZEZ-hj_RuiDjkks5uK04pgaJpZM4Vj34L .
I tested manually with a script similar to the one in the /examples
, but with a 1.2 GB input file.
The point here is that the peak memory allocated by the python script was less than 20 MB (Maximum resident set size (kbytes): 19176
).
So, it does not need to read the full CSV file in memory, but requests
reads from the file
object, and sends chunks through the network.
The way it works underneath (to me it was important to confirm this):
data
to the copyfrom
methodCopySQLClient
uses the APIKeyAuthClient
, that in turn uses the BaseAuthClient
from pyrestcli
self.client.send
is actually calling self.session.request
which uses the requests
libraryrequests.Session.requests
which calls to requests.Session.send
which delegates the call to an adapter, particularly the HTTPAdapter
As you can see, the only requirement for the data
is to be an interator of some sort. There's also a test for it here. The request.body
is populated here BTW.
I did further tests and found out that it is not really streaming. It is sending the Content-Length
header which disables the chunked transfer mode here.
This is a small piece of code but will need more time than I anticipated to get it right...
The Content-Length
header is set here: https://github.com/requests/requests/blob/5aeec8b661f992050b1d034e727d7ce0326e9a56/requests/models.py#L475-L500
(I'm figuring half of this by patching and debugging with pdb)
I think this is closer to what we should do: https://gist.github.com/nbari/7335384
but it is a bit tough to integrate in a nice, "safe" and "usable" way. I keep digging...
I committed some sample code of how to send a chunked file through request and our library:
It is meant to be eventually integrated in the sql module. But I'm having trouble. See the commit message: https://github.com/CartoDB/carto-python/pull/87/commits/3234f757bf01bb7c2cc217f7a966f2f742a7bf78 :cry:
With that particular file, I may have this issue:
$ file /home/rtorre/src/cartodb-tests/sql_copy_perf/tl_2014_census_tracts.csv
/home/rtorre/src/cartodb-tests/sql_copy_perf/tl_2014_census_tracts.csv: ASCII text, with very long lines, with CRLF, LF line terminators
ERROR: unquoted carriage return found in data
HINT: Use quoted CSV field to represent carriage return.
CONTEXT: COPY tl_2014_census_tracts, line 2
a mixture of line terminators.
PS: my fault, I appended the header manually. The trouble to mark here is that we should make sure clients are able to receive a meaninful error.
With a curl client I get the error clearly in headers and response body:
< HTTP/1.1 100 Continue
< HTTP/1.1 400 Bad Request
< Server: openresty
< Date: Thu, 02 Aug 2018 10:53:44 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 112
< Connection: keep-alive
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Headers: X-Requested-With, X-Prototype-Version, X-CSRF-Token
< vary: Authorization
< Content-Disposition: inline
< X-SQLAPI-Profiler: {"authorization":1,"getConnectionParams":1,"finish":1563,"total":1565}
< X-SQLAPI-Errors: {"hint":"Use quoted CSV field to represent carriage return ","statusCode":400,"message":"unquoted carriage return found in data"}
* HTTP error before end of send, stop sending
<
* Closing connection 0
{"error":["unquoted carriage return found in data"],"hint":"Use quoted CSV field to represent carriage return."}
shame on requests
.
This trick from stackoverflow fixes the issue with mixed newline styles:
tr -d '\15\32' < windows.txt > unix.txt
With the file fixed, now it works (this is for a 1.2 GB file):
01:00:19 PM - INFO - Dropping, re-creating and cartodbfy'ing the table through the Batch API...
01:00:21 PM - INFO - Table created and cartodbfy'ed
01:00:21 PM - INFO - COPY'ing the table FROM the file...
01:03:51 PM - INFO - {u'total_rows': 73915, u'time': 209.43}
01:03:51 PM - INFO - Table populated
:smile:
@javitonino and @juanignaciosl since you are the ones around I'm asking you for this review :pray:
Re. tests: most of the failing ones were broken before I touched this codebase :innocent: . But I created new E2E tests that do not work in travis because of AuthErrorException
and the like. I'd appreciate any help in fixing those (I don't think I have permissions to check user, api keys and the rest of secrets).
They work in local with any api_key with enough privileges, both with python 2 and 3:
(env) [rtorre@fireflies carto-python (copy-endpoints-support)]$ python --version
Python 3.5.2
(env) [rtorre@fireflies carto-python (copy-endpoints-support)]$ py.test tests/test_sql_copy.py
================================================================================================================================== test session starts ===================================================================================================================================
platform linux -- Python 3.5.2, pytest-3.7.1, py-1.5.4, pluggy-0.7.1
rootdir: /home/rtorre_unencrypted/src/carto-python, inifile:
plugins: requests-mock-1.5.2
collected 8 items
tests/test_sql_copy.py ........ [100%]
================================================================================================================================ 8 passed in 5.30 seconds ================================================================================================================================
(env2) [rtorre@fireflies carto-python (copy-endpoints-support)]$ python --version
Python 2.7.12
(env2) [rtorre@fireflies carto-python (copy-endpoints-support)]$ py.test tests/test_sql_copy.py
================================================================================================================================== test session starts ===================================================================================================================================
platform linux2 -- Python 2.7.12, pytest-3.7.1, py-1.5.4, pluggy-0.7.1
rootdir: /home/rtorre_unencrypted/src/carto-python, inifile:
plugins: requests-mock-1.5.2
collected 8 items
tests/test_sql_copy.py ........ [100%]
================================================================================================================================ 8 passed in 5.06 seconds ================================================================================================================================
About the other tests, I'm not really in a position of fixing them. Hopefully they fail because of the same perms issue.
My original intent was to release this branch this week.
Thanks in advance for the review and the help with the tests...
I got a look at the nginx logs and the API key being used was deleted or re-generated. I'll do my best to fix it, but it is a bit of a PITA without access to the travis secrets. Any help is appreciated.
I'll take a look at this later, but I don't have access to the build settings at Travis. Maybe @alberhander can help with that.
I got access and changed the master key in the config. In general tests do table creation/deletion and otherwise they get a 401 error. Let's see the results in the next round...
Modesty aside, tests are cream of E2E.
PS: BTW, nice work with the tests! :muscle:
Wonderful examples 👍
I think I covered all the comments (thanks!) and now this is ready for release.
@juanignaciosl I'd do it myself but I can't yet, if you could help me... :hugs:
@rafatower ok, let me know when you're around and we'll pair this.
Hey, @alrocar this is a basic implementation of a COPY API client. I tried to make it match other interfaces as much as I could.
I know there's a little bit of duplication and lack of automated tests, but would like you to take a quick look and let me know how it looks overall.
(Most likely tests will fail, but it is not my fault (yet). Check the build history. I took a look and tried to figure out some of the errors but decided to move forward for the sake of completing the task at hand. I think most of the failures are due to changes in private API's).
Related ticket: https://github.com/CartoDB/carto-python/issues/83