Copy endpoints support - Githubissues

rafatower commented 6 years ago

Hey, @alrocar this is a basic implementation of a COPY API client. I tried to make it match other interfaces as much as I could.

I know there's a little bit of duplication and lack of automated tests, but would like you to take a quick look and let me know how it looks overall.

(Most likely tests will fail, but it is not my fault (yet). Check the build history. I took a look and tried to figure out some of the errors but decided to move forward for the sake of completing the task at hand. I think most of the failures are due to changes in private API's).

rafatower commented 6 years ago

PS: what I'm looking forward is to receiving some early feedback, before going any further, if possible. Thanks!

alrocar commented 6 years ago

@rafatower thanks for taking this one! ❤️

Looks good so far, adding here just some suggestions to simplify the usage:

Could not be cleaner to include the file handling inside the copyfrom and copyto methods? I'm thinking in something like this:

result = copyClient.copyfrom(query, 'copy_from.csv')

copyClient.copyto(query, 'export.csv')

Where the read and write happens inside the methods.

For the shake of simplicity I would suggest two changes in the way both APIs are used.

I don't want you to spend more time than necessary on this, so we can add this as a FR for next version:

We could make the query parameter optional for the copyfrom endpoint and derive it from the csv itself. So anyone looking to upload a CSV shouldn't learn about the COPY SQL syntax.

You could just write something like:

result = copyClient.copyfrom('copy_from.csv')
# this will read the header of the csv, build the COPY SQL, open the file and call the `copyfrom` endpoint.

We could also wrap the CopySqlClient.copyto API inside the Dataset resource.

So to export a table you would write something like this:

DATASET_ID = "world_borders"
dataset_manager = DatasetManager(auth_client)
dataset = dataset_manager.get(DATASET_ID)
dataset.copyto("world_borders.csv") # This would export the dataset to "world_borders.csv". We have a name attribute in the `dataset` so the file name could be optional as well and just write  `dataset.copyto()`

rafatower commented 6 years ago

Two quick comments on your feedback:

I prefer to base the underlying API on streams/generators instead of always assuming a file. Think of the cases of transferring data from a foreign source or generating it
the query can be an arbitrary one. So instead of just exporting a table you could for instance export a subset or a calculation.

But of course your comments re. usability are perfectly valid and I'll give a try at accommodating it.

Thanks for the prompt answer!

El vie., 27 jul. 2018 19:30, Alberto Romeu notifications@github.com escribió:

@rafatower https://github.com/rafatower thanks for taking this one! ❤️

Looks good so far, adding here just some suggestions to simplify the usage:

Could not be cleaner to include the file handling inside the copyfrom and copyto methods? I'm thinking in something like this:

result = copyClient.copyfrom(query, 'copy_from.csv')

copyClient.copyto(query, 'export.csv')

Where the read and write happens inside the methods.

For the shake of simplicity I would suggest two changes in the way both APIs are used.

I don't want you to spend more time than necessary on this, so we can add this as a FR for next version:

We could make the query parameter optional for the copyfrom endpoint and derive it from the csv itself. So anyone looking to upload a CSV shouldn't learn about the COPY SQL syntax.

You could just write something like:

result = copyClient.copyfrom('copy_from.csv')

this will read the header of the csv, build the COPY SQL, open the file and call the copyfrom endpoint.

We could also wrap the CopySqlClient.copyto API inside the Dataset resource.

So to export a table you would write something like this:

DATASET_ID = "world_borders" dataset_manager = DatasetManager(auth_client) dataset = dataset_manager.get(DATASET_ID) dataset.copyto("world_borders.csv") # This would export the dataset to "world_borders.csv". We have a name attribute in the dataset so the file name could be optional as well and just write dataset.copyto()

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CartoDB/carto-python/pull/87#issuecomment-408486724, or mute the thread https://github.com/notifications/unsubscribe-auth/AB01pM0xfujNsSklsM-ZEZ-hj_RuiDjkks5uK04pgaJpZM4Vj34L .

rafatower commented 6 years ago

I tested manually with a script similar to the one in the /examples, but with a 1.2 GB input file.

execution logs

``` $ /usr/bin/time -v python copy_from_census_tracts.py 04:26:23 PM - INFO - Dropping, re-creating and cartodbfy'ing the table through the Batch API... 04:26:25 PM - INFO - Table created and cartodbfy'ed 04:26:25 PM - INFO - COPY'ing the table FROM the file... 04:29:25 PM - INFO - {u'total_rows': 73915, u'time': 179.017} 04:29:25 PM - INFO - Table populated Command being timed: "python copy_from_census_tracts.py" User time (seconds): 2.47 System time (seconds): 3.90 Percent of CPU this job got: 3% Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.34 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 19176 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 3382 Voluntary context switches: 1698 Involuntary context switches: 4128 Swaps: 0 File system inputs: 2360664 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ```

The point here is that the peak memory allocated by the python script was less than 20 MB (Maximum resident set size (kbytes): 19176).

So, it does not need to read the full CSV file in memory, but requests reads from the file object, and sends chunks through the network.

rafatower commented 6 years ago

The way it works underneath (to me it was important to confirm this):

The file object is passed as data to the copyfrom method
The CopySQLClient uses the APIKeyAuthClient, that in turn uses the BaseAuthClient from pyrestcli
The call to self.client.send is actually calling self.session.request which uses the requests library
That calls requests.Session.requests which calls to requests.Session.send which delegates the call to an adapter, particularly the HTTPAdapter
The data is transferred in chunks here

As you can see, the only requirement for the data is to be an interator of some sort. There's also a test for it here. The request.body is populated here BTW.

rafatower commented 6 years ago

I did further tests and found out that it is not really streaming. It is sending the Content-Length header which disables the chunked transfer mode here.

This is a small piece of code but will need more time than I anticipated to get it right...

rafatower commented 6 years ago

The Content-Length header is set here: https://github.com/requests/requests/blob/5aeec8b661f992050b1d034e727d7ce0326e9a56/requests/models.py#L475-L500

(I'm figuring half of this by patching and debugging with pdb)

I think this is closer to what we should do: https://gist.github.com/nbari/7335384

but it is a bit tough to integrate in a nice, "safe" and "usable" way. I keep digging...

rafatower commented 6 years ago

I committed some sample code of how to send a chunked file through request and our library:

https://github.com/CartoDB/carto-python/pull/87/commits/3234f757bf01bb7c2cc217f7a966f2f742a7bf78#diff-99118b2871e39aceb0f505e083401249R103

It is meant to be eventually integrated in the sql module. But I'm having trouble. See the commit message: https://github.com/CartoDB/carto-python/pull/87/commits/3234f757bf01bb7c2cc217f7a966f2f742a7bf78 :cry:

rafatower commented 6 years ago

With that particular file, I may have this issue:

$ file /home/rtorre/src/cartodb-tests/sql_copy_perf/tl_2014_census_tracts.csv
/home/rtorre/src/cartodb-tests/sql_copy_perf/tl_2014_census_tracts.csv: ASCII text, with very long lines, with CRLF, LF line terminators

ERROR:  unquoted carriage return found in data
HINT:  Use quoted CSV field to represent carriage return.
CONTEXT:  COPY tl_2014_census_tracts, line 2

a mixture of line terminators.

PS: my fault, I appended the header manually. The trouble to mark here is that we should make sure clients are able to receive a meaninful error.

rafatower commented 6 years ago

With a curl client I get the error clearly in headers and response body:

< HTTP/1.1 100 Continue
< HTTP/1.1 400 Bad Request
< Server: openresty
< Date: Thu, 02 Aug 2018 10:53:44 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 112
< Connection: keep-alive
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Headers: X-Requested-With, X-Prototype-Version, X-CSRF-Token
< vary: Authorization
< Content-Disposition: inline
< X-SQLAPI-Profiler: {"authorization":1,"getConnectionParams":1,"finish":1563,"total":1565}
< X-SQLAPI-Errors: {"hint":"Use quoted CSV field to represent carriage return ","statusCode":400,"message":"unquoted carriage return found in data"}
* HTTP error before end of send, stop sending
< 
* Closing connection 0
{"error":["unquoted carriage return found in data"],"hint":"Use quoted CSV field to represent carriage return."}

shame on requests.

rafatower commented 6 years ago

This trick from stackoverflow fixes the issue with mixed newline styles:

tr -d '\15\32' < windows.txt > unix.txt

With the file fixed, now it works (this is for a 1.2 GB file):

 01:00:19 PM - INFO - Dropping, re-creating and cartodbfy'ing the table through the Batch API...
 01:00:21 PM - INFO - Table created and cartodbfy'ed
 01:00:21 PM - INFO - COPY'ing the table FROM the file...
 01:03:51 PM - INFO - {u'total_rows': 73915, u'time': 209.43}
 01:03:51 PM - INFO - Table populated

:smile:

rafatower commented 6 years ago

@javitonino and @juanignaciosl since you are the ones around I'm asking you for this review :pray:

Re. tests: most of the failing ones were broken before I touched this codebase :innocent: . But I created new E2E tests that do not work in travis because of AuthErrorException and the like. I'd appreciate any help in fixing those (I don't think I have permissions to check user, api keys and the rest of secrets).

They work in local with any api_key with enough privileges, both with python 2 and 3:

(env) [rtorre@fireflies carto-python (copy-endpoints-support)]$ python --version
Python 3.5.2
(env) [rtorre@fireflies carto-python (copy-endpoints-support)]$ py.test  tests/test_sql_copy.py
================================================================================================================================== test session starts ===================================================================================================================================
platform linux -- Python 3.5.2, pytest-3.7.1, py-1.5.4, pluggy-0.7.1
rootdir: /home/rtorre_unencrypted/src/carto-python, inifile:
plugins: requests-mock-1.5.2
collected 8 items                                                                                                                                                                                                                                                                        

tests/test_sql_copy.py ........                                                                                                                                                                                                                                                    [100%]

================================================================================================================================ 8 passed in 5.30 seconds ================================================================================================================================

(env2) [rtorre@fireflies carto-python (copy-endpoints-support)]$ python --version
Python 2.7.12
(env2) [rtorre@fireflies carto-python (copy-endpoints-support)]$ py.test tests/test_sql_copy.py
================================================================================================================================== test session starts ===================================================================================================================================
platform linux2 -- Python 2.7.12, pytest-3.7.1, py-1.5.4, pluggy-0.7.1
rootdir: /home/rtorre_unencrypted/src/carto-python, inifile:
plugins: requests-mock-1.5.2
collected 8 items                                                                                                                                                                                                                                                                        

tests/test_sql_copy.py ........                                                                                                                                                                                                                                                    [100%]

================================================================================================================================ 8 passed in 5.06 seconds ================================================================================================================================

About the other tests, I'm not really in a position of fixing them. Hopefully they fail because of the same perms issue.

My original intent was to release this branch this week.

Thanks in advance for the review and the help with the tests...

rafatower commented 6 years ago

I got a look at the nginx logs and the API key being used was deleted or re-generated. I'll do my best to fix it, but it is a bit of a PITA without access to the travis secrets. Any help is appreciated.

juanignaciosl commented 6 years ago

I'll take a look at this later, but I don't have access to the build settings at Travis. Maybe @alberhander can help with that.

rafatower commented 6 years ago

I got access and changed the master key in the config. In general tests do table creation/deletion and otherwise they get a 401 error. Let's see the results in the next round...

rafatower commented 6 years ago

Modesty aside, tests are cream of E2E.

juanignaciosl commented 6 years ago

PS: BTW, nice work with the tests! :muscle:

pramsey commented 6 years ago

Wonderful examples 👍

rafatower commented 6 years ago

I think I covered all the comments (thanks!) and now this is ready for release.

@juanignaciosl I'd do it myself but I can't yet, if you could help me... :hugs:

juanignaciosl commented 6 years ago

@rafatower ok, let me know when you're around and we'll pair this.

CartoDB / carto-python

Copy endpoints support #87

this will read the header of the csv, build the COPY SQL, open the file and call the `copyfrom` endpoint.

CartoDB / carto-python

Copy endpoints support #87

this will read the header of the csv, build the COPY SQL, open the file and call the copyfrom endpoint.

this will read the header of the csv, build the COPY SQL, open the file and call the `copyfrom` endpoint.