CartoDB / carto-python

CARTO Python client
https://carto.com
BSD 3-Clause "New" or "Revised" License
154 stars 62 forks source link

Add support for copyto/copyfrom #83

Closed alrocar closed 6 years ago

rafatower commented 6 years ago

Working on this PR: https://github.com/CartoDB/carto-python/pull/87

rafatower commented 6 years ago

COPY endpoints support

I decided to write some lines about what I plan to do in order to give some background, scope this feature and justify the decisions behind the code.

Intro and background

We deployed to production a new feature aimed at efficient streaming of data to and from CARTO, by using the PostgreSQL COPY statement.

You can find more information about the feature and endpoints here.

One important thing to note is that the raison d'être of this feature is to provide a low level interface that does not get in the middle of the COPY and that let's clients stream data.

With it, we hope to give an answer to some of the pains that our customers and partners have experienced with other API’s:

  • low speed and lack of control over the process
  • inability to specify types, separators, encodings, etc.
  • need of intermediate formats or services to upload to CARTO

You can find more information in this and other tickets, describing the pains we tried to solve and why we chose to implement the COPY SQL statement.

Why implementing a client in the CARTO Python SDK?

As you can see in the documentation and examples, implementing a client for the COPY endpoints is relatively straightforward.

Nevertheless, we'd like to have a ready-to-use reference client that can be used from day zero to stream to and from CARTO, and that can be used as building block for other projects (think of CARTOFrames, for instance).

Requirements

IMHO these should be the only basic requirements:

I'm not opposed to higher level interfaces, but they should build on top of this simple and thin foundation.

FAQ

Q: Why not accept files as inputs?

A higher level python interface can accept files, but the building block should be the stream. Otherwise we'll miss one of the key points of it.

Q: Why not read the CSV header and generate the table out of it?

The CSV format does not contain enough information to accurately generate a table out of it, particularly types. That is what the Import Guessing tries to do with different degrees of success. Instead of guessing we want to have it explicit, which is what our users and partners demanded.

Q: Why the query has to be specified? could it be generated automatically?

In some cases it can be generated automatically, but we don't want to impose a limitation in the public interface. To give you some examples from the PostgreSQL COPY documentation: the file format is not restricted to csv, quoting, delimiters and encoding can be specified as part of the sentence, and the copyto can retrieve data from an arbitrary subquery.

Q: When you say stream, what are you talking about?

Conceptually: anything that can be streamed. At the network level: anything that can be sent over the wire in small bits through chunked transfer encoding. At a higher, more pythonic level, and talking more specifically about requests: anything that requests is able to POST in chunks. The following lay in that category: generators, iterators and streams (to be checked).

alrocar commented 6 years ago

Thanks @rafatower for the scoping 👍

I'm not opposed to higher level interfaces, but they should build on top of this simple and thin foundation.

Seems good to me. Then the current PR https://github.com/CartoDB/carto-python/pull/87 should be mostly fine. Let me know when you remove the [WIP] tag

rafatower commented 6 years ago

Alright. I'm targeting the next week to have everything ready and released. That's why I wanted to identify the MVP, so to speak. I'm focusing on making sure this thing actually streams (and will try to write some tests for it).

Then I'll try to add support for compression, time permitting, which is nice-to-have but very desirable. At least the interface should account for it in the first version, even if it means that it does not actually use it yet.

Thanks!