MagicStack / asyncpg

A fast PostgreSQL Database Client Library for Python/asyncio.
Apache License 2.0

pandas dataframe #17

Open kernelmachine opened 7 years ago

kernelmachine commented 7 years ago

How would you recommend converting Records to a pandas dataframe?

Also, what do you think about giving fetch the option of returning a dataframe instead of list? There might be performance concerns.

jettify commented 7 years ago

What is your use case? A sync driver + SQLAlchemy is usually more than sufficient for such things; you even get it for free: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html

1st1 commented 7 years ago

I'm also curious to hear about the use cases. Also, does Pandas provide an async API?

We can add an API to register a custom row-decoder callback (a function accepting bytes and returning a decoded row). That would make it possible for pandas to implement its own decoding into whatever data types it needs.
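
For context, present-day asyncpg exposes a per-type (rather than per-row) hook, Connection.set_type_codec, which lets the caller plug in their own Python decoder for a given PostgreSQL type. A minimal sketch of that existing mechanism, assuming a local test database (the DSN is a placeholder):

import asyncio
import json

import asyncpg

async def main():
    conn = await asyncpg.connect(dsn="postgresql://localhost/test")  # placeholder DSN
    # Register a custom Python-level codec for the built-in json type:
    # the decoder receives the text wire value and returns any Python object.
    await conn.set_type_codec(
        "json",
        schema="pg_catalog",
        encoder=json.dumps,
        decoder=json.loads,
    )
    row = await conn.fetchrow("""SELECT '{"a": 1}'::json AS doc""")
    print(row["doc"]["a"])  # the value arrives already decoded to a dict
    await conn.close()

asyncio.run(main())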

wesm commented 7 years ago

Efficiently decoding data to the exact memory representation that pandas requires is currently somewhat complicated. pandas uses NumPy arrays as its internal representation, but null handling adds some complexity.

The business logic for producing these arrays is best written in C/C++/Cython with limited involvement of the Python C API (only where numpy.object_ arrays are involved). I am happy to help advise on the implementation and do some code review. The C++ code in https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h can serve as a reference for what pandas expects in the NumPy arrays (it does not do timestamp conversion; depending on PG's datetime representation, you may be able to perform the conversion in C or leave it to pandas's internal timestamp methods).

pandas does not have any async APIs at the moment AFAIK
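
The null-handling wrinkle in concrete terms: classic NumPy-backed pandas cannot store a missing value in an integer column, so a NULL forces an upcast, while the newer nullable extension dtype keeps the integers and tracks validity separately. A small illustration:

import pandas as pd

# A NULL in an integer column: classic NumPy-backed pandas upcasts to float64
# and uses NaN as the missing-value marker.
s = pd.Series([1, 2, None])
print(s.dtype)          # float64

# The nullable extension dtype keeps int64 storage plus a separate validity mask.
s2 = pd.Series([1, 2, None], dtype="Int64")
print(s2.dtype)         # Int64
print(s2.isna().sum())  # 1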

wesm commented 7 years ago

As a possible added benefit for pandas users, it would be nice to have the option for string columns to be returned as pandas categorical dtype, which has significant performance benefits in analytics.
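
A quick way to see the effect today, converting an already-fetched string column to categorical (the column contents below are made up):

import pandas as pd

# A low-cardinality string column, as often comes back from enum-like DB fields.
s = pd.Series(["buy", "sell", "hold"] * 300_000)
cat = s.astype("category")

# Categorical stores each distinct string once plus compact integer codes,
# which usually means far less memory and faster group-bys and joins.
print(s.memory_usage(deep=True))    # object dtype: one Python string per row
print(cat.memory_usage(deep=True))  # category dtype: codes + 3 categories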

1st1 commented 7 years ago

@wesm I think we can build this functionality into asyncpg: instead of Records, we can return a list of columns (NumPy arrays, with the decoding semantics you've outlined). Would that be enough to integrate pandas?

Also, would it be possible to provide a benchmark that we can tweak to use asyncpg and work with?
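
To make the column-oriented idea concrete, a rough Python-level sketch: the per-column arrays are built with a Python loop here purely for illustration, whereas the proposal is to produce them during C-level decoding. The DSN and table name are placeholders.

import asyncio

import asyncpg
import numpy as np
import pandas as pd

async def main():
    conn = await asyncpg.connect(dsn="postgresql://localhost/test")  # placeholder
    rows = await conn.fetch("SELECT id, price FROM trades")          # placeholder table
    await conn.close()

    # Today: row-oriented Records, converted row by row in Python.
    df_rows = pd.DataFrame([dict(r) for r in rows])

    # Proposed shape of the result: one NumPy array per column, which pandas
    # can adopt directly without touching each row in Python.
    cols = {
        "id": np.fromiter((r["id"] for r in rows), dtype=np.int64, count=len(rows)),
        "price": np.fromiter((r["price"] for r in rows), dtype=np.float64, count=len(rows)),
    }
    df_cols = pd.DataFrame(cols)
    print(df_rows.dtypes, df_cols.dtypes, sep="\n")

asyncio.run(main())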

wesm commented 7 years ago

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)

1st1 commented 7 years ago

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

I actually wanted to make it an optional dependency. Let's say we hide this functionality behind an argument to the connect function (return_numpy_dataframes=False).

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

wesm commented 7 years ago

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

It sounds like what you need is a test suite, not benchmarks. Am I interpreting that right?

1st1 commented 7 years ago

It sounds like what you need is a test suite, not benchmarks. Am I interpreting that right?

Sorry, I should have clarified my request. I wanted to ask for a small script that uses one type (say int32), fetches some data from the DB, and performs some rudimentary calculation. We could then use that script to prototype the implementation and see how it compares to existing solutions. In any case, never mind, I think I can jot down a simple pandas script myself.
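
Something like the following would fit that description: a single int4 column, fetched with asyncpg and reduced with pandas (the DSN is a placeholder):

import asyncio

import asyncpg
import pandas as pd

async def main():
    conn = await asyncpg.connect(dsn="postgresql://localhost/test")  # placeholder DSN
    # One int4 column, one million rows, generated server-side.
    rows = await conn.fetch("SELECT generate_series(1, 1000000)::int4 AS n")
    await conn.close()

    df = pd.DataFrame([dict(r) for r in rows])
    # Rudimentary calculation on the fetched column.
    print(df["n"].sum(), df["n"].mean())

asyncio.run(main())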

mangecoeur commented 6 years ago

I was interested in attempting to implement this. I did some naive benchmarks using asyncpg instead of psycopg to load large tables and saw about a 3x speed improvement. I suspect it would be possible to load data into pandas even faster by generating NumPy arrays directly in the Cython layer. Any pointers on where to start poking around would be welcome; I am aiming to produce some simple benchmarks to start with, to see if it's worth pursuing lower-level integration (as opposed to just using the normal asyncpg interface).

(Edit: for context, I actually wrote quite a bit of the pandas SQLAlchemy-based SQL interface, but I was never too satisfied with its performance and type support.)
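
A sketch of that kind of naive comparison: timing pandas.read_sql over a synchronous SQLAlchemy engine against a plain asyncpg fetch plus DataFrame construction. The table name and DSN are placeholders, and the numbers should only be read as a rough signal.

import asyncio
import time

import asyncpg
import pandas as pd
from sqlalchemy import create_engine

QUERY = "SELECT * FROM events"        # placeholder table
DSN = "postgresql://localhost/test"   # placeholder DSN

def via_read_sql():
    engine = create_engine(DSN)
    t0 = time.perf_counter()
    df = pd.read_sql(QUERY, engine)
    return len(df), time.perf_counter() - t0

async def via_asyncpg():
    conn = await asyncpg.connect(dsn=DSN)
    t0 = time.perf_counter()
    rows = await conn.fetch(QUERY)
    df = pd.DataFrame([dict(r) for r in rows])
    elapsed = time.perf_counter() - t0
    await conn.close()
    return len(df), elapsed

print("read_sql:", via_read_sql())
print("asyncpg :", asyncio.run(via_asyncpg()))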

elprans commented 6 years ago

@mangecoeur Interface-wise, it would make the most sense to integrate with Connection.copy_*. Basically, the algorithm for copy_from_*_to_pandas would look something like this:

  1. Introspect the source relation to get the types of the attributes.
  2. Initialize the dataframe accordingly.
  3. Build a codec profile (see _ensure_rows_decoder), using decoder versions that produce pandas native format as output.
  4. Initiate COPY OUT ... (FORMAT binary), and decode the data chunks into the target dataframe.

Step 3 is the tricky part. For this whole thing to make sense performance-wise, the decoding pipeline must not dip into the interpreter loop; thus the decoders must be plugged in as C function pointers (maybe as a special argument to copy_*()).
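
A pure-Python approximation of this route, using asyncpg's existing Connection.copy_from_query with CSV output instead of binary, so it sidesteps the C-level codec work described above and leaves the parsing to pandas. The DSN and query are placeholders.

import asyncio
import io

import asyncpg
import pandas as pd

async def copy_query_to_dataframe(conn, query):
    buf = io.BytesIO()
    # COPY the query result out of PostgreSQL; asyncpg streams the chunks into buf.
    await conn.copy_from_query(query, output=buf, format="csv", header=True)
    buf.seek(0)
    return pd.read_csv(buf)

async def main():
    conn = await asyncpg.connect(dsn="postgresql://localhost/test")   # placeholder
    df = await copy_query_to_dataframe(conn, "SELECT * FROM events")  # placeholder
    await conn.close()
    print(df.head())

asyncio.run(main())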

wesm commented 6 years ago

From a data representation point of view, it may also be worth looking at Apache Arrow as an intermediary -- then you don't have to deal with all the weird pandas stuff and can just deal with strongly-typed nullable columns. The turbodbc folks have been using Arrow for data en route to/from pandas and that's been working very well. cc @mathmagique @xhochy
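
A rough Python-level sketch of that hand-off: the real win would come from filling the Arrow builders during C-level decoding, but here pyarrow is fed already-decoded rows just to show the typed, nullable columns and the final conversion to pandas. The DSN and query are placeholders.

import asyncio

import asyncpg
import pyarrow as pa

async def main():
    conn = await asyncpg.connect(dsn="postgresql://localhost/test")  # placeholder
    rows = await conn.fetch("SELECT id, name FROM events")           # placeholder
    await conn.close()

    # Strongly-typed, nullable Arrow columns; NULLs are tracked in validity bitmaps.
    table = pa.table({
        "id": pa.array([r["id"] for r in rows], type=pa.int64()),
        "name": pa.array([r["name"] for r in rows], type=pa.string()),
    })
    df = table.to_pandas()
    print(table.schema, df.dtypes, sep="\n")

asyncio.run(main())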

mangecoeur commented 6 years ago

Thanks for the pointers. I will try to implement some really simple cases just to see what the performance is like; if it's promising, I could follow @wesm's suggestion and integrate with Arrow (although turbodbc looks a bit baroque with respect to mixing C++ and Python, compared to Cython only).

(Edit: I notice that there is a WIP Cython API for Arrow.)

xhochy commented 6 years ago

The main benefit you will get from Arrow is simple construction of the columnar buffers using the Builder classes. The transformation from the Arrow structures to a pandas DataFrame is then taken care of on the Arrow side. As Arrow is structured more simply than pandas, the implementation is much simpler while still being very efficient.

API-wise, I have used Arrow with Cython, boost::python and pybind11. If you're fluent in C++ and Python, then pybind11 is the best option from my perspective, as you get the best of both worlds. Cython is the more natural choice when you come from a Python-only perspective, but it limits your use of C++, as not all C++ features are supported in Cython. Nevertheless, in Arrow we want to support all three options, and it should be simple to use with Cython. Always happy to help you get it working, independent of the wrapper ;)

deepers commented 4 years ago

I would definitely be interested in being able to return the results of a query as a sequence of NumPy arrays representing the columns of the result. I don't use pandas very much, but I do use NumPy, and I would love an API that allows the user to pass in column dtypes and null handling instructions.
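
A sketch of that kind of caller-controlled conversion, done in Python on top of fetch: the user supplies the dtype per column, and NULLs are surfaced through a mask rather than a sentinel value. The column and table names are placeholders.

import numpy as np

def records_to_columns(rows, dtypes):
    """Turn a list of asyncpg Records into {column name: masked NumPy array}."""
    out = {}
    for name, dtype in dtypes.items():
        values = [r[name] for r in rows]
        mask = [v is None for v in values]
        # Fill NULL slots with a dummy value; the mask marks them as missing.
        filled = [0 if v is None else v for v in values]
        out[name] = np.ma.masked_array(np.asarray(filled, dtype=dtype), mask=mask)
    return out

# Example (rows obtained via `rows = await conn.fetch(...)`):
# cols = records_to_columns(rows, {"id": np.int32, "price": np.float64})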

jedmitten commented 4 years ago

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])

foxx commented 4 years ago

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])

This didn't work for me, I had to use:

dfs = pd.DataFrame([dict(r.items()) for r in results ])
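
A variant of the same idea that skips building a dict per row, relying on Record's keys() and values() views (an untested sketch, reusing the res variable from the snippets above):

import pandas as pd

# res is the list of Records returned by pool.fetch(...)
columns = list(res[0].keys()) if res else []
df = pd.DataFrame([tuple(r.values()) for r in res], columns=columns)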

RiccardoGiro commented 3 years ago

Any news/updates on this enhancement?

0x0L commented 2 years ago

Hi all,

I tried to implement a naive version of @elprans proposal. Here's what I came up with: https://gist.github.com/0x0L/3d55d47f6329eb0d7e7a46d1c895b5be

A few remarks:

I don't know what the timings at the bottom are worth, but at least they don't look bad.

wesm commented 2 years ago

We haven't invested too much in the Cython API for the array builders, so using the C++ API (in Cython) would be the way to go (if Cython is the right implementation language).

I couldn't find a canonical way to get an array builder from a given datatype - which looks strange to me. Am I missing something? @wesm

We have the arrow::MakeBuilder function in C++ https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_base.h#L280

0x0L commented 2 years ago

@wesm Thanks! MakeBuilder is exactly what I was looking for. In terms of interop, a generic C++ library would probably be a much better candidate here, given the narrow and well-defined scope: convert PostgreSQL binary buffers to a Table or RecordBatch of a given schema. At least for primitive types, the schema translation from SQL to Arrow looks straightforward.

0x0L commented 2 years ago

I switched to C++. Writing Python bindings will be quite easy considering how simple the interface is.

It supports lists, composite types, and enums. It's still a WIP, but you can check it out at https://github.com/0x0L/pg2arrow

It basically tries to do the same as https://github.com/heterodb/pg2arrow. Since we're using the Arrow API, the code is much simpler.

I'm also looking for advice on how to improve my ugly C++ patterns.

wesm commented 2 years ago

hi @0x0L, great news. I'm interested to see if there is a way to create a small "C middleware library" in Apache Arrow that uses the C data interface to exchange data structures:

https://arrow.apache.org/docs/format/CDataInterface.html

The idea would be to have some C code that provides a minimal Arrow builder API along with a minimal C implementation of this data interface, so downstream applications don't necessarily need to use C++ or link to the Arrow C++ library.

cc @pitrou for thoughts

pitrou commented 2 years ago

I'd like to hear if people have a problem with using Arrow C++ here. Since only a minimal build of Arrow C++ should be required, you can probably easily link statically to libarrow.a, and there shouldn't be any additional dependencies.

0x0L commented 2 years ago

Hi all.

I finally got a product that should be usable enough for others to test. You can have a look at https://github.com/0x0L/pgeon

Any feedback or help would be greatly welcomed :)

vmarkovtsev commented 2 years ago

I also thought about this in my blog post and will try to hack up an alternative production-ready deserializer targeting NumPy recarrays. My requirements are:

Let's see how much of the original native asyncpg code will be left and how much faster it will be.
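
For readers unfamiliar with recarrays, a minimal illustration of the target representation, built here from ordinary fetch results in plain Python (the field names, dtypes, and sample values are placeholders):

import numpy as np

# rows would be the list of asyncpg Records from conn.fetch(...);
# three literal (id, price) tuples stand in for them here.
rows = [(1, 10.5), (2, 11.0), (3, 9.75)]

dtype = np.dtype([("id", np.int64), ("price", np.float64)])
arr = np.array([tuple(r) for r in rows], dtype=dtype)  # structured array, one field per column

rec = arr.view(np.recarray)
print(rec.id, rec.price)  # per-field access without per-row Python objects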

vmarkovtsev commented 2 years ago

Done. See the blog post mentioned above; repository: https://github.com/athenianco/asyncpg-rkt

As I wrote in the post, merging back would require some effort, so I want to check my options with the maintainers.

eirnym commented 2 years ago

I'd love to see it merged back, provided NumPy remains an optional dependency.

ale-dd commented 1 year ago

@vmarkovtsev it would be awesome if you were able to merge