I'm curious why you're re-batching in Python instead of having pyarrow do the batching for you?
Good point. On a columnar backend (Citus/Hydra) I observed a small speedup from re-batching versus sending a single Parquet batch per COPY. But honestly, I haven't benchmarked this rigorously, so it could be useless.
Another thing I wasn't sure about: what happens if you submit a huge Parquet file in a single COPY? I think it will lead to a huge WAL, and I'm not sure what the side effects of that are.
The example in the README uses pyarrow to stream batches, so you never hold the entire file in memory, and you can have pyarrow re-batch it to whatever number of rows best fits your database.
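Condensed, the README approach looks roughly like this (a sketch, assuming psycopg 3 and the `ArrowToPostgresBinaryEncoder` API shown in the README; the file path, connection string, and batch size are placeholders):

```python
import psycopg
import pyarrow.dataset as ds
from pgpq import ArrowToPostgresBinaryEncoder

# Works for a single Parquet file too -- no Hive-style partitioning required.
dataset = ds.dataset("data.parquet")

# The encoder derives the Postgres schema from the Arrow schema,
# so we can generate DDL for a staging table.
encoder = ArrowToPostgresBinaryEncoder(dataset.schema)
pg_schema = encoder.schema()
cols = [f'"{name}" {col.data_type.ddl()}' for name, col in pg_schema.columns]
ddl = f"CREATE TEMP TABLE data ({', '.join(cols)})"

with psycopg.connect("postgres://postgres:postgres@localhost/postgres") as conn:
    with conn.cursor() as cursor:
        cursor.execute(ddl)
        with cursor.copy("COPY data FROM STDIN WITH (FORMAT BINARY)") as copy:
            copy.write(encoder.write_header())
            # pyarrow does the re-batching: batch_size bounds how many rows
            # are held in memory at once, no Python-side slicing needed.
            for batch in dataset.to_batches(batch_size=64_000):
                copy.write(encoder.write_batch(batch))
            copy.write(encoder.finish())
```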
Oh, I actually didn't realize I could just use that even though I'm not using Hive-style Parquet datasets 🙈
How about the WAL growing too large though? Could that become a problem?
In any case, there's probably nothing useful that my recipe adds over your example. It would be nice to make a CLI tool from this at some point -- I imagine that getting Parquet into Postgres quickly is one of the primary uses of this library.
My recipe fails to recognize VARCHAR columns as such and always uses TEXT. Is that the case with your example as well?
> How about the WAL growing too large though? Could that become a problem?
I don't think so. I've used binary copies to load some pretty large datasets.
> My recipe fails to recognize VARCHAR columns as such and always uses TEXT. Is that the case with your example as well?
I'm not sure what you mean by this, but pgpq uses `TEXT` for `String` columns because `TEXT` is always the right choice in Postgres: `TEXT` and `VARCHAR` are the same type under the hood, and `VARCHAR(n)` only adds a length check.
> It would be nice to make a CLI tool from this at some point -- I imagine that getting Parquet into Postgres quickly is one of the primary uses of this library.
Agreed! I tried this at one point and ran out of time. Ideally it would be written in Rust so it can ship as a single binary without Python.
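Until then, a Python stopgap could be a thin argparse wrapper around the same loop. Everything below (flags, defaults) is made up for illustration, and it assumes the target table already exists with a compatible schema:

```python
import argparse

import psycopg
import pyarrow.dataset as ds
from pgpq import ArrowToPostgresBinaryEncoder


def main() -> None:
    parser = argparse.ArgumentParser(description="Bulk-load Parquet into Postgres")
    parser.add_argument("parquet_path")
    parser.add_argument("table")
    parser.add_argument("--dsn", default="postgres://localhost/postgres")
    parser.add_argument("--batch-size", type=int, default=64_000)
    args = parser.parse_args()

    dataset = ds.dataset(args.parquet_path)
    encoder = ArrowToPostgresBinaryEncoder(dataset.schema)

    with psycopg.connect(args.dsn) as conn, conn.cursor() as cursor:
        with cursor.copy(f'COPY "{args.table}" FROM STDIN WITH (FORMAT BINARY)') as copy:
            copy.write(encoder.write_header())
            for batch in dataset.to_batches(batch_size=args.batch_size):
                copy.write(encoder.write_batch(batch))
            copy.write(encoder.finish())


if __name__ == "__main__":
    main()
```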
> `TEXT` is always the right choice in Postgres
TIL! Not the case in most other DBMS.
> How about the WAL growing too large though? Could that become a problem?
>
> I don't think so. I've used binary copies to load some pretty large datasets.
In my case the WAL grew to a few tens of GB, but as long as you have enough disk space, maybe it's not a problem.
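For anyone who runs into this: a standard Postgres technique (not specific to pgpq) is to COPY into an UNLOGGED or TEMP staging table, which skips WAL for the bulk load itself; you only pay the WAL cost when moving rows into the final logged table. `staging` and `target` below are placeholder names:

```python
import psycopg

with psycopg.connect("postgres://localhost/postgres") as conn:
    with conn.cursor() as cursor:
        # Unlogged tables are not WAL-logged (and thus not crash-safe or
        # replicated), which makes them cheap targets for bulk loads.
        cursor.execute("CREATE UNLOGGED TABLE staging (LIKE target INCLUDING DEFAULTS)")
        # ... run the binary COPY from the example above against "staging" ...
        cursor.execute("INSERT INTO target SELECT * FROM staging")
        cursor.execute("DROP TABLE staging")
```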
Closing this because my recipe isn't any better than what's already provided. Maybe it's still useful for some other people.