Closed by DeflateAwning 6 months ago
I think they're right, but I don't think that's a hard limitation. Rewriting the entire Parquet file isn't that crazy, especially for small files (which are realistically the only ones you'd edit by hand). For example, we have a few ~100-row tables in our data lake that we update manually, and this would be perfect for them.
One proposed solution:
The Polars dataframe library is written in Rust and is intended mostly for use from Python. Given that, it probably makes the most sense to write the actual transformer in Rust to avoid pulling in extra languages. It's a very, very simple script.
Curious to hear what you think! I may be interested in helping develop this at some point if you're on board with it!
Any thoughts or follow-up? I'd be interested in leaving this issue open in case someone wants to come along and contribute!
Sounds reasonable! The only part I can't quite figure out is how to handle the fact that the SQL query could involve JOINs, renamed/calculated columns, GROUP BY, etc., which would make mapping edits back to the original file difficult. Curious about your thoughts on this!
You could use a SQL parser to check whether the query contains any JOINs, GROUP BYs, etc. (i.e., anything other than a plain selection from the original table), and simply block editing in that case. Anything is better than nothing, so rejecting every edge case is fine for now.
Here's an example of a viable parser, albeit in Python: https://github.com/tobymao/sqlglot
Regardless, perhaps we could re-open this issue as a potential feature for the future?
AFAIK the Parquet format was not designed to be mutated with individual edits. Each edit to the Parquet file would require re-generating the entire file, which is problematic for large files. As a result, I'm not planning to tackle editing unless anyone has ideas to make this more feasible. Closing for now.