adamviola / parquet-explorer

Explore Parquet files with SQL
MIT License
28 stars 5 forks source link

Feature Request: allow editing the parquet file by editing the cells in the viewer #13

Closed DeflateAwning closed 6 months ago

adamviola commented 6 months ago

AFAIK the Parquet format was not designed to be mutated with individual edits. Each edit to the Parquet file would require re-generating the entire file, which is problematic for large files. As a result, I'm not planning to tackle editing unless anyone has ideas to make this more feasible. Closing for now

DeflateAwning commented 6 months ago

I think they're right, but I don't think that's a hard limitation. Rewriting the entire Parquet isn't that crazy, especially for small parquets (which are realistically the only ones you'd edit by hand). For example, we have a few tables in our data lake that we update manually with 100 rows that this would be perfect for.

Proposed Solution

One proposed solution:

  1. Read the file like how you do currently. In the read, also add a "row_number" column in the query to store interally (from the entire file though, not the row number within the result set).
  2. Allow user to modify cells as desired.
  3. When the user saves, use the Polars dataframe library's streaming functionality to load the dataframe, perform the modification, and write the file to a new temporary file in the same directory as the original one.
  4. Rename the temporary file to the original filename, overwriting it.

The Polars dataframe library is written in Rust, and is intended mostly for use in Python. For this, it probably makes the most sense to write the actual transformer in Rust to avoid extra languages. It's a very, very simple script.

Implementation Notes

Curious to hear what you think! I may be interested in helping develop this at some point if you're onboard with it!

DeflateAwning commented 6 months ago

Any thoughts or follow-up? I'd be interested in leaving this issue open in case someone wants to come along and contribute!

adamviola commented 6 months ago

Sounds reasonable! The only part I can't quite figure out is how to handle the fact that the SQL query could involve JOINs, renamed/calculated columns, GROUP BY, etc., which would make mapping edits back to the original file difficult. Curious about your thoughts on this!

DeflateAwning commented 6 months ago

You could use a SQL parser to check and see if it contains any JOINs, GROUP BY, etc. (i.e., anything other than selecting from the original table), and just block editing if that's the case. Anything is better than nothing, so just reject any edge case for now.

Here's an example of a viable parser, albeit in Python: https://github.com/tobymao/sqlglot

Regardless, perhaps we could re-open this issue as a potential feature for the future?