Feature Request: allow editing the parquet file by editing the cells in the viewer

adamviola commented 6 months ago

AFAIK the Parquet format was not designed to be mutated with individual edits. Each edit to the Parquet file would require re-generating the entire file, which is problematic for large files. As a result, I'm not planning to tackle editing unless anyone has ideas to make this more feasible. Closing for now

DeflateAwning commented 6 months ago

I think they're right, but I don't think that's a hard limitation. Rewriting the entire Parquet isn't that crazy, especially for small parquets (which are realistically the only ones you'd edit by hand). For example, we have a few tables in our data lake that we update manually with 100 rows that this would be perfect for.

Proposed Solution

One proposed solution:

Read the file like how you do currently. In the read, also add a "row_number" column in the query to store interally (from the entire file though, not the row number within the result set).
Allow user to modify cells as desired.
When the user saves, use the Polars dataframe library's streaming functionality to load the dataframe, perform the modification, and write the file to a new temporary file in the same directory as the original one.
Rename the temporary file to the original filename, overwriting it.

The Polars dataframe library is written in Rust, and is intended mostly for use in Python. For this, it probably makes the most sense to write the actual transformer in Rust to avoid extra languages. It's a very, very simple script.

Implementation Notes

It's important to be able to edit a cell after a filter is applied. Thus, the entire file needs to be read/re-written when modified.
Only allow basic datatype modification for now (i.e., str, int, but not list, struct, etc.).
In the future, use dropdowns for enum/categorical columns.
Warn the user if saving a parqet with a size >100MB, probably.
Warn the user if there isn't enough disk space to save a duplicate of the file.
Warn the user if Polars doesn't support the same compression as the original.

Curious to hear what you think! I may be interested in helping develop this at some point if you're onboard with it!

DeflateAwning commented 6 months ago

Any thoughts or follow-up? I'd be interested in leaving this issue open in case someone wants to come along and contribute!

adamviola commented 6 months ago

Sounds reasonable! The only part I can't quite figure out is how to handle the fact that the SQL query could involve JOINs, renamed/calculated columns, GROUP BY, etc., which would make mapping edits back to the original file difficult. Curious about your thoughts on this!

DeflateAwning commented 6 months ago

You could use a SQL parser to check and see if it contains any JOINs, GROUP BY, etc. (i.e., anything other than selecting from the original table), and just block editing if that's the case. Anything is better than nothing, so just reject any edge case for now.

Here's an example of a viable parser, albeit in Python: https://github.com/tobymao/sqlglot

Regardless, perhaps we could re-open this issue as a potential feature for the future?

adamviola / parquet-explorer

Feature Request: allow editing the parquet file by editing the cells in the viewer #13

Proposed Solution

Implementation Notes