dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

Question - Save custom metadata information #343

Closed guilhermecgs closed 5 years ago

guilhermecgs commented 6 years ago

Hi Folks,

I have a pandas dataframe that I want to save in a single parquet file.

This dataframe has some custom attributes/metadata, such as ID, address, name, last name, and age, that are specific to my application.

Is there a clever way to persist this information alongside the actual data?

Do I need to override some method to do this?

martindurant commented 6 years ago

Function fastparquet.writer.make_metadata creates the global metadata and populates it with a canonical version of the pandas schema. In the line key_value_metadata=[meta], meta is a parquet_thrift.KeyValue(); you can add any additional key/value pairs to that list.
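
For concreteness, a sketch of building such a pair (the import path is an assumption: parquet_thrift is fastparquet's internal thrift binding and its location varies by version; the key and value here are made up):

from fastparquet.thrift_structures import parquet_thrift

extra = parquet_thrift.KeyValue()
extra.key = "my_app_id"      # illustrative key, specific to your application
extra.value = "12345"        # parquet key/value metadata values are strings
# appended to the key_value_metadata list, this ends up in the file footer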

guilhermecgs commented 6 years ago

ok, understood.. thanks for the info

What do you think would be the best approach?

  1. Submit a merge request to add a new parameter to the writer method that receives a function to create the custom metadata?

ex:

def write(filename, data, row_group_offsets=50000000, compression=None,
          file_scheme='simple', open_with=default_open, mkdirs=default_mkdirs,
          has_nulls=True, write_index=None, partition_on=[], fixed_text=None,
          append=False, object_encoding='infer', times='int64',
          _custom_make_metadatafn=None):

  2. Is there a way to save the file using the writer "as is", and then append the custom metadata in a second step? Maybe this:

foot_size = write_thrift(f, fmd)

martindurant commented 6 years ago

The second method is certainly possible; it is probably less work and would not be intrusive to the existing code. It isn't "clean", though, and becomes complicated in the multi-file situation.

The write function has ever so many parameters, but I guess yet another one would be OK. Why a function, though, rather than the metadata values up front, presumably as a dictionary?

2-5 commented 6 years ago

I also need this.

I first tried monkey-patching make_metadata, but that approach failed because the computed metadata is ignored when the file is opened in append mode; the metadata is just copied from the existing file.

Eventually I just lifted the code which writes the metadata:

from fastparquet import ParquetFile
# parquet_thrift is a fastparquet internal; this import path is what the
# version at the time used and may differ in later releases
from fastparquet.thrift_structures import parquet_thrift

# load the existing footer metadata and update the "index" entry in place,
# or add it if it is not there yet
fmd = ParquetFile(path).fmd
for metadata in fmd.key_value_metadata:
    if metadata.key == "index":
        metadata.value = content
        break
else:
    metadata = parquet_thrift.KeyValue()
    metadata.key = "index"
    metadata.value = content
    fmd.key_value_metadata.append(metadata)

# based on fastparquet/writer.py : write_simple
# write_thrift and MARKER (b"PAR1") are fastparquet internals; the exact
# import paths below are assumptions that may vary between versions
import struct
from fastparquet.thrift_structures import write_thrift
from fastparquet.writer import MARKER

with open(path, "rb+") as f:
    # a parquet file ends with: <thrift footer><4-byte length><"PAR1">
    f.seek(-8, 2)
    head_size = struct.unpack("<i", f.read(4))[0]
    f.seek(-(head_size + 8), 2)  # rewind to the start of the old footer
    foot_size = write_thrift(f, fmd)
    f.write(struct.pack("<i", foot_size))
    f.write(MARKER)
    f.truncate()  # drop stale bytes in case the new footer is shorter
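
A quick check that the rewrite worked, reusing the ParquetFile import from above ("index" and content are the names from the snippet):

pf = ParquetFile(path)
kv = {m.key: m.value for m in pf.fmd.key_value_metadata}
assert kv["index"] == content  # the custom entry survives the round trip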

I suggest adding another argument to the write function, user_metadata={}. There is a slight issue here: what to do if different metadata is passed when append=True than when the file was originally written. I suggest overwriting the old metadata keys with the new ones. This is required for my use case: I append an "index" metadata entry to the file, which documents the key range covered by each row group (which, surprisingly, parquet does not already provide, or at least I couldn't find it), and I need to update it every time I write a new row group, in a separate run.
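
For illustration, usage under the proposed API might look like the sketch below; user_metadata is hypothetical (it does not exist in fastparquet here) and the values are made up:

# user_metadata is NOT an existing fastparquet parameter; this only
# illustrates the proposal, including the overwrite-on-append behaviour
write(path, df, user_metadata={"index": "rg0:0-999"})
write(path, df2, append=True, user_metadata={"index": "rg0:0-999;rg1:1000-1999"})
# after the second call, the footer holds the updated value for "index"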

martindurant commented 6 years ago

Your method sounds reasonable, and I agree that in the case of append, updating the values with new ones sounds right. It may cause a mismatch between the metadata in the new files and the _metadata file versus the old files (in the multi-file case).

It sounds like you may be in the best position to produce a PR?

syagev commented 5 years ago

Was this closed on account of being implemented, or something else? Is there a suggested method for writing custom metadata without modifying fastparquet's code? If not, maybe keep this open and prioritize it? This seems to me like a useful feature of the parquet standard.

P.S. - thanks for the hard work on this project + integration with dask - tremendous value in both packages!

martindurant commented 5 years ago

This was closed as "unlikely to be implemented by me", but please, if you have some time, have a go. I do not imagine it would be hard, since we already directly manipulate the parquet metadata object for storing the pandas dataframe description.
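
For reference, fastparquet already uses this same mechanism for the pandas description: it is stored as a JSON string under the "pandas" key of key_value_metadata. A sketch of inspecting it on a file written from a dataframe:

from fastparquet import ParquetFile

pf = ParquetFile(path)
pandas_meta = next(
    (m.value for m in pf.fmd.key_value_metadata if m.key == "pandas"), None)
# pandas_meta is the JSON column/index description, or None if absent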