guilhermecgs closed this issue 5 years ago
The function fastparquet.writer.make_metadata creates the global metadata and populates it with a canonical version of the pandas schema. In the line key_value_metadata=[meta], meta is a parquet_thrift.KeyValue(); you can add any additional key/value pairs to that list.
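A minimal sketch of that, assuming the internals described above (make_metadata and parquet_thrift are fastparquet internals whose module layout varies across versions, df is your dataframe, and the key/value pair is only illustrative):

from fastparquet.writer import make_metadata, parquet_thrift

# Build the footer metadata for the dataframe, then append an extra entry.
fmd = make_metadata(df)
kv = parquet_thrift.KeyValue()
kv.key = "application_id"   # any string key
kv.value = "12345"          # values are stored as strings
fmd.key_value_metadata.append(kv)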
OK, understood. Thanks for the info.
What do you think would be the best approach? For example, either adding a hook parameter to write:

write(filename, data, row_group_offsets=50000000, compression=None, file_scheme='simple', open_with=default_open, mkdirs=default_mkdirs, has_nulls=True, write_index=None, partition_on=[], fixed_text=None, append=False, object_encoding='infer', times='int64', _custom_make_metadatafn=None)

or modifying fmd and writing the footer yourself, around:

foot_size = write_thrift(f, fmd)
The second method is certainly possible, probably less work and would not be intrusive to the existing code. It isn't "clean", though, and becomes complicated in the multi-file situation.
The write function has ever so many parameters, but I guess yet another one would be OK. Why a function, though, rather than the metadata values up front, presumably as a dictionary?
I also need this. I first tried monkey patching make_metadata, but that approach failed because the computed metadata is ignored if the file is opened in append mode: the metadata is just copied from the existing file.
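For reference, the attempt looked roughly like this (a sketch only, assuming fastparquet.writer exposes make_metadata and parquet_thrift, with content being the value to store):

import fastparquet.writer as fpw

_original_make_metadata = fpw.make_metadata

def _patched_make_metadata(data, **kwargs):
    # Build the normal footer metadata, then append the extra key/value pair.
    fmd = _original_make_metadata(data, **kwargs)
    kv = fpw.parquet_thrift.KeyValue()
    kv.key = "index"
    kv.value = content
    fmd.key_value_metadata.append(kv)
    return fmd

fpw.make_metadata = _patched_make_metadata
# Works for a fresh write, but with append=True the existing footer is reused
# and make_metadata is never called, so the patch has no effect.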
Eventually I just lifted the code which writes the metadata:
# path: the existing parquet file; content: the value to store under the "index" key.
import struct

from fastparquet import ParquetFile
from fastparquet.writer import MARKER, parquet_thrift, write_thrift  # internals; layout may vary by version

fmd = ParquetFile(path).fmd
for metadata in fmd.key_value_metadata:
    if metadata.key == "index":
        # Key already present (e.g. from a previous run): update it in place.
        metadata.value = content
        break
else:
    metadata = parquet_thrift.KeyValue()
    metadata.key = "index"
    metadata.value = content
    fmd.key_value_metadata.append(metadata)

# based on fastparquet/writer.py : write_simple
with open(path, "rb+") as f:
    # The file ends with <footer thrift> <4-byte footer length> <"PAR1">.
    f.seek(-8, 2)
    foot_size_old = struct.unpack("<i", f.read(4))[0]
    # Seek back to the start of the old footer and overwrite it.
    f.seek(-(foot_size_old + 8), 2)
    foot_size = write_thrift(f, fmd)
    f.write(struct.pack("<i", foot_size))
    f.write(MARKER)
    f.truncate()  # drop leftover bytes in case the new footer is shorter than the old one
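To confirm the rewrite took, the footer can be read back with fastparquet (path is the same file as above):

pf = ParquetFile(path)
# The custom "index" entry should appear alongside fastparquet's own entries.
print({kv.key: kv.value for kv in pf.fmd.key_value_metadata})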
I suggest adding another argument to the write function, user_metadata={}. There is a slight issue here: what to do if different metadata is passed when append=True than when the file was originally written. I suggest overwriting the old metadata keys with the new ones. This is required for my use case: I attach an "index" metadata entry to the file, which documents the key range covered by each row group (which surprisingly is not already provided by parquet, at least I couldn't find it), and I need to update it every time I write a new row group, in a separate run.
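A sketch of how that proposed argument might behave; user_metadata is not an existing write parameter here, and the filenames, dataframes and JSON index format are only illustrative:

# First run: write the file with an initial index entry.
write("data.parq", df_part1, user_metadata={"index": '{"rg0": [0, 49999]}'})

# Later run: append a row group; the "index" key is overwritten with the new value.
write("data.parq", df_part2, append=True,
      user_metadata={"index": '{"rg0": [0, 49999], "rg1": [50000, 99999]}'})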
Your method sounds reasonable, and I agree that in the case of append, updating the values with new ones sounds right. It may cause a mismatch between metadata in the new files and the _metadata file versus old files (in the multi-file case).
It sounds like you may be in the best position to produce a PR?
Was this closed on account of being implemented, or something else? Is there a suggested method to write custom metadata without modifying fastparquet's code? If not, maybe keep it open and prioritize? This seems to me like a useful feature of the parquet standard.
P.S. - thanks for the hard work on this project + integration with dask - tremendous value in both packages!
This was closed as "unlikely to be implemented by me", but please, if you have some time, have a go. I do not imagine it would be hard, since we already directly manipulate the parquet metadata object for storing the pandas dataframe description.
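For reference, that existing manipulation is visible in the footer itself: the pandas dataframe description is stored as an ordinary key/value entry under the key "pandas" (JSON encoded), so custom entries would sit right next to it. The filename below is just a placeholder:

import json
from fastparquet import ParquetFile

kv = {m.key: m.value for m in ParquetFile("out.parq").fmd.key_value_metadata}
pandas_description = json.loads(kv["pandas"])
print(pandas_description["columns"])  # per-column metadata written by fastparquet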
Hi Folks,
I have a pandas dataframe that I want to save in a single parquet file.
This dataframe has some custom attributes/metadata, like ID, address, name, last name, and age, that are specific to my application.
Is there a clever way to persist this information alongside the actual data?
Do I need to override some method to do this?