dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0
763 stars 175 forks

Can't read back a row saved as bytes in metadata (ok... maybe that is not the point of metadata...) #610

Closed yohplala closed 3 years ago

yohplala commented 3 years ago

Hi,

I would like to use the metadata to record the last-but-one row of my dataset (so that I can work on this row, which is special in my use case, without loading the whole dataset).

To do so, I thought I could use the custom metadata, and because it seems to accept bytes, I thought I would use numpy's tobytes() for the job. The writing step goes fine.

import os
import pandas as pd
from fastparquet import write, ParquetFile
import numpy as np

file = os.path.expanduser('~/Documents/code/data/test_fp')
df = pd.DataFrame({'a': [15, 16], 'b':[0.3, 0.8]})
# My very custom metadata
cm = {'row-2':df.iloc[-2].to_numpy().tobytes()}
write(file, df, custom_metadata=cm)

But then loading the metadata back does not go so well...

pf = ParquetFile(file)

Error message:

In [38]: pf = ParquetFile(file)
Traceback (most recent call last):

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet/api.py", line 172, in _parse_header
    fmd = read_thrift(f, parquet_thrift.FileMetaData)

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet/thrift_structures.py", line 25, in read_thrift
    obj.read(pin)

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", line 1932, in read
    iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 14: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "<ipython-input-38-bb7e08ad0d7f>", line 1, in <module>
    pf = ParquetFile(file)

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet/api.py", line 120, in __init__
    self._parse_header(f, verify)

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet/api.py", line 174, in _parse_header
    raise ParquetException('Metadata parse failed: %s' %

ParquetException: Metadata parse failed: /home/pierre/Documents/code/data/test_fp
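For what it's worth, the decode failure can be reproduced without fastparquet: the raw float64 bytes of the row are simply not valid UTF-8 (the offset assumes a little-endian machine, where 0xd3 is part of the encoding of 0.3):

```python
import numpy as np

# Serialized row [15.0, 0.3]: on a little-endian machine, byte 0xd3
# sits at offset 14 and cannot appear there in valid UTF-8.
raw = np.array([15.0, 0.3]).tobytes()
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xd3 in position 14 ...
```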

Do you know if there is perhaps an 'easy' way to get this working? Thanks in advance for your advice. Bests

martindurant commented 3 years ago

Although bytes seems to work, the key/values of the metadata are actually to be interpreted as (UTF8) strings. If you wanted to embed binary arrays, you would need to save as JSON list or base64-encode your binary.
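A minimal sketch of the base64 route, using numpy only (the array and dtype here are illustrative):

```python
import base64

import numpy as np

row = np.array([15.0, 0.3])
# base64-encode the raw bytes; the result is pure ASCII, hence a
# valid UTF-8 string for the key/value metadata.
encoded = base64.b64encode(row.tobytes()).decode("ascii")

# Round-trip: recover the array from the metadata string.
decoded = np.frombuffer(base64.b64decode(encoded), dtype=row.dtype)
assert (decoded == row).all()
```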

yohplala commented 3 years ago

> Although bytes seems to work, the key/values of the metadata are actually to be interpreted as (UTF8) strings. If you wanted to embed binary arrays, you would need to save as JSON list or base64-encode your binary.

Hello Martin, many thanks for your great help! Following your advice, I came across this SO answer, which I find just great: it allows adding any pickle-compatible object to the metadata.

My own code becomes:

import os
import pandas as pd
from fastparquet import write, ParquetFile
import pickle
import codecs

file = os.path.expanduser('~/Documents/code/data/test_fp')
# Dataset
ts = pd.DatetimeIndex([pd.Timestamp('2021/01/01 08:00'), pd.Timestamp('2021/01/01 07:00')])
td = [pd.Timedelta('0 days 20:00:00'), pd.Timedelta('0 days 45:00:00')]
df = pd.DataFrame({'a': [15.56798999e-10, 16], 'b':[0.3, 0.8], 'td': td, 'ts':ts})
# In [9]: df
# Out[9]:
#               a    b              td                  ts
# 0  1.556799e-09  0.3 0 days 20:00:00 2021-01-01 08:00:00
# 1  1.600000e+01  0.8 1 days 21:00:00 2021-01-01 07:00:00
# Encode last-but-one row and cache in metadata.
lbo_row = df.iloc[-2]
cm = {'cache': codecs.encode(pickle.dumps(lbo_row, protocol=pickle.HIGHEST_PROTOCOL), "base64").decode('latin1')}
write(file, df, custom_metadata=cm)

# Load data from cache, and check
pf = ParquetFile(file)
lbo_row = pf.key_value_metadata['cache']
lbo_row = pickle.loads(codecs.decode(lbo_row.encode('latin1'), "base64"))
# Check
lbo_row.equals(df.iloc[-2])

martindurant commented 3 years ago

That'll do it :) Note that base64 will produce characters which are all ASCII, so you could use encoding "ascii" instead of "latin1", but it won't make any material difference.
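A quick check of that point, using the same pickle/codecs round-trip as in the snippet above (the sample Series is illustrative):

```python
import codecs
import pickle

import pandas as pd

row = pd.Series({'a': 1.5, 'b': 0.8})
blob = codecs.encode(pickle.dumps(row), "base64")
# Every byte of base64 output is ASCII, so "ascii" and "latin1"
# decode it to the identical string.
assert blob.decode("ascii") == blob.decode("latin1")
```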