Closed yohplala closed 3 years ago
Although bytes seems to work, the keys/values of the metadata are actually meant to be interpreted as (UTF8) strings. If you wanted to embed binary arrays, you would need to save them as a JSON list or base64-encode your binary.
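(For the base64 route mentioned above, a minimal stdlib sketch; struct stands in for what numpy's tobytes() would give you:)

```python
import base64
import struct

# Raw binary buffer, standing in for numpy's tobytes() output.
values = [1.5, 2.5, 3.5]
raw = struct.pack(f'{len(values)}d', *values)

# base64 output is pure ASCII, so it is safe to store as a UTF-8 metadata value.
encoded = base64.b64encode(raw).decode('ascii')

# Round-trip: decode the metadata string back into the original floats.
restored = list(struct.unpack(f'{len(values)}d', base64.b64decode(encoded)))
assert restored == values
```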
Hello Martin, many thanks for your great help! Following your advice, I came across this SO answer, which I find just great; it allows adding any 'pickle-compatible' object to the metadata.
My own code becomes
import os
import pandas as pd
from fastparquet import write, ParquetFile
import pickle
import codecs
file = os.path.expanduser('~/Documents/code/data/test_fp')
# Dataset
ts = pd.DatetimeIndex([pd.Timestamp('2021/01/01 08:00'), pd.Timestamp('2021/01/01 07:00')])
td = [pd.Timedelta('0 days 20:00:00'), pd.Timedelta('0 days 45:00:00')]
df = pd.DataFrame({'a': [15.56798999e-10, 16], 'b':[0.3, 0.8], 'td': td, 'ts':ts})
In [9]: df
Out[9]:
a b td ts
0 1.556799e-09 0.3 0 days 20:00:00 2021-01-01 08:00:00
1 1.600000e+01 0.8 1 days 21:00:00 2021-01-01 07:00:00
# Encode last-but-one row and cache in metadata.
lbo_row = df.iloc[-2]
cm = {'cache': codecs.encode(pickle.dumps(lbo_row, protocol=pickle.HIGHEST_PROTOCOL), "base64").decode('latin1')}
write(file, df, custom_metadata=cm)
# Load data from cache, and check
pf = ParquetFile(file)
lbo_row = pf.key_value_metadata['cache']
lbo_row = pickle.loads(codecs.decode(lbo_row.encode('latin1'), "base64"))
# Check
lbo_row.equals(df.iloc[-2])
That'll do it :) Note that base64 will produce characters which are all ASCII, so you could use encoding "ascii" instead of "latin1", but it won't make any material difference.
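(A quick sanity check of that last point: every byte of base64 output falls in the ASCII range, so the two decodings produce identical strings.)

```python
import base64

# Encode some arbitrary binary; the base64 result contains only ASCII bytes.
payload = base64.b64encode(b'\x00\x01\xfe\xff')

# Hence 'ascii' and 'latin1' decode it to the same string.
assert payload.decode('ascii') == payload.decode('latin1')
```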
Hi,
I would like to use the metadata to record the last-but-one row of my dataset (to avoid loading the whole dataset just to work on this row, which is specific to my use case).
To do so, I thought I could use the metadata, and because it allows recording bytes, I thought I would use numpy's
tobytes()
for the job. The recording step is OK, but then the loading step of the metadata does not go very well...
Error message:
Do you know if there is perhaps an 'easy' way of getting this to work? Thanks in advance for your advice. Best
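(For context, a likely cause, sketched here with struct standing in for numpy's tobytes(): arbitrary binary is generally not valid UTF-8, so it cannot survive the string round-trip that the metadata imposes.)

```python
import struct

# A raw buffer like numpy's tobytes() would produce for two float64 values.
raw = struct.pack('2d', 1.5, 2.5)

# Parquet key/value metadata is treated as UTF-8 text; arbitrary binary is
# generally not valid UTF-8, so a plain decode fails.
try:
    raw.decode('utf-8')
    survived = True
except UnicodeDecodeError:
    survived = False
assert not survived
```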