daft-dev / daft

Render probabilistic graphical models using matplotlib
https://docs.daft-pgm.org
MIT License
675 stars 120 forks

Do you support writing row groups to an existing dataset? #179

Closed braindevices closed 9 months ago

braindevices commented 9 months ago

This is a very important feature when a big dataset has to be written out in fragments. I cannot find any information about it in your documentation.

A simple test shows it has the potential:

import daft
df_test1 = daft.from_pydict({'a': [1., 2], 'b': [{'k1': 0, 'k2':1}, {'k3': 'ok'}]})
df_test2 = daft.from_pydict({'a': [1., None], 'b': [{'k1': None}, {'k3': None}]})
df_test3 = daft.from_pydict({'a': [1., None], 'b': [{'k1': None}, {'k3': None}], 'c': [20, 12]})
pq_dir = "/tmp/test2.pq"

# write each frame to the same Parquet directory
for _df in [df_test1, df_test2, df_test3]:
    _df.write_parquet(pq_dir)

df = daft.read_parquet(pq_dir)
df.to_pydict()

However, you do not seem to support _common_metadata and _metadata, so no useful stats data is available.

Also, when the dataset is very large, we usually want to use a filter to limit reading to certain columns/rows based on the stats.

It seems that daft still lacks this kind of ability.
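
To make the stats-based filtering concrete, here is a minimal pure-Python sketch of the idea. It does not use daft or any Parquet library: `RowGroupStats` and `prune_row_groups` are hypothetical names, and the min/max values are made up purely for illustration. It only assumes that each row group records the minimum and maximum of a column, which is the kind of stats data mentioned above.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max stats of one column in one row group."""
    path: str
    min_value: float
    max_value: float


def prune_row_groups(stats, lower, upper):
    """Keep only row groups whose [min, max] range can overlap [lower, upper]."""
    return [s for s in stats if s.max_value >= lower and s.min_value <= upper]


# Illustrative values only, not read from any real dataset.
stats = [
    RowGroupStats("part-0.parquet", 0.0, 9.0),
    RowGroupStats("part-1.parquet", 10.0, 19.0),
    RowGroupStats("part-2.parquet", 20.0, 29.0),
]

# A filter such as 12 <= a <= 22 only needs the last two fragments.
selected = prune_row_groups(stats, lower=12.0, upper=22.0)
print([s.path for s in selected])
```

With the stats collected in one place, a reader can decide which fragments to open without touching the others; without them, every file has to be scanned.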

dfm commented 9 months ago

I think you're commenting on the wrong project! We make probabilistic graphical models around here.