braindevices · closed 12 months ago
This is a very important feature when we have to write out a big dataset in fragments. I cannot find any info about this in your documentation.
A simple test shows it has the potential:
import daft

df_test1 = daft.from_pydict({'a': [1., 2], 'b': [{'k1': 0, 'k2': 1}, {'k3': 'ok'}]})
df_test2 = daft.from_pydict({'a': [1., None], 'b': [{'k1': None}, {'k3': None}]})
df_test3 = daft.from_pydict({'a': [1., None], 'b': [{'k1': None}, {'k3': None}], 'c': [20, 12]})

pq_dir = "/tmp/test2.pq"
for _i, _df in enumerate([df_test1, df_test2, df_test3]):
    _df.write_parquet(pq_dir)

df = daft.read_parquet(pq_dir)
df.to_pydict()
However, you do not seem to support _common_metadata and _metadata, so there is no useful stats data.
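For reference, this is how those sidecar files are usually produced; a minimal sketch using pyarrow directly (the table and output directory are only illustrative, not daft's API):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1., 2.], 'c': [20, 12]})
root = '/tmp/test2_meta.pq'  # hypothetical output directory

collector = []  # accumulates FileMetaData for every fragment written
pq.write_to_dataset(table, root_path=root, metadata_collector=collector)

# _common_metadata carries only the schema shared by all fragments
pq.write_metadata(table.schema, f'{root}/_common_metadata')
# _metadata additionally merges the row-group statistics of every collected fragment
pq.write_metadata(table.schema, f'{root}/_metadata', metadata_collector=collector)

A reader that understands _metadata can then plan a scan from the row-group stats without opening every data file.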
Also, when the dataset is super big, we usually want to use a filter to limit reading to certain columns/rows based on those stats.
It seems like daft still lacks this kind of ability.
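To illustrate the kind of pruning I mean, a minimal sketch with pyarrow (shown only as an assumed reference pattern, reading the hypothetical directory from the sketch above, not daft's API):

import pyarrow.parquet as pq

table = pq.read_table(
    '/tmp/test2_meta.pq',
    columns=['a', 'c'],          # column projection: read only the columns needed
    filters=[('a', '>', 1.0)],   # row filter, pushed down using row-group min/max stats
)
print(table.to_pydict())

With _metadata present, such a reader can skip fragments whose statistics cannot match the filter at all.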
I think you're commenting on the wrong project! We make probabilistic graphical models around here.