sarahyurick opened 1 month ago
I believe the reason we have a custom read_json implementation is the ability to specify files_per_partition and combine multiple files into a single read_json call from cudf, which isn't supported in Dask DataFrame. Since parquet and a few other formats support many parameters out of the box, it makes sense to mimic Dask in the parquet case.
Right now, DocumentDataset has a couple of read_* functions: (1) (2) (3)
It would be good if these functions could support Dask's read_json and read_parquet parameters (there is no read_pickle function in Dask, but we can perhaps look to pandas for this). In addition, we can restructure our to_* functions as well.