Currently, in order to convert from a dask bag to a parquet file, we do the following (see the sketch after this list):
1. Write the bag to multiple JSON files.
2. Concatenate all JSON files into one ndjson file.
3. Use pyarrow to read the ndjson file and create a pyarrow table.
4. Write the pyarrow table to a parquet file.
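For reference, a minimal sketch of the current four-step pipeline. The file and directory names (json_out/, all.ndjson, output.parquet) and the example bag are illustrative assumptions, not the real pipeline's values:

```python
import glob
import json

import dask.bag as db
import pyarrow.json as paj
import pyarrow.parquet as pq

# Illustrative bag; in practice this comes from the real pipeline
bag = db.from_sequence([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], npartitions=2)

# 1. Write the bag to multiple JSON files (one file per partition, one record per line)
bag.map(json.dumps).to_textfiles("json_out/*.json")

# 2. Concatenate all JSON files into one ndjson file
with open("all.ndjson", "w") as out:
    for path in sorted(glob.glob("json_out/*.json")):
        with open(path) as f:
            content = f.read()
            out.write(content if content.endswith("\n") else content + "\n")

# 3. Read the ndjson file into a pyarrow table
table = paj.read_json("all.ndjson")

# 4. Write the table to a single parquet file
pq.write_table(table, "output.parquet")
```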
Using dask's chained built-in methods, bag.to_dataframe().to_parquet() can replace this whole process with a single line of code (sketched below).
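A minimal sketch of the one-line replacement, again with illustrative names. Note that to_dataframe infers column metadata from the first partition unless meta is passed explicitly:

```python
import dask.bag as db

bag = db.from_sequence([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], npartitions=2)

# Convert the bag to a dask DataFrame and write it straight to parquet.
# This produces a *directory* of part files, not a single parquet file.
bag.to_dataframe().to_parquet("output_parquet/")
```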
To check:
- How do the two processes compare in terms of run time? (See the timing sketch after this list.)
- The output of bag.to_dataframe().to_parquet() is a folder with possibly multiple parquet files inside (00.part, 01.part, etc.). The GCS upload and link to BQ should be altered accordingly if used.
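A sketch of both checks, assuming illustrative data and the output_parquet/ directory from above; the part-file naming and glob pattern assume dask's default local output, which can vary by version and engine:

```python
import glob
import time

import dask.bag as db

bag = db.from_sequence([{"id": i, "name": str(i)} for i in range(1000)], npartitions=4)

# Time the one-liner; wrap the four-step ndjson pipeline the same way
# to get a like-for-like comparison on the real data
start = time.perf_counter()
bag.to_dataframe().to_parquet("output_parquet/")
print(f"direct to_parquet: {time.perf_counter() - start:.1f}s")

# to_parquet writes a folder of part files, so a GCS upload or BQ load
# has to pick up every part rather than a single parquet file
for part in sorted(glob.glob("output_parquet/*.parquet")):
    print(part)
```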