Currently, in order to convert from a dask bag to a parquet file, we do the following (see the sketch after this list):
1. Write the bag to multiple JSON files.
2. Concatenate all JSON files into one ndjson file.
3. Use pyarrow to read the ndjson file and create a pyarrow table.
4. Write the pyarrow table to a parquet file.
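For reference, a minimal sketch of the current four-step pipeline. The file and directory names (json_out/, all.ndjson, output.parquet) and the example bag are illustrative assumptions, not the real pipeline's values:

```python
import glob
import json

import dask.bag as db
import pyarrow.json as paj
import pyarrow.parquet as pq

# Illustrative bag; in practice this comes from the real pipeline
bag = db.from_sequence([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], npartitions=2)

# 1. Write the bag to multiple JSON files (one file per partition, one record per line)
bag.map(json.dumps).to_textfiles("json_out/*.json")

# 2. Concatenate all JSON files into one ndjson file
with open("all.ndjson", "w") as out:
    for path in sorted(glob.glob("json_out/*.json")):
        with open(path) as f:
            content = f.read()
            out.write(content if content.endswith("\n") else content + "\n")

# 3. Read the ndjson file into a pyarrow table
table = paj.read_json("all.ndjson")

# 4. Write the table to a single parquet file
pq.write_table(table, "output.parquet")
```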
Using dask's chained built-in methods, bag.to_dataframe().to_parquet() can replace this whole process with a single line of code (sketched below).
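A minimal sketch of the one-line replacement, again with illustrative names. Note that to_dataframe infers column metadata from the first partition unless meta is passed explicitly:

```python
import dask.bag as db

bag = db.from_sequence([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], npartitions=2)

# Convert the bag to a dask DataFrame and write it straight to parquet.
# This produces a *directory* of part files, not a single parquet file.
bag.to_dataframe().to_parquet("output_parquet/")
```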
To check:
- How do the two processes compare in terms of run time? (See the timing sketch after this list.)
- The output of bag.to_dataframe().to_parquet() is a folder with possibly multiple parquet files inside (00.part, 01.part, etc.). The GCS upload and link to BQ should be altered accordingly if used.
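A sketch of both checks, assuming illustrative data and the output_parquet/ directory from above; the part-file naming and glob pattern assume dask's default local output, which can vary by version and engine:

```python
import glob
import time

import dask.bag as db

bag = db.from_sequence([{"id": i, "name": str(i)} for i in range(1000)], npartitions=4)

# Time the one-liner; wrap the four-step ndjson pipeline the same way
# to get a like-for-like comparison on the real data
start = time.perf_counter()
bag.to_dataframe().to_parquet("output_parquet/")
print(f"direct to_parquet: {time.perf_counter() - start:.1f}s")

# to_parquet writes a folder of part files, so a GCS upload or BQ load
# has to pick up every part rather than a single parquet file
for part in sorted(glob.glob("output_parquet/*.parquet")):
    print(part)
```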