dataverbinders / statline-bq

Library to fetch CBS open datasets into parquet and optionally load into Google Cloud Storage and BigQuery

use dask builtin methods to create parquet files #50

Closed · galamit86 closed 3 years ago

galamit86 commented 3 years ago

Currently, in order to convert from a dask bag to a parquet file, we (see the sketch after this list):

  1. Write the bag to multiple JSON files
  2. Concatenate all JSON files into one NDJSON file
  3. Use pyarrow to read the NDJSON file and create a pyarrow table
  4. Write the pyarrow table to a parquet file.
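For reference, a minimal, self-contained sketch of this multi-step flow. The sample records, file paths, and partition count are illustrative only, not taken from statline-bq:

```python
import json
import pathlib

import dask.bag as db
import pyarrow.json as pj
import pyarrow.parquet as pq

# Illustrative data; statline-bq builds its bags from CBS responses.
records = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
bag = db.from_sequence(records, npartitions=2)

# 1. Write the bag to multiple JSON files (one per partition).
bag.map(json.dumps).to_textfiles("tmp/part-*.json")

# 2. Concatenate all JSON files into one NDJSON file.
with open("tmp/combined.ndjson", "w") as out:
    for path in sorted(pathlib.Path("tmp").glob("part-*.json")):
        text = path.read_text()
        out.write(text if text.endswith("\n") else text + "\n")

# 3. Read the NDJSON file into a pyarrow table.
table = pj.read_json("tmp/combined.ndjson")

# 4. Write the pyarrow table to a single parquet file.
pq.write_table(table, "output.parquet")
```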

Using dask's chained built-in methods, bag.to_dataframe().to_parquet() can replace this entire process with a single line of code.
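A minimal sketch of the proposed replacement, under the same illustrative setup as above. Note that to_dataframe() may need an explicit meta argument if the schema cannot be inferred from the bag's first partition:

```python
import dask.bag as db

records = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
bag = db.from_sequence(records, npartitions=2)

# Writes a directory "output/" containing one parquet file per partition
# (requires pyarrow or fastparquet to be installed as the engine).
bag.to_dataframe().to_parquet("output/")
```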

To check:

  1. How the two processes compare in terms of runtime
  2. The output of bag.to_dataframe().to_parquet() is a folder, possibly with multiple parquet files inside (00.part, 01.part, etc.). The GCS upload and link to BQ should be altered accordingly if used (see the sketch after this list).
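A hedged sketch of what that adjustment could look like: upload every part file, then point BigQuery at a wildcard URI so a single load job still suffices. The bucket, dataset, and table names are placeholders, and the *.parquet glob assumes dask's default part-file naming, which may differ across dask versions:

```python
import pathlib

from google.cloud import bigquery, storage

local_dir = pathlib.Path("output")  # folder written by to_parquet()
bucket = storage.Client().bucket("my-bucket")  # placeholder bucket

# Upload every part file instead of a single parquet file.
for path in sorted(local_dir.glob("*.parquet")):
    bucket.blob(f"dataset/{path.name}").upload_from_filename(str(path))

# BigQuery accepts a wildcard URI, so one load job covers all part files.
job = bigquery.Client().load_table_from_uri(
    "gs://my-bucket/dataset/*.parquet",
    "my_project.my_dataset.my_table",  # placeholder table id
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET
    ),
)
job.result()  # wait for the load to finish
```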
galamit86 commented 3 years ago

Closed by #60.