Refactors tables_to_parquet to keep a small memory footprint: each page of 10K records (100K for v4) is written to disk, and the pages are then concatenated into a single Parquet file using a stream writer. Main changes are:
The Dask Bag object (representing a single table in a dataset) is now created by mapping url_to_ndjson, which returns a Path to an ndjson file, instead of the previous load_from_url, which returned a dict object.
Concatenation is done with pyarrow's ParquetWriter, so only one ndjson file at a time is loaded into memory, converted to Parquet, and appended to a single, separate output file.
Fixes a bug in the processing of v4 datasets (wrong pagination: per 10K instead of 100K in utils.generate_table_urls).
Updates config file
Updates README file
Technically:
Adds:
utils.get_schema_cbs()
Removes:
utils.get_odata()
utils.convert_table_to_parquet()
utils.get_from_meta()
Updates:
utils.load_from_url() into utils.url_to_ndjson()
utils.generate_table_urls() - bug fix
Additional functions with minor changes
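To make the pagination fix concrete, here is a small sketch of version-dependent page sizes. It is not the actual utils.generate_table_urls(): the function name paged_urls, its signature, and the use of the OData $skip and $top query options are assumptions made for illustration.

```python
from math import ceil

# Page sizes as described in this PR: 10K records for v3, 100K for v4.
PAGE_SIZE = {"v3": 10_000, "v4": 100_000}


def paged_urls(base_url, n_records, odata_version):
    """Build one URL per page (hypothetical helper, for illustration only)."""
    page_size = PAGE_SIZE[odata_version]
    n_pages = ceil(n_records / page_size)
    return [
        f"{base_url}?$skip={i * page_size}&$top={page_size}"
        for i in range(n_pages)
    ]
```

With a 10K step applied to a v4 table, the same 250,000 records would be split into 25 requests instead of 3, which is the mismatch the bug fix addresses.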
Not implemented:
v4 support for get_schema_cbs(): requesting the metadata URL returns a 406 error, which should be investigated (issue #59 opened). The current workaround uses the schema from the first page (since v4 pages hold 100K records each, and the long format means fewer columns, the chance of a schema error is much lower).
Full translation of OData types to pyarrow types (issue #61 opened).