Speed up dataset building

Ben-Hodgkiss commented 1 month ago

Overview

Build datasets using Parquet files (read using DuckDB). The code needs to be applied to base, remembering to incorporate information about both the fields and their datatypes (e.g. making NULL rather than). This will need manual testing before testing and deploying.

Pull Request (PR): https://github.com/digital-land/digital-land-python/pull/265

Tech Approach

- take the scratch Python files and convert them into a proper Package class structure
- update the dataset_create command to use Parquet and DuckDB
- create unit tests and integration tests
- check image with data management
- deploy updated code (preferably into the Airflow environment)

Acceptance Criteria/Tests

- the SQLite database created via this method is the same as the current one when viewed in Datasette; investigate any differences
- comparison of build speeds taken (in dev or prod, wherever it will be deployed)
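One way the first criterion could be checked is sketched below, using only the standard-library `sqlite3` module. The function names (`table_names`, `table_rows`, `diff_databases`, `make_db`) and the tiny `fact` table are hypothetical and only for illustration; the sketch assumes both databases use the same column order, and because it compares sets of rows it ignores duplicate rows.

```python
import sqlite3


def table_names(con):
    """Return the set of table names in a SQLite database."""
    rows = con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}


def table_rows(con, table):
    """Return all rows of a table as a set of tuples (duplicates collapse)."""
    return set(con.execute(f'SELECT * FROM "{table}"').fetchall())


def diff_databases(con_a, con_b):
    """Map each differing table to (rows only in A, rows only in B)."""
    names_a, names_b = table_names(con_a), table_names(con_b)
    report = {}
    for table in sorted(names_a | names_b):
        rows_a = table_rows(con_a, table) if table in names_a else set()
        rows_b = table_rows(con_b, table) if table in names_b else set()
        if rows_a != rows_b:
            report[table] = (rows_a - rows_b, rows_b - rows_a)
    return report


def make_db(rows):
    """Build a small in-memory database standing in for a real dataset file."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fact (entity INTEGER, value TEXT)")
    con.executemany("INSERT INTO fact VALUES (?, ?)", rows)
    return con


old_db = make_db([(1, "a"), (2, "b")])  # stands in for the SQLite-built dataset
new_db = make_db([(1, "a"), (2, "c")])  # stands in for the DuckDB/Parquet-built dataset
report = diff_databases(old_db, new_db)
print(report)
```

For real dataset files, `sqlite3.connect(path)` would replace `make_db`; an empty report would mean every table holds the same set of rows.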

Resourcing & Dependencies

This could impact the data management environment, as the following packages are required:
- parquet
- arrow
- pyarrow
- duckdb
- Might also require click
Data engineer; could need support from DevOps.

alexglasertpx commented 2 weeks ago

Currently about 15 geometries differ between the SQLite code and the DuckDB/Parquet code. Some are due to minor differences at the eighth decimal place, but most are very different, because there are multiple different geometries for the same, e.g., priority, entity and entry_date. Currently investigating how SQLite and DuckDB choose which geometry to use.
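To separate the rounding-level differences from the genuinely different geometries, a comparison with a small tolerance could help. The sketch below is not taken from the PR: it assumes the geometries are WKT strings, and the `wkt_close` helper and the `1e-7` tolerance are hypothetical choices. It requires the non-numeric parts of the two strings to match exactly (including whitespace) and then compares coordinates pairwise.

```python
import re

# Matches integers, decimals and exponent notation inside a WKT string.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def wkt_close(a, b, tol=1e-7):
    """True if two WKT strings have the same shape and near-equal coordinates."""
    # Replace every number with a placeholder; the remaining "skeleton"
    # (geometry type, brackets, commas, spacing) must be identical.
    if NUMBER.sub("#", a) != NUMBER.sub("#", b):
        return False
    nums_a = [float(x) for x in NUMBER.findall(a)]
    nums_b = [float(x) for x in NUMBER.findall(b)]
    return all(abs(x - y) <= tol for x, y in zip(nums_a, nums_b))


# Differs only at the eighth decimal place: treated as the same geometry.
rounding_only = wkt_close("POINT (0.12345678 51.5)", "POINT (0.12345679 51.5)")

# A genuinely different coordinate: reported as different.
really_different = wkt_close("POINT (0.1 51.5)", "POINT (0.2 51.5)")

print(rounding_only, really_different)
```

Geometries that fail this check are the ones worth investigating for the choice between multiple candidate facts.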

alexglasertpx commented 1 week ago

The problem can be broken down into two issues: different facts from the same resource, and the same facts from different resources. For the former, it appears that DuckDB took a fact at random (which sometimes matched the one chosen by SQLite, so the number of differences changed on each run). We have now added further sorting, applied in the following order:

Highest priority
Latest entry date
Highest entry number
First resource alphabetically
First fact alphabetically

This means a few geometries still differ from those produced by the SQLite method; however, the geometries now selected have higher entry numbers, so the differences can be justified.
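The sort order listed above could be expressed as a window function, roughly as sketched below. This is an illustration, not the code from the PR: the `fact_resource` table, its columns and the `pick_facts` helper are hypothetical, "highest priority" is assumed to mean the largest number, and entry dates are assumed to be ISO-formatted strings so that they sort chronologically. The sketch runs the query through the standard-library `sqlite3` module (SQLite 3.25 or later, which supports window functions) to stay self-contained; `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` is also available in DuckDB.

```python
import sqlite3

# Rank candidate facts per (entity, field) and keep only the top-ranked one.
TIE_BREAK_SQL = """
SELECT entity, field, value
FROM (
    SELECT
        entity, field, value,
        ROW_NUMBER() OVER (
            PARTITION BY entity, field
            ORDER BY priority DESC,      -- highest priority
                     entry_date DESC,    -- latest entry date
                     entry_number DESC,  -- highest entry number
                     resource ASC,       -- first resource alphabetically
                     fact ASC            -- first fact alphabetically
        ) AS rn
    FROM fact_resource
) AS ranked
WHERE rn = 1
ORDER BY entity, field
"""


def pick_facts(rows):
    """Load candidate facts into memory and return one value per entity/field."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE fact_resource (entity INTEGER, field TEXT, value TEXT, "
        "priority INTEGER, entry_date TEXT, entry_number INTEGER, "
        "resource TEXT, fact TEXT)"
    )
    con.executemany("INSERT INTO fact_resource VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(TIE_BREAK_SQL).fetchall()
    con.close()
    return result


rows = [
    # entity, field, value, priority, entry_date, entry_number, resource, fact
    (1, "geometry", "POINT(0 0)", 1, "2024-01-01", 1, "res-b", "fact-a"),
    (1, "geometry", "POINT(1 1)", 1, "2024-01-01", 2, "res-a", "fact-b"),
    (1, "geometry", "POINT(2 2)", 1, "2023-12-31", 9, "res-a", "fact-c"),
    (2, "geometry", "POINT(3 3)", 2, "2020-01-01", 1, "res-z", "fact-d"),
    (2, "geometry", "POINT(4 4)", 1, "2024-01-01", 5, "res-a", "fact-e"),
]
selected = pick_facts(rows)
print(selected)
```

Because every column in the `ORDER BY` is spelled out, ties are broken the same way on every run, which is what removes the run-to-run variation in the number of differences.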

alexglasertpx commented 6 days ago

Added acceptance and integration tests.

digital-land / technical-documentation

Speed up dataset building #128