digital-land / technical-documentation

Technical Documentation for the planning data service.
https://digital-land.github.io/technical-documentation/index.html

Speed up dataset building #128

Open Ben-Hodgkiss opened 1 month ago

Ben-Hodgkiss commented 1 month ago

Overview

Build datasets using Parquet files (read using DuckDB). The code needs to be applied to the base, remembering to incorporate information about both the fields and their datatypes (e.g. making NULL rather than). It will need to be tested manually before further testing and deployment.

Pull Request(PR): https://github.com/digital-land/digital-land-python/pull/265

Tech Approach

A bullet-pointed list with details on how this could be technically worked.

Acceptance Criteria/Tests

Resourcing & Dependencies

alexglasertpx commented 2 weeks ago

Currently there are about 15 geometries that differ between the SQLite code and the DuckDB/Parquet code. Some are due to minor differences at the eighth decimal place, but most are very different, because there are multiple different geometries for the same key values (e.g. priority, entity, entry_date). Currently investigating how SQLite and DuckDB choose which geometry to use.
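One way to separate the rounding-level differences from the genuinely different geometries is to compare coordinates with a small tolerance. The following is only a sketch using the Python standard library: the function names, the assumption that geometries are held as WKT strings, and the tolerance of 1e-7 are all illustrative and are not taken from the pull request.

```python
import math
import re

# Matches plain and scientific-notation numbers inside a WKT string.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def coordinates(wkt):
    """Return every number in a WKT string, in the order it appears."""
    return [float(n) for n in NUMBER.findall(wkt)]


def geometry_type(wkt):
    """Return the text before the first bracket, e.g. 'POINT'."""
    return wkt.split("(", 1)[0].strip().upper()


def same_within_tolerance(wkt_a, wkt_b, tol=1e-7):
    """True if both geometries have the same type, the same number of
    coordinates, and each pair of coordinates differs by at most tol.
    tol is a hypothetical threshold, not a value from the project."""
    a, b = coordinates(wkt_a), coordinates(wkt_b)
    return (
        geometry_type(wkt_a) == geometry_type(wkt_b)
        and len(a) == len(b)
        and all(math.isclose(x, y, rel_tol=0.0, abs_tol=tol) for x, y in zip(a, b))
    )


# Made-up example values, not data from the datasets being compared.
sqlite_geom = "POINT (-0.12345678 51.50000001)"
duckdb_geom = "POINT (-0.12345679 51.50000002)"
other_geom = "POINT (-0.2 51.5)"

rounding_only = same_within_tolerance(sqlite_geom, duckdb_geom)
really_different = not same_within_tolerance(sqlite_geom, other_geom)
```

Coordinates are compared in order, so the same shape written with a different starting vertex would still be reported as different. A geometry library would be needed for a true shape comparison.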

alexglasertpx commented 1 week ago

The problem can be broken down into two issues: different facts from the same resource, and the same facts from different resources. For the former, it appears that DuckDB took a fact at random (which sometimes matched the one from SQLite, so the number of differences changed on each run). We have now added a further sort, applied in the following order:

  1. Highest priority
  2. Latest entry date
  3. Highest entry number
  4. First resource alphabetically
  5. First fact alphabetically

This means there are still a few differences from the geometries produced by the SQLite method; however, the differing geometries have higher entry numbers, so the differences can be justified.
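The five-level sort order listed above can be sketched in plain Python to show why it removes the run-to-run variation. This is an illustration, not the code in the pull request: the row layout, the `choose_facts` name and the sample values are hypothetical, and it assumes that a larger `priority` number means higher priority and that `entry_date` is an ISO-format string.

```python
from itertools import groupby
from operator import itemgetter


def choose_facts(rows):
    """Pick one row per (entity, field) using the five tie-breakers in order."""
    # Rules 4 and 5: first resource, then first fact, alphabetically.
    ordered = sorted(rows, key=itemgetter("resource", "fact"))
    # Rules 1 to 3: highest priority, latest entry date, highest entry number.
    # Python's sort is stable (also with reverse=True), so rows that tie here
    # keep the alphabetical order from the previous pass.
    ordered.sort(
        key=itemgetter("priority", "entry_date", "entry_number"), reverse=True
    )
    # Bring rows for the same entity and field together, again stably.
    key = itemgetter("entity", "field")
    ordered.sort(key=key)
    # The first row of each group is the winner.
    return [next(group) for _, group in groupby(ordered, key=key)]


# Made-up rows: three candidate geometries for the same entity.
rows = [
    {"entity": 1, "field": "geometry", "priority": 1, "entry_date": "2024-01-01",
     "entry_number": 3, "resource": "bbb", "fact": "f2", "value": "POINT (1 1)"},
    {"entity": 1, "field": "geometry", "priority": 1, "entry_date": "2024-01-01",
     "entry_number": 7, "resource": "aaa", "fact": "f9", "value": "POINT (2 2)"},
    {"entity": 1, "field": "geometry", "priority": 1, "entry_date": "2023-06-01",
     "entry_number": 9, "resource": "aaa", "fact": "f1", "value": "POINT (3 3)"},
]

winners = choose_facts(rows)
```

Because every tie is broken by an explicit rule, the result does not depend on the order in which the rows arrive, which is the property that was missing when a fact appeared to be taken at random.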

alexglasertpx commented 6 days ago

Added acceptance and integration tests.