Open Ben-Hodgkiss opened 3 weeks ago
Currently there are about 15 geometries that differ between the SQLite code and the DuckDB/Parquet code. Some are due to minor differences at the eighth decimal place, but most are very different, because there are multiple different geometries for the same, e.g., priority, entity, entry_date. Currently investigating how SQLite and DuckDB choose which geometry to use.
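To separate the rounding-level differences from the genuinely different geometries, a comparison with a small tolerance might help. The following is only a sketch: it assumes the geometries are stored as WKT strings with identical spacing, and the helper name `wkt_close` and the `1e-7` tolerance are hypothetical choices, not part of the existing code.

```python
import math
import re

# Matches plain and scientific-notation numbers inside a WKT string.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def wkt_close(a: str, b: str, abs_tol: float = 1e-7) -> bool:
    """Return True if two WKT strings differ only by small coordinate noise.

    Hypothetical helper: the tolerance is an assumption chosen to absorb
    differences at the eighth decimal place.
    """
    # The non-numeric skeleton (geometry type, brackets, commas, spaces)
    # must match exactly, which also guarantees equal coordinate counts.
    if _NUMBER.sub("#", a) != _NUMBER.sub("#", b):
        return False
    nums_a = [float(n) for n in _NUMBER.findall(a)]
    nums_b = [float(n) for n in _NUMBER.findall(b)]
    return all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol)
        for x, y in zip(nums_a, nums_b)
    )


sqlite_geom = "POINT (-0.12345678 51.50000001)"
duckdb_geom = "POINT (-0.12345679 51.50000002)"
other_geom = "POINT (-0.12 51.5)"

rounding_only = wkt_close(sqlite_geom, duckdb_geom)
really_different = not wkt_close(sqlite_geom, other_geom)
```

Pairs that pass this check can be set aside, leaving only the geometries where the two code paths picked different facts.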
The problem can be broken down into two issues: different facts from the same resource, and the same facts from different resources. For the former, it appears that DuckDB has taken a fact at random (which sometimes matched the one from SQLite, so the number of differences changed on each run). We have now added a further sort, as follows:
This has meant that a few geometries still differ from those produced by the SQLite method; however, they have higher entry numbers, so the differences can be justified.
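As an illustration of how a fully specified sort removes the run-to-run variation, here is a minimal sketch. The sort keys actually added are not listed in this issue, so the `fact` table, its columns, the direction of `priority` (lower wins here), and the final tie-break on `value` are all assumptions. The query is run with Python's built-in `sqlite3` so the example is self-contained; the window function is standard SQL, so the same statement should also be usable from DuckDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, simplified fact table; the real schema may differ.
con.execute(
    """
    CREATE TABLE fact (
        entity INTEGER, field TEXT, value TEXT,
        priority INTEGER, entry_date TEXT, entry_number INTEGER
    )
    """
)
con.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Two geometries for entity 1 that tie on priority and entry_date.
        (1, "geometry", "POINT (0 0)", 1, "2024-01-01", 3),
        (1, "geometry", "POINT (1 1)", 1, "2024-01-01", 7),
        (2, "geometry", "POINT (2 2)", 1, "2024-02-01", 1),
    ],
)

# Every column that can break a tie appears in ORDER BY, so the engine
# never has to pick arbitrarily between otherwise equal rows.
LATEST_FACT_SQL = """
SELECT entity, field, value
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY entity, field
               ORDER BY priority, entry_date DESC, entry_number DESC, value
           ) AS rn
    FROM fact
) AS ranked
WHERE rn = 1
ORDER BY entity, field
"""

latest = con.execute(LATEST_FACT_SQL).fetchall()
```

With these rows, entity 1 resolves to the fact with the higher entry number on every run, because the tie on `priority` and `entry_date` is broken explicitly rather than left to the engine.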
Overview
Building data sets using Parquet files (read using DuckDB). Need to apply code to base, remembering to incorporate information about both the fields and their datatypes (e.g. making NULL rather than). Will need to test manually before testing and deploying.
Pull Request(PR):
Tech Approach
A bullet-pointed list with details on how this could be technically worked.
Package
class structuredataset_create
command to use Parquet and DuckDB
Acceptance Criteria/Tests
Resourcing & Dependencies