OvertureMaps / schema

Overture Maps Schema
http://docs.overturemaps.org/schema
Creative Commons Attribution 4.0 International
143 stars 6 forks source link

Not declared in schema `division_ids`, `is_disputed`, etc shows up in the data schema, missing `division_id` in divisions #298

Open Azbesciak opened 1 week ago

Azbesciak commented 1 week ago

Hello, the mentioned field division_ids and is_disputed is declared only for type division_boundary https://github.com/OvertureMaps/schema/blob/5b33a0236e389793e2dc5a8cd55725a1feb2ec2f/schema/divisions/division_boundary.yaml#L33

Also looks like there is a division_id field, but it is not declared in the divisions.yml schema However turns out it also is present in division and division_areas, even if not declared there. Please fix the schema. As you see in edits history there are other cases (as I see later also perspectives is missing in areas definition, for instance) Runtime schema for Division

root
 |-- id: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- country: string (nullable = true)
 |-- sources: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- property: string (nullable = true)
 |    |    |-- dataset: string (nullable = true)
 |    |    |-- record_id: string (nullable = true)
 |    |    |-- update_time: string (nullable = true)
 |    |    |-- confidence: double (nullable = true)
 |-- subtype: string (nullable = true)
 |-- names: struct (nullable = true)
 |    |-- primary: string (nullable = true)
 |    |-- common: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- rules: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- variant: string (nullable = true)
 |    |    |    |-- language: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- between: array (nullable = true)
 |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |-- side: string (nullable = true)
 |-- wikidata: string (nullable = true)
 |-- division_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- is_disputed: boolean (nullable = true)
 |-- perspectives: struct (nullable = true)
 |    |-- mode: string (nullable = true)
 |    |-- countries: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- local_type: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- region: string (nullable = true)
 |-- hierarchies: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- division_id: string (nullable = true)
 |    |    |    |-- subtype: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |-- parent_division_id: string (nullable = true)
 |-- norms: struct (nullable = true)
 |    |-- driving_side: string (nullable = true)
 |-- population: integer (nullable = true)
 |-- capital_division_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- capital_of_divisions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- division_id: string (nullable = true)
 |    |    |-- subtype: string (nullable = true)
 |-- division_id: string (nullable = true)

For Division area - the difference is class field, comparing to division.

root
 |-- id: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- country: string (nullable = true)
 |-- sources: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- property: string (nullable = true)
 |    |    |-- dataset: string (nullable = true)
 |    |    |-- record_id: string (nullable = true)
 |    |    |-- update_time: string (nullable = true)
 |    |    |-- confidence: double (nullable = true)
 |-- subtype: string (nullable = true)
 |-- class: string (nullable = true)
 |-- names: struct (nullable = true)
 |    |-- primary: string (nullable = true)
 |    |-- common: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- rules: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- variant: string (nullable = true)
 |    |    |    |-- language: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- between: array (nullable = true)
 |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |-- side: string (nullable = true)
 |-- wikidata: string (nullable = true)
 |-- division_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- is_disputed: boolean (nullable = true)
 |-- perspectives: struct (nullable = true)
 |    |-- mode: string (nullable = true)
 |    |-- countries: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- local_type: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- region: string (nullable = true)
 |-- hierarchies: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- division_id: string (nullable = true)
 |    |    |    |-- subtype: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |-- parent_division_id: string (nullable = true)
 |-- norms: struct (nullable = true)
 |    |-- driving_side: string (nullable = true)
 |-- population: integer (nullable = true)
 |-- capital_division_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- capital_of_divisions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- division_id: string (nullable = true)
 |    |    |-- subtype: string (nullable = true)
 |-- division_id: string (nullable = true)
Azbesciak commented 1 week ago

release 24-11-13.0, it was also in the previous

vcschapp commented 1 week ago

Discussing in Schema TF meeting today, seems like it's a central data pipeline bug. Pipeline is currently merging all top-level properties for any feature type in the theme at the theme level.

Best result would be bug fix in data pipeline: Parquet files should only define columns for the types they contain.

vcschapp commented 6 days ago

Shared this with @ibnt1 and #tf-data-platform.

ibnt1 commented 6 days ago

i checked and the parquet files come like this into the theme promote - that is with same schema for all types.

the current schema validation process at central pipeline only checks that the data types for the columns in the each theme-type pair in the parquet to be promoted matches the data type in the reference parquet schema, no matter if that column is defined for a given type or not, so it essentially keeps all columns from the upstream theme parquet if their types are correct.

we should revisit that and perhaps enforce the allowed columns per type, probably by managing a separate reference schema for each type, as i remember there was at least one more other time in the past where this could have caught a similar issue, but that would be a large-ish work item. created work item for this: https://github.com/OvertureMaps/tf-data-platform/issues/780

but independent of that it should be first fixed in the upstream divisions pipeline - @DavidKarlas - can you please look into that?