
bcgov / FIT_changedetector

Compare two sets of geodata, reporting on various types of differences

Apache License 2.0


output schemas #12

Closed smnorris closed 2 weeks ago

smnorris commented 4 weeks ago

- include only columns with changes in diffs
- retain source column order in output schemas

This makes review much simpler for datasets with many columns.
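A minimal sketch of the two points above, using plain Python dicts rather than the tool's actual code. The function name `changed_columns` and the sample records are hypothetical; it assumes records are matched on a primary key and that a column counts as changed if it differs in at least one matched record.

```python
def changed_columns(source_columns, original_rows, new_rows):
    """Return the columns that differ in at least one matched record.

    source_columns: column names in source order.
    original_rows / new_rows: dicts keyed by primary key, each value a
    dict of column -> value. Only keys present in both are compared.
    """
    shared_keys = original_rows.keys() & new_rows.keys()
    changed = set()
    for key in shared_keys:
        for col in source_columns:
            if original_rows[key].get(col) != new_rows[key].get(col):
                changed.add(col)
    # Filter the source column list so the result keeps the source order.
    return [col for col in source_columns if col in changed]


columns = ["id", "name", "status", "owner", "notes"]
original = {
    1: {"id": 1, "name": "A", "status": "open", "owner": "x", "notes": ""},
    2: {"id": 2, "name": "B", "status": "open", "owner": "y", "notes": "a"},
}
new = {
    1: {"id": 1, "name": "A", "status": "closed", "owner": "x", "notes": ""},
    2: {"id": 2, "name": "B", "status": "open", "owner": "y", "notes": "b"},
}
print(changed_columns(columns, original, new))  # ['status', 'notes']
```

Filtering the ordered source list, rather than iterating over the set of changed names, is what preserves the source column order.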

andershopperstead commented 3 weeks ago

The column filtering improvement you implemented is great for the MODIFIED_ATTR and MODIFIED_BOTH output layers, but the change has also impacted the DELETED and NEW layers, which is not helpful unfortunately. Please show all columns for DELETED and NEW. My apologies for not making this clear in the request.

smnorris commented 3 weeks ago

no problem, this was my assumption.

smnorris commented 2 weeks ago

@andershopperstead - one clarification, what should the schema be for MODIFIED_GEOM ? There are no changed attributes, so I'd presume just retain the primary key/hash key? Or retain all 'original' attributes?

- NEW - 'new' schema
- DELETED - 'original' schema
- UNCHANGED - 'original' schema
- MODIFIED_BOTH - changed columns only
- MODIFIED_ATTR - changed columns only
- MODIFIED_GEOM - ?

smnorris commented 2 weeks ago

Also, SHAPE_AREA/SHAPE_LENGTH etc. are now ignored when comparing attributes, but should they still be included in the NEW / DELETED / UNCHANGED outputs? (I'm defaulting to simply dropping them)

andershopperstead commented 2 weeks ago

Great clarification requests. We would like the 'original' schema for MODIFIED_GEOM please.

With regards to the system fields like SHAPE_LENGTH, they are not editable for users and therefore they can be dropped from any outputs.

Thank you!
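The schema decisions agreed so far could be summarised roughly as follows. This is a sketch under stated assumptions, not the tool's implementation: `output_columns` and `SYSTEM_FIELDS` are hypothetical names, the system field set lists only the two fields named in this thread, and whether the primary key is kept in the MODIFIED_ATTR / MODIFIED_BOTH outputs is not settled here.

```python
# Only the fields named in this thread; the real list is likely longer.
SYSTEM_FIELDS = {"SHAPE_AREA", "SHAPE_LENGTH"}


def output_columns(layer, original_columns, new_columns, changed=()):
    """Pick the attribute columns for one output layer, in source order."""
    if layer == "NEW":
        selected = new_columns
    elif layer in ("DELETED", "UNCHANGED", "MODIFIED_GEOM"):
        selected = original_columns
    elif layer in ("MODIFIED_ATTR", "MODIFIED_BOTH"):
        # Changed columns only, ordered as in the original source.
        selected = [c for c in original_columns if c in changed]
    else:
        raise ValueError(f"unknown output layer: {layer}")
    # System fields are not user-editable, so drop them from every output.
    return [c for c in selected if c.upper() not in SYSTEM_FIELDS]


original_columns = ["id", "pet", "SHAPE_AREA"]
new_columns = ["id", "pet", "colour", "SHAPE_LENGTH"]
for layer in ("NEW", "DELETED", "MODIFIED_GEOM", "MODIFIED_ATTR"):
    print(layer, output_columns(layer, original_columns, new_columns, {"pet"}))
```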

smnorris commented 2 weeks ago

Is retaining the original name of the geometry column important? Currently we default to calling it geometry in all outputs, regardless of what it is called in the source (i.e. SHAPE, GEOMETRY, etc.).

I'm thinking for NEW, DELETED, UNCHANGED using the source column name makes sense and is easy to apply. For the MODIFIED_ outputs the source name would be fine too, but in instances where they do not match between sources we'd have to pick a name. I'm not sure which would be better.

andershopperstead commented 2 weeks ago

The name of the geometry column is not important. Our tools allow us to copy geometry between features without referencing a field name, so using a generic default to support the functionality of the tool is perfect.

smnorris commented 2 weeks ago

Another assumption I've been making - for a given record in both sources, addition of new columns in the new dataset is not detected as a change.

For example, an original record with properties {id=1, pet="dog"} matching a new record {id=1, pet="dog", colour="grey"} would result in a row in UNCHANGED with values {id=1, pet="dog"}. ~~Hopefully this works, making the comparison is trickier otherwise~~ Writing the new record with all source columns to UNCHANGED would be simple if preferred. But detecting it as a change would require more thought: would it be MODIFIED_ATTR? If so, would the resulting row be colour_original=NULL, colour_new="grey"?

andershopperstead commented 2 weeks ago

Yes - your assumption works. Tracking the addition of data in new columns is beyond the scope of this tool.
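The agreed behaviour can be illustrated with the pet example from this thread. This is a minimal sketch, assuming the comparison runs only over the original record's columns; `classify` is a hypothetical name, and the handling of columns that are missing from the new record is not discussed in the thread.

```python
def classify(original, new):
    """Compare two matched records on the original record's columns only.

    Columns that exist only in the new record are ignored, so adding a
    column is not reported as a change.
    """
    changed = [col for col in original if original[col] != new.get(col)]
    if changed:
        # Pair each changed value as (original, new).
        return "MODIFIED_ATTR", {c: (original[c], new.get(c)) for c in changed}
    # Unchanged rows are written with the original schema.
    return "UNCHANGED", dict(original)


original = {"id": 1, "pet": "dog"}
new = {"id": 1, "pet": "dog", "colour": "grey"}
print(classify(original, new))  # ('UNCHANGED', {'id': 1, 'pet': 'dog'})
```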