OpenGeoMetadata / metadata-issues

Working space for metadata issues, development, and discussions
Apache License 2.0

Define minimal crosswalk strategy for resource class and type #50

Closed: thatbudakguy closed this issue 1 month ago

thatbudakguy commented 1 year ago

Opening this issue to discuss the possibility of a minimal or "standard" conversion between the controlled vocabularies for

...when migrating from v1 to Aardvark.

Known implementations for this behavior are:

Some questions we could answer that I think would help unify the (currently diverging) implementations:

  1. What should happen when a record has an invalid value for dc_type_s or layer_geom_type_s? Lots of records "in the wild" seem to not completely obey these vocabularies.
  2. Which values are direct conversions? For example, "Dataset" in dc_type_s is definitely the same as "Datasets" in gbl_resourceClass_sm.
  3. Which values are reasonable to infer? For example, can "Image" in layer_geom_type_s be "Raster data" in gbl_resourceType_sm?
thatbudakguy commented 1 year ago

current behavior

resource class

| dc_type_s | gbl-1_to_aardvark | gbl2aardvark.js | V1AardvarkMigrator |
| --- | --- | --- | --- |
| Collection | Collection | Collections | Collections |
| Dataset | Dataset | Datasets | Datasets |
| Image | Image | Imagery | Imagery |
| Interactive Resource | Interactive Resource | Websites | Websites |
| Physical Object | Physical Object | Other | Other |
| Service | Service | EDIT ME -- this record had dc_type_s = Service | Web services |
| Still Image | Still Image | EDIT ME -- this record had dc_type_s = Still Image | Imagery |
| other value | other value | EDIT ME -- this record had dc_type_s = other value | Other |

resource type

| layer_geom_type_s | gbl-1_to_aardvark | gbl2aardvark.js | V1AardvarkMigrator |
| --- | --- | --- | --- |
| Point | Point | Point data | Point data |
| Line | Line | Line data | Line data |
| Polygon | Polygon | Polygon data | Polygon data |
| Image | Image | EDIT ME -- this record had layer_geom_type_s = Image | no value |
| Raster | Raster | Raster data | Raster data |
| Mixed | Mixed | EDIT ME -- this record had layer_geom_type_s = Mixed | no value |
| Table | Table | Table data | Table data |
| other value | other value | EDIT ME -- this record had layer_geom_type_s = other value | no value |

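
For illustration, the V1AardvarkMigrator column of the two tables above could be written as plain lookup tables. This is a minimal Python sketch, not GeoCombine's actual code: the function names `convert_resource_class` and `convert_resource_type` are hypothetical, and returning lists is an assumption based on the `_sm` (multivalued) suffix of the Aardvark fields.

```python
# Lookup tables transcribed from the V1AardvarkMigrator column above.
RESOURCE_CLASS_MAP = {
    "Collection": "Collections",
    "Dataset": "Datasets",
    "Image": "Imagery",
    "Interactive Resource": "Websites",
    "Physical Object": "Other",
    "Service": "Web services",
    "Still Image": "Imagery",
}

RESOURCE_TYPE_MAP = {
    "Point": "Point data",
    "Line": "Line data",
    "Polygon": "Polygon data",
    "Raster": "Raster data",
    "Table": "Table data",
}


def convert_resource_class(dc_type):
    """Return gbl_resourceClass_sm values for a v1 dc_type_s value (hypothetical helper)."""
    # Any unrecognized value falls back to "Other", as in the table.
    return [RESOURCE_CLASS_MAP.get(dc_type, "Other")]


def convert_resource_type(geom_type):
    """Return gbl_resourceType_sm values for a v1 layer_geom_type_s value (hypothetical helper)."""
    # "Image", "Mixed", and unrecognized values produce no value.
    mapped = RESOURCE_TYPE_MAP.get(geom_type)
    return [mapped] if mapped else []
```

Keeping the crosswalk as data rather than branching logic makes it easier to compare implementations line by line against the tables.
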
kgjenkins commented 1 year ago

For Resource Class, the DCMI definition for "Still Image" mentions maps, so I don't think it will be possible to automatically convert "Still Image" to either "Imagery" or "Maps" (at least without trying to glean information from other fields like the title or description).

For Resource Type, I'm hesitant to convert "Image" to "Raster data", since the image is most likely an aerial image or some form of map, and the recommended Resource Type terms include specific terms like "Aerial photographs", "Bathymetric maps", "Cadastral maps", "Fire insurance maps", "Nautical charts", "Topographic maps", and many other types of maps. To me "Raster data" is numeric data like elevation, precipitation, or temperature -- not imagery that could be further processed to extract data.

thatbudakguy commented 1 year ago

These are both good points. re: resource type, I think you're right that something like a (scanned) Sanborn map is an Image, but isn't (raster) Data (unless you do some more processing on it).

re: resource class, the definition for "Still Image" says:

Instances of the type Still Image must also be describable as instances of the broader type Image.

so, if we convert "Image" to Imagery, it seems like "Still Image" must necessarily be Imagery (but isn't necessarily Maps). does that make sense?

kgjenkins commented 1 year ago

I'm not finding any records with "dc_type_s":"Still Image" in any of the OGM repos (using the github search within the OGM organization), so maybe we don't need to worry about that value. Not finding any "Physical Object" records either.
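
For anyone who wants to run the same check against a local clone rather than the GitHub search, a rough Python sketch follows. It assumes the v1 records are stored as `.json` files (one record or a list of records per file); the function name `tally_field` is made up for this example.

```python
import json
from collections import Counter
from pathlib import Path


def tally_field(root, field="dc_type_s"):
    """Count the values of `field` across all .json files under `root`."""
    counts = Counter()
    for path in Path(root).rglob("*.json"):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid JSON
        # A file may hold a single record or a list of records.
        records = data if isinstance(data, list) else [data]
        for record in records:
            if isinstance(record, dict) and field in record:
                value = record[field]
                # Non-string values are serialized so they can be counted too.
                key = value if isinstance(value, str) else json.dumps(value)
                counts[key] += 1
    return counts
```

Calling `tally_field("path/to/clone")` and then `tally_field("path/to/clone", "layer_geom_type_s")` would show which values actually occur, including ones outside the controlled vocabularies.
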

thatbudakguy commented 1 year ago

I agree that in practice it probably will rarely happen, but I was thinking this issue could be a place to decide the "official" strategy for any value that might occur — so any valid dc_type_s as well as other random values (which definitely show up in our data).

The assumption underlying that is that other folks looking to migrate from v1 to Aardvark will be in a situation where they don't have time to manually correct or postprocess all the records after conversion, and just want a converted version that is "the least wrong". So, converting everything in one go and knowing exactly how the fields will be mapped (regardless of which converter/implementation you use) would be useful. But maybe other folks aren't in this situation or don't have the same constraints?

kgjenkins commented 1 year ago

Yes, we should probably just pick a behavior and document it. It would be good if the process could also output a list of warnings (for things such as this) that folks could choose to follow up on (or not).
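
One possible shape for the warnings idea, sketched in Python: the conversion returns the mapped values together with a list of messages, so nothing is silently dropped. The function name `convert_field` and the message wording are invented for this example, and it is not taken from any of the converters discussed here; it assumes the v1 `layer_slug_s` field is available to identify the record.

```python
def convert_field(record, source_field, mapping, fallback=None):
    """Map record[source_field] through `mapping`, collecting warnings.

    Returns (values, warnings): `values` is a list for the multivalued
    Aardvark field, `warnings` is a list of human-readable messages.
    """
    warnings = []
    record_id = record.get("layer_slug_s", "<unknown id>")
    value = record.get(source_field)

    # Recognized value: convert directly, nothing to warn about.
    if isinstance(value, str) and value in mapping:
        return [mapping[value]], warnings

    if value is None:
        warnings.append(f"{record_id}: {source_field} is missing")
    else:
        warnings.append(f"{record_id}: unrecognized {source_field} = {value!r}")

    # Use the fallback if one is given, otherwise emit no value.
    values = [fallback] if fallback is not None else []
    return values, warnings


# Example: an out-of-vocabulary dc_type_s falls back to "Other" with one warning.
values, warnings = convert_field(
    {"layer_slug_s": "example-1", "dc_type_s": "Map"},
    "dc_type_s",
    {"Dataset": "Datasets"},
    fallback="Other",
)
```

A caller could concatenate the warnings from every record and write them to a log or report after the batch conversion finishes.
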

thatbudakguy commented 1 year ago

Updated the behavior table above to reflect the changes in https://github.com/OpenGeoMetadata/GeoCombine/pull/143/commits/f30e6d19a08720bbc24f8b7b70df6bb0742d4fd9.

karenmajewicz commented 1 month ago

@thatbudakguy does this need any more work or can we close the issue?

thatbudakguy commented 1 month ago

I think this can be considered implemented in recent versions of GeoCombine (and in other libraries mentioned above), so closing.