OpenGeoMetadata / metadata-issues

Working space for metadata issues, development, and discussions
Apache License 2.0

Come up with strategy for upgrading Is Part Of field values #43

Open karenmajewicz opened 1 year ago

karenmajewicz commented 1 year ago

One of the main incompatibilities between Metadata 1.0 and Aardvark is the Is Part Of field. In 1.0, this was a string value. In Aardvark, it is an ID that the GeoBlacklight application reads to link records together.

To upgrade, users would need to create new collection records for each unique value and replace the strings with the new IDs.
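
A minimal Ruby sketch of that upgrade step, for illustration only. The helper names (`slugify`, `upgrade_is_part_of`), the `institution` ID prefix, and the stub fields on the new collection records are assumptions, not part of any OGM tool; the target field is left as a parameter because which Aardvark field should hold the IDs is itself under discussion.

```ruby
# Hypothetical helper: turn a collection name into an ID-safe slug.
def slugify(name)
  name.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-|-\z/, '')
end

# Build one stub collection record per unique Is Part Of string and return
# the child records with those strings replaced by the new IDs.
# The ID prefix and the target field are assumptions, not OGM rules.
def upgrade_is_part_of(records, prefix: 'institution', target_field: 'dct_isPartOf_sm')
  names = records.flat_map { |r| Array(r['dct_isPartOf_sm']) }.uniq
  id_map = names.to_h { |name| [name, "#{prefix}:#{slugify(name)}"] }

  # Stub records only: descriptions, subjects, bbox, etc. still need filling in.
  collections = id_map.map do |name, id|
    { 'id' => id, 'dct_title_s' => name, 'gbl_resourceClass_sm' => ['Collections'] }
  end

  children = records.map do |r|
    ids = Array(r['dct_isPartOf_sm']).map { |name| id_map.fetch(name) }
    updated = r.reject { |key, _| key == 'dct_isPartOf_sm' }
    ids.empty? ? updated : updated.merge(target_field => ids)
  end

  [collections, children]
end
```

Calling `upgrade_is_part_of(records, target_field: 'pcdm_memberOf_sm')` would write the IDs to Member Of instead.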

Pros:

Cons:

karenmajewicz commented 1 year ago

Is Part Of

kgjenkins commented 1 year ago

The metadata converter at https://kgjenkins.github.io/gbl2aardvark/ will now automatically create new "Collections" records, using information from all the existing child records. Some of the fields (subject, keyword, etc.) aggregate all the unique values found in the child records, and the bbox (dcat_bbox, locn_geometry) is automatically expanded to include all the child record bboxes.
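
To make the aggregation concrete, here is a rough Ruby sketch of the two operations described above. It is not the converter's actual code; the helper names are hypothetical, and it assumes bounding boxes are stored as `ENVELOPE(west,east,north,south)` strings.

```ruby
# Parse an "ENVELOPE(west,east,north,south)" string into four floats.
# Returns nil for anything that does not match that shape.
def parse_envelope(str)
  match = str.to_s.match(/\AENVELOPE\(([^)]*)\)\z/)
  return nil unless match

  coords = match[1].split(',').map { |n| Float(n.strip, exception: false) }
  coords.length == 4 && coords.none?(&:nil?) ? coords : nil
end

# Smallest envelope containing every child envelope. Unparsable values are
# skipped; boxes that cross the antimeridian are not handled.
def expand_bbox(envelopes)
  boxes = envelopes.map { |e| parse_envelope(e) }.compact
  return nil if boxes.empty?

  west  = boxes.map { |b| b[0] }.min
  east  = boxes.map { |b| b[1] }.max
  north = boxes.map { |b| b[2] }.max
  south = boxes.map { |b| b[3] }.min
  "ENVELOPE(#{west},#{east},#{north},#{south})"
end

# Unique values of a multivalued field across all child records, sorted.
def aggregate_field(children, field)
  children.flat_map { |c| Array(c[field]) }.uniq.sort
end
```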

I've documented the process a bit in the README.

I think this could be a viable approach, although one would want to review the new collection records -- the descriptions will certainly need editing to better reflect the whole collection. And you may not really want every placename from all the child records to be listed in the collection record.

Date values may also require clean-up -- the script keeps every unique value (which works well for single years in gbl_indexYear_im) but dct_temporal_sm may have things like this:

   "dct_temporal_sm": [
      "1998-2013",
      "1998-2014",
      "1998-2015",
      "1998-2016",
      "1999-2013",
      "1999-2014", etc.

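One possible clean-up, sketched in Ruby, is to collapse all the year ranges into a single overall range. Whether a collection record should carry one summary range is an editorial choice; the `collapse_temporal` helper is hypothetical and only recognizes `YYYY` and `YYYY-YYYY` values, leaving everything else for manual review.

```ruby
# Collapse "YYYY" and "YYYY-YYYY" strings into one overall range.
# Values in any other form are kept unchanged after the summary value.
def collapse_temporal(values)
  parsed, other = values.partition { |v| v.match?(/\A\d{4}(-\d{4})?\z/) }
  return other if parsed.empty?

  years = parsed.flat_map { |v| v.split('-').map(&:to_i) }
  first, last = years.minmax
  summary = first == last ? first.to_s : "#{first}-#{last}"
  [summary] + other
end
```
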
The collection records may also reveal spelling or capitalization inconsistencies in the child records. For example:

   "dct_subject_sm": [
      "Land Cover",
      "Land Use",
      "Land cover",
      "Land use",
      "Tree canopy", etc.

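A short Ruby sketch of how such inconsistencies could be flagged for review. The `case_variants` helper is hypothetical; it only reports values that differ in capitalization or surrounding whitespace and does not decide which spelling is correct.

```ruby
# Group values that differ only in capitalization or surrounding whitespace.
# Returns only the groups that contain more than one spelling.
def case_variants(values)
  values.uniq
        .group_by { |v| v.strip.downcase }
        .values
        .select { |group| group.length > 1 }
end
```
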
Of course, it could be nice to retain a "simple" collection field that just contains a string (similar to subject or keyword), but also have the option of the new relations-based dct_isPartOf_sm field.

karenmajewicz commented 1 year ago

In this case, dct_isPartOf_sm probably maps better to pcdm_memberOf_sm.

From the OGM documentation:

Is Part Of: To link items that are a subset of another item (e.g. a page in a book)

Member Of: To link items that are part of a collection

thatbudakguy commented 1 year ago

Another possible strategy, supported by https://github.com/OpenGeoMetadata/GeoCombine/pull/143, is to get a list of all collection records (in v1 format) before attempting the conversion from v1 to Aardvark. In Earthworks, we apparently use a layer_geom_type_s of "Collection" to indicate collections (which might not be valid in v1, but that's another story). You can export all the Collection records by querying Solr on that value.
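
As a rough illustration, such a query could be built like this in Ruby. The base URL and core name are placeholders, the `collection_query_url` helper is hypothetical, and the filter assumes collections are marked with `layer_geom_type_s` of "Collection" as described above; other installations may mark them differently.

```ruby
require 'uri'

# Build a Solr select URL that returns v1 collection records as JSON.
# solr_base must end with a slash, e.g. "http://localhost:8983/solr/my-core/"
# (a placeholder, not a real endpoint).
def collection_query_url(solr_base, rows: 1000)
  uri = URI.join(solr_base, 'select')
  uri.query = URI.encode_www_form(
    q: '*:*',
    fq: 'layer_geom_type_s:"Collection"',
    fl: 'layer_slug_s,dc_title_s',
    rows: rows,
    wt: 'json'
  )
  uri.to_s
end
```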

Once you have a list of collection records and their layer_slug_s, you can export it as any kind of structured data (JSON directly from Solr, CSV, etc.), parse it into a map from collection name to ID, and pass that map into the converter:

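# Map each v1 collection name (the old string value) to the ID of its collection record
id_map = {
  'My Collection 1' => 'institution:my-collection-1',
  'My Collection 2' => 'institution:my-collection-2'
}

# `record` is a single v1 record, already parsed into a Hash
GeoCombine::Migrators::V1AardvarkMigrator.new(v1_hash: record, collection_id_map: id_map).run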

This way, you can convert all records (including collections) at the same time.
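
For example, a name-to-ID map like the one passed to the migrator could be derived from a Solr JSON response along these lines. This is a sketch under assumptions: the `build_collection_id_map` helper is hypothetical, titles are read from the v1 `dc_title_s` field, and the existing `layer_slug_s` is reused as the collection ID.

```ruby
require 'json'

# Build a { title => id } map from a Solr JSON response containing v1
# collection records. Reusing the slug as the new ID is an assumption.
def build_collection_id_map(solr_json)
  docs = JSON.parse(solr_json).dig('response', 'docs') || []
  docs.each_with_object({}) do |doc, map|
    title = doc['dc_title_s']
    slug = doc['layer_slug_s']
    next if title.nil? || slug.nil?

    map[title] ||= slug # keep the first slug seen for a repeated title
  end
end
```

Because the map is keyed by title, records that name the same collection all resolve to one ID.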

An interesting and debatably useful side-effect of this is that it collapses collections with the same name into a single collection. While testing out this strategy, I discovered that several collections in Earthworks are duplicated, probably accidentally. The "2010 China province population census data with GIS maps" collection exists in one version with only one member and in another version with several members. While it's possible to have collections with the same name, it doesn't seem desirable from a user standpoint, so using this strategy is an easy way to consolidate duplicate collections at the same time you convert to Aardvark.
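
Before relying on that consolidation, it may be worth listing which collection names are shared by more than one record. A small Ruby sketch, assuming each collection record is a Hash with the v1 fields `dc_title_s` and `layer_slug_s` (the `duplicate_collections` helper is hypothetical):

```ruby
# Return { title => [slugs] } for every title used by more than one
# collection record.
def duplicate_collections(docs)
  docs.group_by { |d| d['dc_title_s'] }
      .transform_values { |group| group.map { |d| d['layer_slug_s'] } }
      .select { |_title, slugs| slugs.length > 1 }
end
```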