emory-libraries / dlp-curate

Digital curation and preservation workbench for the Emory Preservation Repository.

Bulkrax: add full Curate metadata to CSV parser #1937

Closed by eporter23 1 year ago

eporter23 commented 2 years ago

For our initial prototypes we added Curate's required metadata only. To continue our assessment of Bulkrax we should add the remaining metadata fields.

For reference, see the Zizia importer field list. https://curate-test.library.emory.edu/importer_documentation/guide

bwatson78 commented 2 years ago

PR made: https://github.com/emory-libraries/dlp-curate/pull/1952

bwatson78 commented 2 years ago

All mapped fields: ["abstract", "access_restriction_notes", "administrative_unit", "author_notes", "conference_dates", "conference_name", "contact_information", "content_genres", "content_type", "contributors", "copyright_date", "creator", "data_classifications", "data_collection_dates", "data_producers", "data_source_notes", "date_created", "date_digitized", "date_issued", "deduplication_key", "edition", "emory_ark", "emory_rights_statements", "extent", "file", "file_types", "final_published_versions", "geographic_unit", "grant_agencies", "grant_information", "holding_repository", "institution", "internal_rights_note", "isbn", "issn", "issue", "keywords", "legacy_rights", "local_call_number", "model", "notes", "other_identifiers", "page_range_end", "page_range_start", "parent", "parent_title", "pcdm_use", "place_of_production", "primary_language", "primary_repository_ID", "publisher", "publisher_version", "re_use_license", "related_datasets", "related_material_notes", "related_publications", "rights_documentation", "rights_holders", "rights_statement", "scheduled_rights_review", "scheduled_rights_review_note", "sensitive_material", "sensitive_material_note", "series_title", "source_collection_id", "sponsor", "staff_notes", "subject_geo", "subject_names", "subject_time_periods", "subject_topics", "sublocation", "system_of_record_ID", "table_of_contents", "technical_note", "title", "transfer_engineer", "uniform_title", "visibility", "volume"]

All multivalued fields: ["access_restriction_notes", "content_genres", "contributors", "creator", "data_classifications", "data_collection_dates", "data_producers", "data_source_notes", "emory_ark", "emory_rights_statements", "file", "file_types", "final_published_versions", "grant_agencies", "grant_information", "keywords", "notes", "other_identifiers", "related_datasets", "related_material_notes", "related_publications", "rights_holders", "rights_statement", "staff_notes", "subject_geo", "subject_names", "subject_time_periods", "subject_topics", "title"]

All parsed fields: ["administrative_unit", "content_type", "data_classifications", "pcdm_use", "publisher_version", "re_use_license", "rights_statement", "sensitive_material", "title", "visibility"]
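As a rough illustration of how the three lists above relate, here is a minimal Ruby sketch that combines them into a single mapping hash. The `from`, `split`, and `parsed` keys are modeled on Bulkrax's field-mapping options, but the helper `build_field_mappings`, the constants, and the shortened field lists are hypothetical and are not taken from the PR.

```ruby
# Hypothetical sketch: derive per-field mapping options from the three lists.
# Only a few fields are shown; the full lists are in the comment above.
MAPPED      = %w[abstract creator title visibility].freeze
MULTIVALUED = %w[creator title].freeze
PARSED      = %w[title visibility].freeze

# Returns a hash keyed by field name. Every mapped field gets a :from entry;
# multivalued fields also get a :split pattern, and parsed fields get
# parsed: true. The '\|' pattern (an escaped pipe) is an assumption.
def build_field_mappings(mapped, multivalued, parsed)
  mapped.each_with_object({}) do |field, mappings|
    options = { from: [field] }
    options[:split] = '\|' if multivalued.include?(field)
    options[:parsed] = true if parsed.include?(field)
    mappings[field] = options
  end
end

FIELD_MAPPINGS = build_field_mappings(MAPPED, MULTIVALUED, PARSED)
```

In a real application a hash like this would presumably be assigned to the CSV parser's entry in the Bulkrax configuration; that wiring is omitted here because it depends on the application's initializer.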

eporter23 commented 2 years ago

@bwatson78 pretty good so far! A few bugs while testing:

Filesets seem to be getting created in duplicates - see example.
Multi-valued fields are outputting the literal "|" as part of the value and not getting split into separate values - see same example.

This work was generated by this importer. Another importer behaved similarly.
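For context, the expected behavior for a multi-valued cell can be sketched in plain Ruby. This is only an illustration of splitting on a literal pipe, assuming "|" is the delimiter used in the importer CSV; `split_multivalue` is a hypothetical helper, not Bulkrax's or Curate's actual code.

```ruby
# Hypothetical helper: split one CSV cell into separate values on a literal
# pipe. Passing a String (not a Regexp) to String#split matches it literally.
# Surrounding whitespace is trimmed, empty entries are dropped, and the
# original left-to-right order of the values is kept.
def split_multivalue(cell, delimiter = '|')
  cell.to_s.split(delimiter).map(&:strip).reject(&:empty?)
end

split_multivalue('Smith, Jane | Doe, John') # => ["Smith, Jane", "Doe, John"]
```

A cell with no delimiter comes back as a one-element array, and a blank or nil cell comes back as an empty array.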

eporter23 commented 2 years ago

@bwatson78 In testing the full set of our metadata fields, I notice that fields supporting multiple values don't always list them in the order they were entered. In this work, the following fields list their entries out of order:

data_collection_dates
subject_names
subject_topics

contributors does display the entries in correct order in this work.

Here is the importer CSV showing the original field values.

eporter23 commented 2 years ago

Re-running this set of metadata on another work (and adding a few more multi-valued entries) results in a different order for the above fields - seems to be kind of random.

eporter23 commented 2 years ago

This is working great with the latest revisions.