emory-libraries / dlp-curate

Digital curation and preservation workbench for the Emory Preservation Repository.

Revise output of metadata-only export CSVs #2036

Status: Closed (eporter23 closed this issue 1 year ago)

eporter23 commented 1 year ago

After testing the basic metadata export capabilities, we noticed that the structure of the exported CSV differs from the structure our importer expects.

We want to ensure that Curate users can use these exported CSVs to do metadata cleanup and then re-import them later.

Current issues:

bwatson78 commented 1 year ago

Preliminary Findings:

eporter23 commented 1 year ago

@bwatson78 thanks for the reminder about file vs files, I always forget. With file_type it's fine with me if we store it for FileSet objects, but I also expect that would require a lot of reindexing. To avoid reindexing ~300K filesets, I guess we could incorporate that into any export-related workflows so that we reindex selected ids prior to exporting.

eporter23 commented 1 year ago

With the pcdm_use situation, I think that Zizia (or maybe Curate) defaults that to Primary Content unless otherwise specified. I'm pretty sure Zizia only requires that it be populated if it's something different like Supplemental Content or Supplemental Preservation File. All that to say, I think it's okay if it's not populated because we can infer that it's Primary Content unless there's a value there.
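The inference described above can be sketched as a small helper for anyone post-processing an exported CSV. This is only an illustration under assumptions: the function name and the dict-per-row representation are hypothetical, and it is not how Zizia or Curate implement the default.

```python
# Assumed default, per the comment above: a blank pcdm_use means Primary Content.
DEFAULT_PCDM_USE = "Primary Content"


def effective_pcdm_use(row: dict) -> str:
    """Return the row's pcdm_use, falling back to the default when blank or absent.

    `row` is a hypothetical dict for one CSV row (for example from csv.DictReader).
    """
    value = (row.get("pcdm_use") or "").strip()
    return value or DEFAULT_PCDM_USE
```

With this, a row that has no pcdm_use column, or only whitespace in it, is treated as Primary Content, while a row carrying Supplemental Content keeps that value.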

eporter23 commented 1 year ago

@bwatson78 I just tested the revised Visibility output and it is much improved. There are two settings that I wanted to ask about. Is it possible to adjust these? Public currently outputs as Open, and Emory High Download currently outputs as Authenticated.

bwatson78 commented 1 year ago

This is where we run into the issue of the same value being assigned to multiple keys. Unfortunately, in both cases, these are the first keys that the system encounters when it looks up the value.
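To illustrate the behaviour described here, the following is a minimal sketch of a reverse lookup over a mapping in which two labels share one stored value. The mapping contents, the stored values, and both function names are assumptions made for the example, not Curate's actual configuration.

```python
# Hypothetical import mapping: CSV label -> stored visibility value.
# Two labels point at each stored value, which is what makes the
# reverse (export) direction ambiguous.
LABEL_TO_STORED = {
    "Open": "open",
    "Public": "open",
    "Authenticated": "authenticated",
    "Emory High Download": "authenticated",
}


def export_label(stored: str) -> str:
    """Return the first label whose stored value matches (dicts keep insertion order)."""
    for label, value in LABEL_TO_STORED.items():
        if value == stored:
            return label
    raise KeyError(stored)


def import_value(label: str) -> str:
    """Translate a CSV label back into the stored value."""
    return LABEL_TO_STORED[label]
```

In this sketch the export picks whichever label comes first for a value, but because every label for that value maps back to the same stored value, an exported label still imports to the original setting.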

eporter23 commented 1 year ago

Initial testing was looking great for the multi-valued fields, but I just now exported a portion of a book that had been imported by Bulkrax (if that has any significance; the other exports had not been). I'm seeing date_created and date_created_1 as well as holding_repository and holding_repository_1. Will send you the CSV in Slack so you can see.
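As a stopgap while the export itself is being looked at, numbered columns like these could be folded back into a single multi-valued column before re-import. The sketch below is one possible approach under assumptions: the function name is hypothetical, and the "|" delimiter is a placeholder that should be replaced with whatever the importer actually expects.

```python
import re

# Assumed multi-value delimiter; substitute the importer's real one.
DELIMITER = "|"

# Matches headers such as "date_created_1" and captures "date_created".
_NUMBERED = re.compile(r"^(?P<base>.+)_\d+$")


def merge_numbered_columns(row: dict, delimiter: str = DELIMITER) -> dict:
    """Fold "field_1", "field_2", ... into "field" for one CSV row."""
    collected = {}
    for header, value in row.items():
        match = _NUMBERED.match(header)
        # Treat "x_1" as a repeat of "x" only when a plain "x" column also
        # exists, so a field whose real name ends in a number is left alone.
        base = match.group("base") if match and match.group("base") in row else header
        values = collected.setdefault(base, [])
        if value:  # skip empty cells
            values.append(value)
    return {header: delimiter.join(values) for header, values in collected.items()}
```

Values are joined in the order the columns appear in the row, and empty numbered cells are dropped rather than producing a trailing delimiter.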

eporter23 commented 1 year ago

@bwatson78 for those two visibility values, do you know if the importer would accept them and translate them correctly? That is, if we exported a CSV and fed it back in, would Open be interpreted as "Public"?

bwatson78 commented 1 year ago

Yes, they would.

eporter23 commented 1 year ago

@bwatson78 All of the items in this ticket are looking great.