Revise output of metadata-only export CSVs

emory-libraries / dlp-curate

Digital curation and preservation workbench for the Emory Preservation Repository.


Revise output of metadata-only export CSVs #2036

Closed eporter23 closed 1 year ago

eporter23 commented 1 year ago

After testing basic export capabilities for metadata, we noticed that the exported CSV structure differs from our importer CSV structure.

We want to ensure that Curate users can use these exported CSVs to do metadata cleanup and then re-import them later.

Current issues:

FileSet rows are not populated
The file column does not exist; file IDs are instead stored in a children column. Ideally we want to export the original filenames and not the ID for the fileset.
file_type is not present
pcdm_use is not present
All multi-valued fields' headers get relabeled with *_1 even if there are no entries populated.
Multi-valued fields with values are split into single columns storing one value each, and the headers are relabeled with _1, _2, etc.
the visibility field values should translate back to our labels (e.g. Private, Emory High Download, Public Low Res etc.)
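To make the multi-valued header problem concrete, here is a minimal Python sketch of collapsing numbered columns (`title_1`, `title_2`, ...) back into a single column per field. It is not Bulkrax or Curate code. The function name `collapse_numbered_columns` and the `|` delimiter are assumptions made for illustration; the importer's actual multi-value delimiter should be substituted. The sketch also assumes no real field name legitimately ends in `_<digits>`.

```python
import re
from collections import defaultdict

# Matches headers such as "date_created_1" -> base "date_created", n 1.
NUMBERED = re.compile(r"^(?P<base>.+)_(?P<n>\d+)$")


def collapse_numbered_columns(row, delimiter="|"):
    """Merge numbered columns into one column per base header.

    `row` is a dict of header -> value (e.g. from csv.DictReader).
    The delimiter is a hypothetical choice, not the importer's confirmed one.
    """
    grouped = defaultdict(list)
    for header, value in row.items():
        match = NUMBERED.match(header)
        base = match.group("base") if match else header
        position = int(match.group("n")) if match else 0
        grouped[base].append((position, value))

    collapsed = {}
    for base, pairs in grouped.items():
        # Keep values in column order and skip empty cells.
        ordered = sorted(pairs, key=lambda pair: pair[0])
        collapsed[base] = delimiter.join(v for _, v in ordered if v)
    return collapsed
```

With this approach an unpopulated `subject_1` column becomes an empty `subject` column, and a field exported as both `date_created` and `date_created_1` ends up under one header.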

bwatson78 commented 1 year ago

Preliminary Findings:

FileSets will populate. I think the reason they weren't appearing is that the Limit on some of the Test exports was set smaller than the number of Works in that Collection, and since Works and FileSets are processed as CsvEntries (Works first, then FileSets), the FileSets weren't even touched.
file is present, but, of course, it's using that pesky _1 numbering system. Just a reminder: our Bulkrax importing won't operate on files, only file.
file_type was originally intended as a throwaway field that was used to help process the files into the right mounter. I had no clue that exporting was on our horizon. Is it okay if we make it a stored attribute in FileSet objects, @eporter23? Otherwise, I'd have to generate it dynamically every time "on the way out."
Some FSs have pcdm_use, which is exposing a bug I think our Bulkrax importing has--FSs are being saved with that field unpopulated. I will have to dig into the cause of that soon.
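The Limit behavior described in the first finding can be illustrated with a small Python sketch. This is a simplified model, not Bulkrax's actual exporter: the function `export_entries` and the tuple representation of entries are hypothetical. It only shows why a limit smaller than the number of Works leaves no room for FileSets when Works are queued first.

```python
def export_entries(works, file_sets, limit):
    """Return the entries an export would process under a given limit.

    Works are queued before FileSets, so the limit is consumed by Works first.
    """
    queue = [("Work", work) for work in works]
    queue += [("FileSet", file_set) for file_set in file_sets]
    return queue[:limit]


# Three Works and two FileSets, but a limit of 2: only Works are exported.
entries = export_entries(["w1", "w2", "w3"], ["fs1", "fs2"], limit=2)
exported_types = {entry_type for entry_type, _ in entries}
```

In this model, FileSet rows only start to appear once the limit exceeds the number of Works in the collection.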

eporter23 commented 1 year ago

@bwatson78 thanks for the reminder about file vs files, I always forget. With file_type it's fine with me if we store it for FileSet objects, but I also expect that would require a lot of reindexing. To avoid reindexing ~300K filesets, I guess we could incorporate that into any export-related workflows so that we reindex selected ids prior to exporting.

eporter23 commented 1 year ago

With the pcdm_use situation, I think that Zizia (or maybe Curate) defaults that to Primary Content unless otherwise specified. I'm pretty sure Zizia only requires that it be populated if it's something different like Supplemental Content or Supplemental Preservation File. All that to say, I think it's okay if it's not populated because we can infer that it's Primary Content unless there's a value there.
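As a sketch of that inference, a cleanup script reading an exported CSV could treat a blank `pcdm_use` cell as Primary Content. The helper name `effective_pcdm_use` is hypothetical, and the default is taken from the assumption stated in the comment above rather than from verified Zizia or Curate behavior.

```python
# Assumed default, per the comment above; not verified against Zizia or Curate.
DEFAULT_PCDM_USE = "Primary Content"


def effective_pcdm_use(row):
    """Return the row's pcdm_use, falling back to the assumed default when blank."""
    value = (row.get("pcdm_use") or "").strip()
    return value or DEFAULT_PCDM_USE
```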

eporter23 commented 1 year ago

@bwatson78 I just tested the revised Visibility output and it is much improved. There are two settings that I wanted to ask about. Is it possible to adjust these?

Public currently outputs as Open
Emory High Download currently outputs as Authenticated

bwatson78 commented 1 year ago

This is where we run into the issue of multiple keys being assigned the same value. Unfortunately, for both, these are the first keys that the system encounters when provided with the value.
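A minimal Python sketch of this behavior follows. The mapping is an assumption for illustration: the labels come from this thread, but the stored values (`open`, `authenticated`, `restricted`) and the key order are hypothetical and may not match Curate's actual configuration. It shows why a reverse lookup returns the first matching label, and why either label for the same value still imports to the same result.

```python
# Hypothetical label -> stored value mapping; two labels can share one value.
VISIBILITY_LABELS = {
    "Open": "open",
    "Public": "open",
    "Authenticated": "authenticated",
    "Emory High Download": "authenticated",
    "Private": "restricted",
}


def label_for(value):
    """Export direction: return the first label whose stored value matches."""
    for label, stored in VISIBILITY_LABELS.items():
        if stored == value:
            return label
    return value  # Unknown values pass through unchanged.


def value_for(label):
    """Import direction: any known label maps to its stored value."""
    return VISIBILITY_LABELS.get(label, label)
```

Under these assumptions, `label_for("open")` yields `Open` rather than `Public`, yet `value_for("Open")` and `value_for("Public")` are identical, so an exported CSV would round-trip to the same visibility.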

eporter23 commented 1 year ago

Initial testing was looking great for the multi-valued fields, but I just now exported a portion of a book that had been imported by Bulkrax (if that has any significance; the other exports had not been). I'm seeing date_created and date_created_1 as well as holding_repository and holding_repository_1. Will send you the CSV in Slack so you can see.

eporter23 commented 1 year ago

@bwatson78 for those two visibility values, do you know if the importer would accept them and translate them correctly? That is, if we exported and fed the CSV back in, would Open be interpreted as "Public"?

bwatson78 commented 1 year ago

Yes, they would.

eporter23 commented 1 year ago

@bwatson78 All of the items in this ticket are looking great.