eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

metadata only items #35

Closed photomedia closed 2 years ago

photomedia commented 2 years ago

Currently, any item exported with the plugin that is metadata-only (no document or upload of any kind) causes a transfer error in Archivematica.
For these items, only metadata is exported, and we end up with a folder structure that has empty "objects"->"documents" and "derivatives" folders, and an empty "checksum.md5" file in "metadata". The export contains only metadata.json, EP3.xml and the "revision" XML files. This fails at the transfer stage of Archivematica, with an error at the "verify checksum" stage, because the checksum file contains no checksums: there are no "objects" to checksum.

We need to decide how to deal with this at the plugin level. One can argue that metadata-only records don't belong in Archivematica at all. If that's the way forward, I suggest adding a condition in the plugin so that eprints with no documents are exported to a different folder location, set as an option in cfg.d: $c->{archivematica}->{metadata_only_path}. They would then simply be skipped by the Archivematica automation tools (which typically monitor the main {path}), but the metadata would still be exported, perhaps to be stored along with the "logs" from the preservation batch jobs (create_transfer, process_transfer, touch_transfer). This means adding one more option for the additional location and "processing" the metadata-only items into it. In the AM dataset, the actions would be logged, but the record would not receive an "Archived" status with an AM UUID for the AIP. If a document were ever added to the eprint, it would then make its way to the main $c->{archivematica}->{path}.
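As a rough sketch of what the two options might look like side by side in a cfg.d file (the directory values are placeholders made up for illustration, not defaults shipped with the plugin):

```perl
# cfg.d sketch: both directory values are placeholders, not plugin defaults.

# Main transfer location, typically the one monitored by automation tools.
$c->{archivematica}->{path} = '/path/to/archivematica/transfers';

# Proposed separate location for eprints that have no documents.
$c->{archivematica}->{metadata_only_path} = '/path/to/archivematica/metadata_only';
```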

The other option would be to attempt to modify the transfer in a way that would get Archivematica to accept/ingest an AIP with no "objects"; but I don't think that's a good idea.

Any comments on that? I would like to resolve this issue, as I think the plugin is ready for a release once this is resolved.

jb4 commented 2 years ago

I think the plugin should be changed so that it can, and by default does, skip over metadata-only records. However, what do Archivematica think? Do they already ingest metadata-only records in a different way?

Justin

-- Justin Bradley, Strategy & Technical Lead, EPrints Services, https://eprints.org/

WAIS / ECS / University of Southampton / B32 1023

From: Tomasz Neugebauer. Date: Wednesday, 8 December 2021 at 19:51. To: eprintsug/EPrintsArchivematica. Subject: [eprintsug/EPrintsArchivematica] metadata only items (Issue #35)


jesusbagpuss commented 2 years ago

If a metadata-only record gets a file added, how should the transfer from metadata_only_path to the main path work?

I think the proposal for metadata_only_path is good. If that config were undef, would that mean that metadata-only records are not in scope for Archivematica? I think this would be sensible, and it would allow repositories adopting the plugin to keep metadata-only records away from Archivematica entirely.

photomedia commented 2 years ago

There is no need to transfer anything from the metadata_only_path folder if a file is added. Adding a file would trigger a new export flag on that item, so the next time process_transfers is run, it would export everything to the archivematica->path folder, where it would get picked up for archiving. The metadata-only folder would keep the metadata that was exported there until the administrator deletes it "manually", just like other logs. All of the processing history, including the metadata-only actions, would be logged in the AM dataset in EPrints.

Thanks for the suggestion that leaving metadata_only_path undefined should result in the record being skipped, with the metadata not exported anywhere. That's how I did it.
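Roughly, the decision comes down to three cases. The following is a standalone sketch using a plain hash for the config and a hypothetical helper name (export_dir_for); it is not the code from the plugin, and the paths are placeholders:

```perl
use strict;
use warnings;

# Hypothetical helper, not the plugin's actual code: pick the export
# directory for an eprint from the archivematica config hash and the
# number of documents attached to the eprint.
sub export_dir_for
{
    my( $conf, $doc_count ) = @_;

    # Items with at least one document go to the main transfer path.
    return $conf->{path} if $doc_count > 0;

    # Metadata-only items go to metadata_only_path when it is defined;
    # otherwise undef comes back and the caller skips the export.
    return $conf->{metadata_only_path};
}

# Placeholder paths, for illustration only.
my $conf = {
    path               => '/path/to/transfers',
    metadata_only_path => '/path/to/metadata_only',
};

for my $doc_count ( 2, 0 )
{
    my $dir = export_dir_for( $conf, $doc_count );
    print "$doc_count document(s): ", ( defined $dir ? $dir : 'skipped' ), "\n";
}
```

With metadata_only_path left out of the hash, the second case returns undef, which corresponds to the "skip and export nothing" behaviour described above.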

The change is made now in this commit: https://github.com/eprintsug/EPrintsArchivematica/commit/36b645448101176cf5e46d5933ac72866a513ec5

Thanks for your help!