eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

Large corpus: Automation Tools stalling #36

Open geo-mac opened 2 years ago

geo-mac commented 2 years ago

This is ostensibly an Archivematica issue, but future improvements to the plugin could benefit the export and ingest process.

In instances where a large number of items have been exported from EPrints for Archivematica ingest, it is common for Archivematica's 'Automation Tools' (AT) functionality to stall. Liaison with Artefactual indicates that this is because AT can struggle with the number of directories in the transfer source (which for us is > 60,000). But even at lower numbers AT can stall, requiring technical intervention on the Archivematica side to restart AT. This stalling can happen frequently and, as EPrints repositories expand their preservation exports, it will become an increasing issue.

A possible solution/improvement might be to enhance the directory structure used to export EPrints by adding an additional level of hierarchy. Instead of exporting each AID into its own top-level directory, export each EPrint AID directory within a corresponding parent directory -- for example, grouping AIDs into ranges of 1-999, 1000-1999, 2000-2999, and so forth?
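A minimal sketch of that bucketing idea, assuming buckets of 1,000 AIDs computed by integer division (so the first bucket comes out as 0-999 rather than the 1-999 above); the function and directory names are illustrative, not part of the plugin:

```python
import os

BUCKET_SIZE = 1000  # assumed bucket width; any size would work

def bucketed_path(export_root: str, aid: int) -> str:
    """Return a transfer path with one extra level of hierarchy,
    so no single directory holds tens of thousands of siblings."""
    low = (aid // BUCKET_SIZE) * BUCKET_SIZE
    high = low + BUCKET_SIZE - 1
    bucket = f"{low}-{high}"  # e.g. "2000-2999"
    return os.path.join(export_root, bucket, str(aid))

print(bucketed_path("/exports", 2471))
```

With 60,000+ items this caps each directory at roughly 1,000 entries, which is the kind of fan-out reduction the suggestion is aiming at.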

(Attached diagram: AID-eprints-export)

photomedia commented 2 years ago

Thanks. This is why I added the --limit parameter to the bin scripts (https://github.com/eprintsug/EPrintsArchivematica#bin-scripts). I use it to cap how many items I process at one time. For the initial export of all of our content, we will need to export 20,000 items, but I won't do this all at once; instead, I work in batches of 500 or 1,000 at a time. After each batch is done and processed, I check the logs and Archivematica records in the GUI to confirm that every item has a UUID in EPrints, meaning they were all archived, then clear the transfers in the Archivematica dashboard and clear the transfer directory. After the initial export of 20,000+ items is done, I don't imagine I will need these limits, because we will just be exporting one day's (or one week's) worth of deposits.
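The batch workflow above can be sketched as a loop around a `run_export` callable standing in for the plugin's bin script invoked with its --limit parameter; everything except --limit itself is an assumption here:

```python
def export_in_batches(total_items: int, batch_size: int, run_export) -> int:
    """Call run_export(limit) repeatedly until total_items are covered.

    Returns the number of batches run. Between batches the operator is
    expected to verify UUIDs in EPrints and clear completed transfers
    in the Archivematica dashboard, as described above.
    """
    batches = 0
    remaining = total_items
    while remaining > 0:
        limit = min(batch_size, remaining)
        run_export(limit)  # e.g. wraps the bin script with --limit
        remaining -= limit
        batches += 1
    return batches
```

For the 20,000-item initial export in batches of 500, this comes to 40 manual rounds, which is exactly the repeated intervention the directory-structure suggestion tries to avoid.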

geo-mac commented 2 years ago

Cheers Tomasz. The limit parameter is very helpful indeed. :-) This suggestion is just that: something for the future!

To be honest, I think the issue is more acute for repositories connected to a CRIS (i.e. Pure). Pure performs so many updates to individual eprints that every week many thousands of items are 'touched'. It is difficult to know whether some of these touches are significant, or to distinguish them from updates initiated by team members, so re-processing them all is the only safe course of action (even with the strictest export triggers imposed). And, of course, processing quickly is necessary before they are touched again. In these instances it would be simpler to export and then re-process everything, because repeated intervention is necessary to process in batches. Modifying the directory structure could mean that the job could be added to the crontab and intervention could be minimized. Something to ruminate on!

photomedia commented 2 years ago

Ok, I understand. Let's keep this open for comments for a while. My impression is that this change would be very difficult to implement. My thoughts as to why are the following:

1) Are you sure that combining the AIPs into one "batch AIP" would actually resolve the issue that you're describing?
2) Would switching the trigger to fileinfo only resolve the issue? Ultimately, Archivematica preservation is most appropriate for digital objects (files), so if those are not changing, and only some metadata about them is changing, it might be more appropriate not to re-export for metadata changes, especially if they are frequent.
3) Would we send back, with the callback from the Archivematica storage controller, the same UUID for each Archivematica object in the batch? Or would we have to change the whole 1-1 structure between Archivematica objects and EPrints to make it a 1-many structure?
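The third question can be illustrated with a hedged sketch of the two mapping shapes; the field names and placeholder UUIDs are assumptions, not the plugin's actual schema:

```python
from typing import Dict, List

# Current model (assumed): each EPrint AID maps to exactly one
# Archivematica AIP UUID returned via the callback.
one_to_one: Dict[int, str] = {2471: "aip-uuid-a"}

# Hypothetical batch model: one AIP UUID shared by many AIDs, so the
# callback would either repeat the same UUID for every item in the
# batch, or EPrints would need a separate batch record.
one_to_many: Dict[str, List[int]] = {"aip-uuid-a": [2471, 2472, 2473]}

def invert_batch(batches: Dict[str, List[int]]) -> Dict[int, str]:
    """Derive the per-item view (AID -> shared batch UUID)."""
    return {aid: uuid for uuid, aids in batches.items() for aid in aids}
```

Either way, every AID in a batch ends up pointing at the same UUID, which is the loss of specificity discussed below.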

geo-mac commented 2 years ago

> My impression is that this change would be very difficult to implement.

Yes, I agree. It occurred to me too that -- even if it could be implemented -- it would cause problems for repositories using the existing plugin and directory structure. It might not be worth going there!

> Would switching the trigger to fileinfo only resolve the issue?

It certainly helps -- this is currently the only trigger we have enabled. But it doesn't completely resolve the issue, because Pure's interaction with repositories (including DSpace) is very primitive. Interactions use Elsevier's proprietary connector rather than SWORD. Pure has dozens of cron jobs, some of which initiate a write to EPrints, and this sometimes includes over-writing a file even when the file has not changed. But, to EPrints, it appears as if there has been a file change. 👎 Things would be a lot easier if Pure just used SWORD like any normal system.

> Are you sure that combining the AIPs into one "batch AIP" would actually resolve the issue that you're describing?

I'm not sure, but I am hoping so! :-) It seems to be working so far, and I can report back soon with the benefit of further testing.

> Would we send back with the callback from Archivematica storage controller the same UUID for each archivematica object in the batch? Or would we have to change the whole 1-1 structure between Archivematica objects and EPrints to make these a 1-many structure?

This question is in relation to a potential change to the directory structure, yes? If so, my instinct says the structure would have to change, because a lack of UUID specificity would be suboptimal. But I guess this is another reason for us to conclude, 'Here be dragons!' ;-)