eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

Archivematica Sending Information Back to EPrints #10

Closed photomedia closed 4 years ago

photomedia commented 5 years ago

This issue is for the question of what, if any, information should be sent back to EPrints. Would it be sufficiently useful to have an Archivematica ID (UUID?) sent back to EPrints whenever Archivematica processes it for preservation? Is this information necessary to have in EPrints? Is it worth the development effort to add this to the plugin? If so, would this information be sent back to EPrints using a SWORD call? What are the limitations/capabilities of Archivematica in this respect?

tw4l commented 5 years ago

Rationale

It would be very useful from a management and quality assurance perspective to be able to confirm, all from the same management screen, that an EPrint was successfully exported, that Archivematica picked up the transfer, and that Archivematica successfully created and stored an AIP (Archival Information Package) for the transfer. This would remove the need to visit two different systems to confirm that the overall workflow succeeded.

Implementation

I see two possible methods of implementing in Archivematica:

  1. The Archivematica Storage Service application has in-built functionality to make REST calls to external services following certain actions (e.g. successfully storing an AIP). The Archidora Archivematica-Islandora integration, for example, makes use of this functionality to trigger actions in Islandora following the AIP storage event.
  2. We could use Archivematica's automation-tools framework - which we are already planning to use in conjunction with this plugin - to send a POST request to EPrints with the relevant information following confirmation that a package has successfully been stored. This is a similar approach to the one that University of York (UK) has taken to update their researchdatayork application from Archivematica (see blog post and the status.py script they're using to accomplish this).

Either approach would require a REST endpoint in EPrints that the Archivematica Storage Service would make a request to, which would in turn update the relevant row in the archivematica table in EPrints with the AIP's UUID and/or a "success" boolean value.

Further investigation and thinking about the approaches is needed.

Storage Service documentation links

tw4l commented 5 years ago

From the perspective of the rationale listed above, writing the AIP UUID from Archivematica back and a "success" status indicator to the appropriate row in EPrints would be sufficient.

The question of whether it's worth the development effort is still very open. It would be nice to hear the perspectives of other potential users of this plugin.

bdgregg commented 5 years ago

Tim,

I would like to throw in my support for the integration of EPrints and Archivematica. We use EPrints here to handle our Electronic Theses and Dissertations (ETDs) and would like to have those transferred either individually or in an ongoing fashion to the Archivematica system. So from EPrints -> Archivematica would be a very nice add-on to our institutional repository.

I could easily envision:

  1. Add a link/button in EPrints that would send the EPrint to Archivematica as a one-off, and then store the resulting UUID in EPrints.
  2. Add information in the EPrints record that the EPrint has been stored in Archivematica, and maybe provide a link to the object in Archivematica (based upon the above UUID).
  3. Add a process that, given some search criteria, can confirm which content in the archive has been submitted to Archivematica (for verification purposes), making it possible to determine what has been stored in Archivematica and what has not. Thinking of a report here.
  4. An optional nightly cron job that would push all newly created objects on a daily basis into Archivematica (maybe limit on the type(s) of eprints to archive as well).
  5. Add to the EPrints export to Archivematica some metadata that indicates what archive the EPrint came from (something that can be searched on in Archivematica).

We are just looking at Archivematica here and are looking at how to tie it into our systems here and your concept hit spot on for us.

-Brian Gregg

Brian D. Gregg Solutions Architect University of Pittsburgh. University Library System.


geo-mac commented 5 years ago

@timothyryanwalsh

Either approach would require a REST endpoint in EPrints that the Archivematica Storage Service would make a request to, which would in turn update the relevant row in the archivematica table in EPrints with the AIP's UUID

A RESTful endpoint has been supported since EPrints 3.2. I have never used it but, based on threads from the EP-tech mailing list, I have heard - brace yourself - that it isn't well documented! This historic blog post from the DepositMOre project describes what is possible, though, in conjunction with SWORD2. CRUD is definitely supported, but one would assume that more is possible in 2019, and there are certainly institutions performing sophisticated m2m interactions with the EPrints REST endpoint, e.g. Caltech.

Would it be sufficiently useful to have an Archivematica ID (UUID?) sent back to EPrints whenever Archivematica processes it for preservation? @photomedia

I am inclined to say that it would be sufficient to have the Archivematica UUID in EPrints for each corresponding eprint. From this alone a lot can be inferred, e.g. receipt of an Archivematica UUID is cognate to confirmation that the AIP was stored, a link to the object/AIP in Archivematica can be constructed from the UUID, etc.

photomedia commented 5 years ago

Based on this discussion, I am adding a section to the README specifying that we need Archivematica to send an Archivematica UUID back to EPrints once it has processed the item, using option 1 (the Archivematica Storage Service's in-built functionality for making REST calls). I think the preference would be to use a RESTful endpoint in EPrints.

tw4l commented 5 years ago

After looking into this a bit further, I've confirmed that in Archivematica Storage Service 0.15+, we can add a post-store callback that would send a GET, POST, PUT, or PATCH request to EPrints containing the AIP's UUID in the body of the request. The URI that this is sent to is configurable.

The question I'm facing is: how will the Archivematica Storage Service know what the EPrints Archivematica dataset ID is, so that we can construct the right URI for the API call? Presumably we would have to include it with the transfer in some way and then figure out a way to make that information available to the Storage Service.

tw4l commented 5 years ago

Screenshot to show the Edit Callback screen from the Archivematica Storage Service (qa/0.x branch as of yesterday):

[Image: AM_SS_Callback_Setup]

photomedia commented 4 years ago

The question I'm facing is: how will the Archivematica Storage Service know what the EPrints Archivematica dataset ID is, so that we can construct the right URI for the API call? Presumably we would have to include it with the transfer in some way and then figure out a way to make that information available to the Storage Service.

Yes, we will need to pass a field in the AIP metadata, which will contain the EPrints Archivematica dataset ID. The callback will return this in the body, along with the Archivematica "AIP UUID":

Body: {
  'AIP UUID': '',       # this is the Archivematica UUID for the AIP
  'AIP SOURCE_ID': ''   # this is the EPrints Archivematica dataset ID passed from EPrints
}

tw4l commented 4 years ago

From my understanding of the link on the EPrints REST endpoint that @geo-mac shared above, I think the simplest implementation for EPrints would be to use the EPrints Archivematica dataset ID (which uniquely identifies the row in the Archivematica table in EPrints that will store the AIP UUID and associate it to the correct EPrint) to construct the URI, and pass only the AIP UUID in the body. That way the API call from Archivematica would update the appropriate resource directly, without requiring additional logic/programming on the EPrints side.

We could instead pass the EPrints ID in the URI or body, and then add logic to EPrints to have it go find the appropriate row in the Archivematica dataset and update the UUID, as @photomedia suggests above.

Probably to figure out which is a better option for us, we need more clarity on:

@photomedia - Would you mind looking into the second question, particularly around if there is updated documentation available? Is this up to date/accurate?

photomedia commented 4 years ago

We could instead pass the EPrints ID in the URI or body, and then add logic to EPrints to have it go find the appropriate row in the Archivematica dataset and update the UUID, as @photomedia suggests above.

@timothyryanwalsh , Actually, I suggested that the EPrints Archivematica dataset ID is used directly rather than the EPrint ID.

tw4l commented 4 years ago

Yes, you did! My mistake!

photomedia commented 4 years ago

What exactly the SWORD API endpoints available to us in EPrints are

@photomedia - Would you mind looking into the second question, particularly around if there is updated documentation available? Is this up to date/accurate?

I think that @wfyson would be the best person to comment on that. Will, could you please let us know what would be the preferred way that Archivematica would send the callback request to have the Archivematica Dataset updated in EPrints with the UUID of the processed item? Should we make a CRUD request as described by the documentation here http://wiki.eprints.org/w/API:EPrints/Apache/CRUD ?

wfyson commented 4 years ago

Hi @photomedia @timothyryanwalsh,

Apologies for the delay in getting back to you about this! A CRUD request as documented at the link above would be the best way to go about updating an existing record of the new Archivematica dataset that this EPrints plugin would introduce.

To do this you'd need to know the ID of the Archivematica record (i.e. the unique ID that EPrints stores for the record, not the UUID) and then we can use either a default EPrints import plugin (like the XML plugin) or a custom import plugin we could develop as part of this work to PUT an update to the archivematica record.

Expressed as a curl command it would look something like this:

curl -v -H "Content-Type: application/vnd.eprints.data+xml;" -X PUT --data-binary "@/path/to/data.xml" -u <username>:<password> http://myrepository.org/id/archivematica/<id>

We'd also need to add a UUID field to the new archivematica dataset as defined in https://github.com/eprintsug/EPrintsArchivematica/blob/master/lib/plugins/EPrints/DataObj/Archivematica.pm but that shouldn't be a problem.

I hope this helps answer your question! Let me know if there's any more information you need!

photomedia commented 4 years ago

Thank you, @wfyson

To do this you'd need to know the ID of the Archivematica record (i.e. the unique ID that EPrints stores for the record, not the UUID)

Yes, so after some discussion with Archivematica / Artefactual, the simplest way for Archivematica to include the EPrints Archivematica Dataset ID for the AIP in its callback is to have that ID included in the filename of the AIP that is passed. Archivematica would send back the full filename in the callback. Therefore, I will change the spec of this export plugin to include it. Currently, the filename is:

repositoryid-eprintid-lastmoddate

I will change this in the spec to:

repositoryid-eprintid-lastmoddate---EPrintsArchivematicaDatasetID
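For illustration, here is a minimal Python sketch of how a receiver could recover the dataset ID from a name in this format. The function name `split_transfer_name` is hypothetical and not part of the plugin or of Archivematica; the only thing taken from the spec is the `---` separator in front of the ID.

```python
def split_transfer_name(name: str) -> tuple[str, str]:
    """Split 'repositoryid-eprintid-lastmoddate---DatasetID' into
    (prefix, dataset_id), using the last '---' as the separator."""
    prefix, sep, dataset_id = name.rpartition("---")
    if not sep or not dataset_id:
        # No '---' found, or nothing after it: the name does not follow the spec.
        raise ValueError(f"no '---' separator with a trailing ID in {name!r}")
    return prefix, dataset_id
```

Splitting on the last `---` rather than on single hyphens means the repository ID or date may themselves contain hyphens without breaking the parse.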

https://github.com/eprintsug/EPrintsArchivematica/commit/0a8676a30a7d75f9afe023a6ccf68f6096f1f333

https://github.com/eprintsug/EPrintsArchivematica/commit/2a8e0b9dad709da3ecbed9923297c0e8f9d45fac

photomedia commented 4 years ago

Following discussions with @wfyson and @timothyryanwalsh , we agreed that we will rename the top folder name of the exported AIP from EPrints to the EPrintsArchivematicaDatasetID itself. This will allow us to have a more generalized solution for the callback on the Archivematica side, without the need to run split operations or regular expressions on the folder name to send the ID back to the submitting system. The assumption is that the ID is the folder name. Therefore, I will close this issue, and rename the top folder filename in the spec to:

EPrintsArchivematicaDatasetID

The Archivematica callback will do the following:

curl -v -H "Content-Type: application/vnd.eprints.data+xml;" -X PUT --data-binary "@/path/to/data.xml" -u <username>:<password> http://myrepository.org/id/archivematica/<id>

where <id> = EPrintsArchivematicaDatasetID = AIP folder name and /path/to/data.xml contains the Archivematica UUID of this AIP.
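As a rough Python equivalent of the curl command above, the sketch below only builds the PUT request and does not send it. The function name `build_callback_request` and its parameters are hypothetical; the URL pattern, Content-Type, and basic authentication are taken from the curl example (without the trailing semicolon in the header value). What the body must contain is left to the caller, since the exact EPrints XML layout is not given in this thread.

```python
import base64
import urllib.request


def build_callback_request(base_url, dataset_id, body, username, password,
                           content_type="application/vnd.eprints.data+xml"):
    """Build (but do not send) a PUT to <base_url>/id/archivematica/<dataset_id>."""
    url = f"{base_url.rstrip('/')}/id/archivematica/{dataset_id}"
    # HTTP basic auth, equivalent to curl's -u <username>:<password>
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,  # bytes, e.g. the contents of data.xml
        method="PUT",
        headers={"Content-Type": content_type,
                 "Authorization": f"Basic {token}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call against a live repository.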

tw4l commented 4 years ago

Thanks @photomedia and @wfyson - this is looking great! I think we've come to a nice solution here.

Two minor notes:

  1. Let's default to JSON rather than XML as the Content-Type, as that will be easier to construct and pass along from Archivematica. Can we agree that the body of that response will be JSON in this format?:
{
  'AIP UUID': '<UUID>'
}
  2. The <id> = EPrintsArchivematicaDatasetID = transfer folder name (the AIP folder name also contains the UUID of the AIP, appended to the end of the name; I will make sure the value that is being used to construct the URI for the PUT request from the Archivematica Storage Service does not include this UUID)
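To illustrate the second note, here is a small Python sketch that strips a trailing AIP UUID from an AIP folder name to get back the transfer folder name. It assumes the UUID is joined to the name with a hyphen; the helper name `transfer_name_from_aip` is hypothetical and this is not the Storage Service's own code.

```python
import re

# A hyphen followed by a canonical 8-4-4-4-12 hex UUID at the very end of the name.
_UUID_SUFFIX = re.compile(
    r"-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def transfer_name_from_aip(aip_name: str) -> str:
    """Return the AIP folder name without its trailing '-<UUID>' suffix.

    Names that carry no UUID suffix are returned unchanged.
    """
    return _UUID_SUFFIX.sub("", aip_name)
```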

Thanks!

tw4l commented 4 years ago

We could also use just 'UUID' for the key in the JSON body, or whatever other value most cleanly matches to an appropriate name for that column in the EPrints Archivematica dataset table

photomedia commented 4 years ago

@timothyryanwalsh Yes, it doesn't have to be data.xml, it can be data.json with the AIP UUID stored as you suggest. I made the change in the spec to JSON.
I also corrected that we are talking about the transfer folder name, not the full post-Archivematica AIP folder name. Thanks!
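A minimal Python sketch of producing that JSON body follows. Strict JSON requires double-quoted strings, so the single-quoted form used in this thread is shorthand. The function name `callback_body` is hypothetical, and the key defaults to 'AIP UUID' as proposed above; it can be swapped for 'UUID' or whatever the dataset field ends up being called.

```python
import json


def callback_body(aip_uuid: str, key: str = "AIP UUID") -> bytes:
    """Serialize the callback payload as UTF-8 encoded JSON bytes."""
    return json.dumps({key: aip_uuid}).encode("utf-8")
```

The resulting bytes would be sent as the PUT body with a JSON Content-Type such as `application/json`, assuming the EPrints side has an import plugin that accepts it.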

photomedia commented 4 years ago

  1. Let's default to JSON rather than XML as the Content-Type, as that will be easier to construct and pass along from Archivematica. Can we agree that the body of that response will be JSON in this format?:
{
  'AIP UUID': '<UUID>'
}

Yes, I am adding it to the README and closing this issue.