SFULibrary / islandora_rest_ingester

Command-line tool for ingesting objects via the Islandora REST interface
The Unlicense

Add a "Bagit-friendly" option #2

Open McFateM opened 6 years ago

McFateM commented 6 years ago

In our configuration I've engaged all of the "standard" bagit plugins, except for plugin_object_archivematica_transfers. So I feel like this makes my use relatively "typical", but I want to verify that it is. The vast majority of my bags are structured like so:

/archive/grinnell_bags/Bag-grinnell_99$ ll
total 6
drwxrwxr-x 3 vagrant vagrant 1024 Oct  4 13:21 ./
drwxr-xr-x 3 root    root    1024 Oct  5 09:30 ../
-rw-rw---- 1 vagrant vagrant  291 Oct  4 09:12 bag-info.txt
-rw-rw---- 1 vagrant vagrant   55 Oct  4 09:12 bagit.txt
drwxrwx--- 2 vagrant vagrant 1024 Oct  4 09:12 data/
-rw-rw---- 1 vagrant vagrant  866 Oct  4 09:12 manifest-sha1.txt
-rw-rw---- 1 vagrant vagrant  164 Oct  4 09:12 tagmanifest-sha1.txt

...and...

/archive/grinnell_bags/Bag-grinnell_99/data$ ll
total 377
drwxrwx--- 2 vagrant vagrant   1024 Oct  4 09:12 ./
drwxrwxr-x 3 vagrant vagrant   1024 Oct  4 13:21 ../
-rw-rw---- 1 vagrant vagrant   2791 Oct  4 09:12 ADMIN_COVERSHEET.html
-rw-rw---- 1 vagrant vagrant   2791 Oct  4 09:12 COVERSHEET.html
-rw-rw---- 1 vagrant vagrant   1663 Oct  4 09:12 DC.bin
-rw-rw---- 1 vagrant vagrant    425 Oct  4 09:12 foo.xml
-rw-rw---- 1 vagrant vagrant 206709 Oct  4 09:12 foxml.xml
-rw-rw---- 1 vagrant vagrant  18535 Oct  4 09:12 FULL_TEXT.txt
-rw-rw---- 1 vagrant vagrant   2683 Oct  4 09:12 MODS-2015.Apr.22.bin
-rw-rw---- 1 vagrant vagrant   2803 Oct  4 09:12 MODS.bin
-rw-rw---- 1 vagrant vagrant  44099 Oct  4 09:12 OBJ.pdf
-rw-rw---- 1 vagrant vagrant  13387 Oct  4 09:12 POLICY.bin
-rw-rw---- 1 vagrant vagrant  44709 Oct  4 09:12 PREMIS.xml
-rw-rw---- 1 vagrant vagrant  28436 Oct  4 09:12 PREVIEW.jpg
-rw-rw---- 1 vagrant vagrant   1186 Oct  4 09:12 RELS-EXT.rdf
-rw-rw---- 1 vagrant vagrant   8218 Oct  4 09:12 TN.jpg
-rw-rw---- 1 vagrant vagrant      9 Oct  4 09:12 WF.txt

Note above that MODS gets an extension of .bin, as do some other datastreams. Some XML datastreams consistently come through as .xml. Is that normal for Islandora Bagit with the standard plugins?

So I've taken baby-steps in the code thus far so that a -bf (bagit-friendly) flag will automatically look for MODS.bin instead of MODS.xml, and then rename it to MODS.xml so that it processes as it should. In addition, I found it necessary to "skip" some of the derivative datastreams to avoid duplication which generally kills the process prematurely. These currently include 'DC', 'RELS-EXT', 'PREVIEW', 'TN', 'foo', and 'foxml'.

If I take these steps I'm able to "restore" an object from its bag without issue. The question is: should I be taking steps like this with a REST ingester option, or would it be better to build a new Islandora Bagit plugin, or implement a post-processing hook there, to prepare a bag that can be REST ingested without significant modification?
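The bagit-friendly steps described above (rename MODS.bin to MODS.xml, skip the listed datastreams) can be sketched roughly as follows. This is a Python illustration only, not the ingester's actual PHP code; the function name `plan_bag_files` and its `xml_dsids` parameter are hypothetical, and the skip list is simply the one quoted in this comment.

```python
from pathlib import PurePosixPath

# Datastream IDs the comment above reports skipping on ingest.
SKIP_DSIDS = {"DC", "RELS-EXT", "PREVIEW", "TN", "foo", "foxml"}


def plan_bag_files(filenames, xml_dsids=("MODS",)):
    """Return (source_name, ingest_name) pairs for the files to ingest.

    Hypothetical helper: files whose stem is in SKIP_DSIDS are dropped,
    and a .bin file whose stem is listed in xml_dsids is ingested as .xml.
    """
    plan = []
    for name in sorted(filenames):
        path = PurePosixPath(name)
        if path.stem in SKIP_DSIDS:
            continue  # skipped to avoid "Datastream already exists" conflicts
        target = name
        if path.suffix == ".bin" and path.stem in xml_dsids:
            target = path.with_suffix(".xml").name
        plan.append((name, target))
    return plan


if __name__ == "__main__":
    bag = ["DC.bin", "MODS.bin", "OBJ.pdf", "foxml.xml", "TN.jpg", "POLICY.bin"]
    for source, target in plan_bag_files(bag):
        print(source, "->", target)
```

Keeping the skip list and the rename rule in one place would make it straightforward to move the same logic into a Bagit plugin or post-processing hook later, if that turns out to be the better home for it.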

whikloj commented 6 years ago

@McFateM Are you using the islandora_bagit module? Also what mimetype do your MODS datastreams have in Fedora?

McFateM commented 6 years ago

@whikloj Yes, I'm using islandora_bagit. MODS of the sample object, grinnell:99, has a MIME type of text/xml, but many of our newer objects have MODS of type application/xml, I believe. And I think you already hinted at the clue I found about an hour ago... datastreams with MIME type of application/xml create *.xml files in bagit. That's fine, but I sometimes change the MIME type to text/xml, matching the "older" specification, so that I can open and modify the MODS in the Fedora Web Administrator.

whikloj commented 6 years ago

Ok so that is retrieved here: https://github.com/Islandora/islandora_bagit/blob/7.x/islandora_bagit.module#L1033 and comes from the /etc/mime.types file on the server.

So I would open that up and look for text/xml and see if there is xml beside it. If not you can try adding it. 😄

McFateM commented 6 years ago

Thanks @whikloj. text/xml was not in my /etc/mime.types file, so I followed the instructions there and created a .mime.types file in my home directory. I'll give it a try later to see if Bagit picks up on the new text/xml = xml directive.

McFateM commented 6 years ago

Following up here to be complete... Adding a .mime.types file to my home directory didn't work, perhaps because the account running Bagit isn't mine? Anyway, I added "text/xml xml" directly to /etc/mime.types, rebooted the server, and voila! All of my text/xml datastreams now bag as .xml rather than .bin files.
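The effect of that one-line change can be illustrated with a small lookup over mime.types-style text. This is a Python sketch, not the islandora_bagit code; the helper names are hypothetical, and the "bin" fallback is an assumption inferred from the .bin filenames seen earlier in this thread.

```python
def parse_mime_types(text):
    """Map each MIME type to its list of extensions from mime.types-style text."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        mime, *extensions = line.split()
        mapping.setdefault(mime, []).extend(extensions)
    return mapping


def extension_for(mime, mapping, fallback="bin"):
    """Return the first extension registered for mime, else the fallback."""
    extensions = mapping.get(mime)
    return extensions[0] if extensions else fallback


if __name__ == "__main__":
    before = "application/xml\txml xsl xsd\n"
    after = before + "text/xml\txml\n"
    print(extension_for("text/xml", parse_mime_types(before)))
    print(extension_for("text/xml", parse_mime_types(after)))
```

With no text/xml entry the lookup falls through to the fallback; once the entry exists, the same datastream gets an .xml filename.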

mjordan commented 6 years ago

Sorry to be absent from this, I was on the road (to be accurate, in an airplane :face_with_head_bandage:)

@McFateM, would the change in Islandora from assigning text/xml to some datastreams in your repo and application/xml to others correspond to the mimetype mapping redefined in January 2015 (7.x-1.4) as described in https://jira.duraspace.org/browse/ISLANDORA-1045?

At the last committers call we discussed https://jira.duraspace.org/browse/ISLANDORA-1612, and some of us had suggested in IRC that creating a drupal_alter() to define custom, per-module MIME type mappings would be a possible solution. I'd say that the case you're hitting is a perfect example.

Pinging @DiegoPino and @rosiel since they were part of the IRC discussion. I'm happy to open a new ticket and ping all committers.

DiegoPino commented 6 years ago

see https://github.com/Islandora/islandora_solution_pack_video/pull/139/files This one just got merged. Simple solution, in that case, it was just about changing the order.

It seems safe to assume that, even after our normalization changes, people still have datastreams defined as text/xml. That makes sense, because https://jira.duraspace.org/browse/ISLANDORA-1045 suggested a change only for new datastreams.

I feel that in this case we really want to normalize everything to application/xml rather than adapt to legacy text/xml?

At https://github.com/Islandora/islandora_bagit/blob/6ef89baa2e3d5686d5568f2d3520b9559e684495/islandora_bagit.module#L1028 we could get the MIME type from the DS directly, force-map all text/xml to application/xml, and then do the actual MIME detection. It would be a regression exception that might be easier to manage. Also, maybe I'm speaking nonsense 😬
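A rough sketch of that force-mapping idea, in Python rather than the module's PHP: normalize the datastream's MIME type first, then hand the result to whatever extension lookup follows. The names `LEGACY_MIME_ALIASES` and `normalize_mime` are hypothetical and are not part of islandora_bagit.

```python
# Legacy MIME types that should be treated as their normalized equivalent.
LEGACY_MIME_ALIASES = {"text/xml": "application/xml"}


def normalize_mime(mime):
    """Strip parameters, lowercase, and map legacy types to the normalized one."""
    key = mime.split(";", 1)[0].strip().lower()
    return LEGACY_MIME_ALIASES.get(key, key)


if __name__ == "__main__":
    for mime in ("text/xml", "application/xml", "image/jpeg"):
        print(mime, "->", normalize_mime(mime))
```

The stored MIME type on the datastream is left untouched in this sketch; only the value used for choosing a file extension is normalized, so datastreams kept as text/xml for editing would still bag with an .xml extension.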

DiegoPino commented 6 years ago

Another option for the bag it module:

Instead of https://github.com/Islandora/islandora_bagit/blob/6ef89baa2e3d5686d5568f2d3520b9559e684495/islandora_bagit.module#L1026-L1063

Let's do it like https://github.com/Islandora/islandora/blob/7.x/includes/datastream.inc#L53-L67

$extension = '.' . islandora_get_extension_for_mimetype($datastream->mimetype);
// Prevent adding on a duplicate extension.
$label = $datastream->label;
$extension_length = strlen($extension);
$duplicate_extension_position = strlen($label) > $extension_length ?
  strripos($label, $extension, -$extension_length) :
  FALSE;
$filename = $label;
if ($duplicate_extension_position === FALSE) {
  $filename .= $extension;
}
header("Content-Disposition: attachment; filename=\"$filename\"");

McFateM commented 6 years ago

Confirming that the text/xml to application/xml shift in my case did correspond to the January 2015 changes. This is straying a bit off-topic perhaps, but I still like the text/xml type simply because those datastreams can be edited in the Fedora Web Admin interface (at least in my experience), whereas application/xml datastreams cannot.

mjordan commented 6 years ago

@DiegoPino I didn't know we already had hook_file_mimetype_mapping_alter(). Anyway, just going into a meeting, will rejoin the convo later.

mjordan commented 6 years ago

@McFateM in your initial post above you wrote:

I found it necessary to "skip" some of the derivative datastreams to avoid duplication which generally kills the process prematurely. These currently include 'DC', 'RELS-EXT', 'PREVIEW', 'TN', 'foo', and 'foxml'

Can you elaborate on this? Which process were you referring to: the one running the REST ingester script, or the one on the web server side? I might want you to open a separate issue to address this behavior, but please let me know more before we decide on that.

McFateM commented 6 years ago

My bags contain a whole host of data streams, including derivatives. When I ingest such an object the process returns an error on the first data stream that “already exists”, presumably because the ingest is generating some of these as derivatives.

I am not at a computer right now, I am at a football game selling tickets, but when I get back to a computer I will give this a try and send more details.

McFateM commented 6 years ago

Just documenting behavior here…

If I attempt to ingest an object (PID) that already exists I get this, as expected.

vagrant@dgadmin:/archive/grinnell_bags$ php ~/islandora_rest_ingester/ingest.php -l mylog.log -m islandora:sp_pdf -p grinnell:test -n test:12345 -o "System Admin" -t "xxxxxxxxxxx" -u "System Admin" /archive/grinnell_bags/Bag-grinnell_99

HTTP/1.1 500 Internal Server Error
Date: Sat, 07 Oct 2017 14:50:21 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Content-Type-Options: nosniff, nosniff
X-Powered-By: PHP/5.5.9-1ubuntu4.22
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate
Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=-vSXMQ2HjWNMZlRHth9BKm5wZqQLCxFvhtivFvF4XLw; expires=Mon, 30-Oct-2017 18:23:41 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly
Content-Length: 14
Connection: close
Content-Type: application/json; utf-8

{"message":""}

…and the log contains…

[2017-10-07 09:50:21] Ingest via REST.INFO: ingest.php (endpoint http://localhost/islandora/rest/v1) started at October 7, 2017, 9:50 am [] []

[2017-10-07 09:50:21] Ingest via REST.INFO: ingest.php running in bagit-friendly (-b) mode. [] []

[2017-10-07 09:50:21] Ingest via REST.ERROR: POST /islandora/rest/v1/object HTTP/1.1 User-Agent: GuzzleHttp/6.2.1 curl/7.35.0 PHP/5.5.9-1ubuntu4.22 Content-Type: application/x-www-form-urlencoded Host: localhost Accept: application/json X-Authorization-User: System Admin:xxxxxxxxx namespace=test%3A12345&owner=System+Admin&label=Violence+Causes+Hunger+in+Guatemala%3A+From+Coups+to+CAFTA [] []

[2017-10-07 09:50:21] Ingest via REST.ERROR: HTTP/1.1 500 Internal Server Error Date: Sat, 07 Oct 2017 14:50:21 GMT Server: Apache/2.4.7 (Ubuntu) X-Content-Type-Options: nosniff, nosniff X-Powered-By: PHP/5.5.9-1ubuntu4.22 X-Drupal-Cache: MISS Expires: Sun, 19 Nov 1978 05:00:00 GMT Cache-Control: no-cache, must-revalidate Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=-vSXMQ2HjWNMZlRHth9BKm5wZqQLCxFvhtivFvF4XLw; expires=Mon, 30-Oct-2017 18:23:41 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly Content-Length: 14 Connection: close Content-Type: application/json; utf-8 {"message":""} [] []

If I attempt to ingest the same content to a new, non-existent object without the -b (Bagit-friendly) option I get this output and log...

vagrant@dgadmin:/archive/grinnell_bags$ php ~/islandora_rest_ingester/ingest.php -l mylog.log -m islandora:sp_pdf -p grinnell:test -n test -o "System Admin" -t "xxxxxxxx" -u "System Admin" /archive/grinnell_bags/Bag-grinnell_99

HTTP/1.1 409 Conflict
Date: Sat, 07 Oct 2017 15:02:35 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Content-Type-Options: nosniff, nosniff
X-Powered-By: PHP/5.5.9-1ubuntu4.22
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate
Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=6rxJ3PoM8ub_LHo4xyoCeLqXuAIURIXkKIxaVaD6pXc; expires=Mon, 30-Oct-2017 18:35:55 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly
Content-Length: 49
Content-Type: application/json; utf-8

{"message":"Conflict: Datastream already exists"}

[2017-10-07 10:02:32] Ingest via REST.WARNING: /archive/grinnell_bags/Bag-grinnell_99/bagit.txt appears to be empty, skipping. [] []

[2017-10-07 10:02:32] Ingest via REST.WARNING: /archive/grinnell_bags/Bag-grinnell_99/tagmanifest-sha1.txt appears to be empty, skipping. [] []

[2017-10-07 10:02:32] Ingest via REST.WARNING: /archive/grinnell_bags/Bag-grinnell_99/bag-info.txt appears to be empty, skipping. [] []

[2017-10-07 10:02:32] Ingest via REST.INFO: Object test:22580 ingested from /archive/grinnell_bags/Bag-grinnell_99/data [] []

[2017-10-07 10:02:34] Ingest via REST.INFO: Object test:22580 datastream ADMIN_COVERSHEET ingested from /archive/grinnell_bags/Bag-grinnell_99/data/ADMIN_COVERSHEET.html [] []

[2017-10-07 10:02:34] Ingest via REST.INFO: SHA-1 checksum for object test:22580 datastream ADMIN_COVERSHEET verified. [] []

[2017-10-07 10:02:35] Ingest via REST.INFO: Object test:22580 datastream COVERSHEET ingested from /archive/grinnell_bags/Bag-grinnell_99/data/COVERSHEET.html [] []

[2017-10-07 10:02:35] Ingest via REST.INFO: SHA-1 checksum for object test:22580 datastream COVERSHEET verified. [] []

[2017-10-07 10:02:35] Ingest via REST.ERROR: POST /islandora/rest/v1/object/test:22580/datastream HTTP/1.1 User-Agent: GuzzleHttp/6.2.1 curl/7.35.0 PHP/5.5.9-1ubuntu4.22 Content-Type: multipart/form-data; boundary=cc6ce0257e1d7c751046cbceb3172c6c6016575a Host: localhost Accept: application/json X-Authorization-User: System Admin:xxxxxxx --cc6ce0257e1d7c751046cbceb3172c6c6016575a Content-Disposition: form-data; name="file"; filename="DC.xml" Content-Length: 1663 Content-Type: application/xml Violence Causes Hunger in Guatemala: From Coups to CAFTA</dc:title> alternative: From Coups to CAFTA</dc:title> Coups d'état</dc:subject> Guatemala</dc:subject> CAFTA (Free trade agreement) 2005</dc:subject> </dc:subject> Leah Lucas' submission to the 2012 Peace Studies Student Conference</dc:description> Grinnell College</dc:publisher> Lucas, Leah (author)</dc:contributor> Grinnell College. Peace Studies Program (supporting host)</dc:contributor> 2012-03</dc:date> extent = 9 pages</dc:format> internetMediaType = pdf</dc:format> grinnell:99</dc:identifier> http://hdl.handle.net/11084/99</dc:identifier> English</dc:language> isPartOf: Peace Studies Student Conference</dc:relation> isPartOf: Social Justice at Grinnell</dc:relation> isPartOf: Digital Grinnell</dc:relation> Guatemala</dc:coverage> Copyright to this work is held by the author(s), in accordance with United States copyright law (USC 17). Readers of this work have certain rights as defined by the law, including but not limited to fair use (17 USC 107 et seq.).</dc:rights> </oai_dc:dc> --cc6ce0257e1d7c751046cbceb3172c6c6016575a Content-Disposition: form-data; name="dsid" Content-Length: 2 DC --cc6ce0257e1d7c751046cbceb3172c6c6016575a Content-Disposition: form-data; name="checksumType" Content-Length: 5 SHA-1 --cc6ce0257e1d7c751046cbceb3172c6c6016575a-- [] []

[2017-10-07 10:02:35] Ingest via REST.ERROR: HTTP/1.1 409 Conflict Date: Sat, 07 Oct 2017 15:02:35 GMT Server: Apache/2.4.7 (Ubuntu) X-Content-Type-Options: nosniff, nosniff X-Powered-By: PHP/5.5.9-1ubuntu4.22 X-Drupal-Cache: MISS Expires: Sun, 19 Nov 1978 05:00:00 GMT Cache-Control: no-cache, must-revalidate Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=6rxJ3PoM8ub_LHo4xyoCeLqXuAIURIXkKIxaVaD6pXc; expires=Mon, 30-Oct-2017 18:35:55 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly Content-Length: 49 Content-Type: application/json; utf-8 {"message":"Conflict: Datastream already exists"} [] []

mjordan commented 6 years ago

@McFateM In REST conventions, replacing or updating an existing object would require an HTTP PUT, whereas the Ingester currently only supports POSTing new objects and datastreams. We'll need to do a bit of development to build in the ability to update/replace object properties and datastreams. This is outside the scope of this issue, so I'll open a new one where we can hash out use cases. How does that sound? Of course, restoring an object from its Bag would be an important use case.
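One way the POST-versus-PUT distinction could play out is sketched below in Python. The Ingester does not do this today. The POST path is the one visible in the log above; the PUT path with a trailing datastream ID is an assumption about the REST interface, and `plan_datastream_requests` is a hypothetical name.

```python
def plan_datastream_requests(pid, existing_dsids, incoming_dsids,
                             base="/islandora/rest/v1"):
    """Choose an HTTP method and path for each incoming datastream.

    Datastreams already on the object get a PUT (update/replace);
    new ones get a POST (create). The PUT path shape is assumed.
    """
    existing = set(existing_dsids)
    requests = []
    for dsid in incoming_dsids:
        if dsid in existing:
            requests.append(("PUT", f"{base}/object/{pid}/datastream/{dsid}"))
        else:
            requests.append(("POST", f"{base}/object/{pid}/datastream"))
    return requests


if __name__ == "__main__":
    plan = plan_datastream_requests(
        "test:22580", ["DC", "RELS-EXT"], ["COVERSHEET", "DC"]
    )
    for method, path in plan:
        print(method, path)
```

Under this scheme the DC datastream that triggered the 409 above would be sent as an update instead of a second create.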

mjordan commented 6 years ago

Duh, #3 is that new issue. Sorry, I've only had one cup of coffee so far this morning.