Open McFateM opened 7 years ago
@McFateM Are you using the islandora_bagit module? Also what mimetype do your MODS datastreams have in Fedora?
@whikloj Yes, I'm using islandora_bagit. MODS of the sample object, grinnell:99, has a MIME type of text/xml, but many of our newer objects have MODS of type application/xml, I believe. And I think you already hinted at the clue I found about an hour ago... datastreams with MIME type of application/xml create *.xml files in bagit. That's fine, but I sometimes change the MIME type to text/xml, matching the "older" specification, so that I can open and modify the MODS in the Fedora Web Administrator.
OK, so that is retrieved here: https://github.com/Islandora/islandora_bagit/blob/7.x/islandora_bagit.module#L1033 and comes from the /etc/mime.types file on the server.
So I would open that up and look for text/xml and see if there is xml beside it. If not, you can try adding it. 😄
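For anyone checking this by hand, the lookup described above can be sketched roughly as follows. This is a minimal illustration in Python (the module itself is PHP), assuming the usual mime.types layout of one MIME type per line followed by zero or more whitespace-separated extensions. The helper name `extension_for_mimetype` and the sample text are hypothetical, and the "bin" fallback only mirrors the behavior reported in this thread, not necessarily the module's actual code path.

```python
def extension_for_mimetype(mime_types_text, mimetype, default="bin"):
    """Return the first extension listed for `mimetype`, or `default` if none is listed."""
    for line in mime_types_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        fields = line.split()
        # A line with only a MIME type and no extension cannot supply a mapping.
        if fields[0].lower() == mimetype.lower() and len(fields) > 1:
            return fields[1]
    return default


# Hypothetical excerpt of a mime.types file where text/xml has no extension.
sample = """\
# MIME type       extensions
application/xml   xml xsl xsd
text/xml
"""

print(extension_for_mimetype(sample, "application/xml"))
print(extension_for_mimetype(sample, "text/xml"))
# Appending the suggested "text/xml xml" line changes the result for text/xml.
print(extension_for_mimetype(sample + "text/xml xml\n", "text/xml"))
```

With the sample text, text/xml falls through to the default until a line pairing it with xml is added, which matches the symptom and the fix discussed here.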
Thanks @whikloj. text/xml was not in my /etc/mime.types file, so I followed the instructions there and created a .mime.types file in my home directory. I'll give it a try later to see if Bagit picks up on the new text/xml = xml directive.
Following up here to be complete... Adding a .mime.types file to my home directory didn't work, perhaps because the account running Bagit isn't mine? Anyway, I added "text/xml xml" directly to /etc/mime.types, rebooted the server, and voila! All of my text/xml datastreams now bag as .xml rather than .bin files.
Sorry to be absent from this, I was on the road (to be accurate, in an airplane :face_with_head_bandage:)
@McFateM, would the change in Islandora from assigning text/xml to some datastreams in your repo and application/xml to others correspond to the mimetype mapping redefined in January 2015 (7.x-1.4) as described in https://jira.duraspace.org/browse/ISLANDORA-1045?
At the last committers call we discussed https://jira.duraspace.org/browse/ISLANDORA-1612, and some of us had suggested in IRC that creating a drupal_alter() to define custom, per-module MIME type mappings would be a possible solution. I'd say that the case you're hitting is a perfect example.
Pinging @DiegoPino and @rosiel since they were part of the IRC discussion. I'm happy to open a new ticket and ping all committers.
see https://github.com/Islandora/islandora_solution_pack_video/pull/139/files This one just got merged. Simple solution, in that case, it was just about changing the order.
It seems we can safely assume that, even after our normalization changes, people still have datastreams defined as text/xml. That makes sense, because https://jira.duraspace.org/browse/ISLANDORA-1045 suggested a change only for new DS.
I feel that in this case we really want to normalize everything to application/xml rather than adapt to legacy text/xml?
At https://github.com/Islandora/islandora_bagit/blob/6ef89baa2e3d5686d5568f2d3520b9559e684495/islandora_bagit.module#L1028 we could get the MIME type from the DS directly, force-map all text/xml to application/xml, and then do the actual MIME detection. It would be a regression exception that would maybe be easier to manage.
Also, maybe I'm speaking nonsense 😬
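The force-mapping idea above could look something like the sketch below. It is written in Python purely for illustration (the module is PHP), and both `LEGACY_MIME_ALIASES` and `normalize_mimetype` are hypothetical names, not existing Islandora functions. The assumption is that the datastream's stored MIME type is normalized first and only the normalized value is handed to the extension lookup.

```python
# Hypothetical alias table: legacy MIME types mapped to their normalized form.
LEGACY_MIME_ALIASES = {"text/xml": "application/xml"}


def normalize_mimetype(mimetype):
    """Map a legacy MIME type to its normalized equivalent, if one is defined."""
    cleaned = mimetype.strip().lower()
    return LEGACY_MIME_ALIASES.get(cleaned, cleaned)


print(normalize_mimetype("text/xml"))
print(normalize_mimetype("application/xml"))
print(normalize_mimetype("image/jp2"))
```

Keeping the exceptions in one small table would make it easy to see, and later remove, the legacy cases.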
Another option for the Bagit module:
Let's do it like https://github.com/Islandora/islandora/blob/7.x/includes/datastream.inc#L53-L67
  $extension = '.' . islandora_get_extension_for_mimetype($datastream->mimetype);
  // Prevent adding on a duplicate extension.
  $label = $datastream->label;
  $extension_length = strlen($extension);
  $duplicate_extension_position = strlen($label) > $extension_length ?
    strripos($label, $extension, -$extension_length) :
    FALSE;
  $filename = $label;
  if ($duplicate_extension_position === FALSE) {
    $filename .= $extension;
  }
  header("Content-Disposition: attachment; filename=\"$filename\"");
} // Closes the enclosing block, whose opening line is not quoted here.
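The intent of the snippet above (append the extension unless the label already ends with it, ignoring case) can be sketched in a few lines. This is an illustrative Python version with a hypothetical function name; it mirrors the intent only and does not reproduce the exact strripos() offset semantics of the PHP code.

```python
def download_filename(label, extension):
    """Append `extension` (given without a leading dot) unless `label` already ends with it."""
    suffix = "." + extension
    # Case-insensitive comparison, in the spirit of strripos() in the PHP snippet.
    if label.lower().endswith(suffix.lower()):
        return label
    return label + suffix


print(download_filename("MODS", "xml"))
print(download_filename("MODS.XML", "xml"))
print(download_filename("report.pdf", "xml"))
```

Applied to bag creation, the same check would avoid names such as MODS.xml.xml when a datastream label already carries its extension.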
Confirming that the text/xml to application/xml shift in my case did correspond to January 2015 changes. This is straying a bit off-topic perhaps, but I still like the text/xml type simply because they can be edited in the Fedora Web Admin interface (at least in my experience) where application/xml datastreams cannot.
@DiegoPino I didn't know we already had hook_file_mimetype_mapping_alter(). Anyway, just going into a meeting, will rejoin the convo later.
@McFateM in your initial post above you wrote:
I found it necessary to "skip" some of the derivative datastreams to avoid duplication which generally kills the process prematurely. These currently include 'DC', 'RELS-EXT', 'PREVIEW', 'TN', 'foo', and 'foxml'
Can you elaborate on this? Which process were you referring to - the one running the REST ingester script, or the one on the web server side? I might want you to open a separate issue to address this behavior, but please let me know more before we decide on that.
My bags contain a whole host of data streams, including derivatives. When I ingest such an object the process returns an error on the first data stream that “already exists”, presumably because the ingest is generating some of these as derivatives.
I am not at a computer right now, I am at a football game selling tickets, but when I get back to a computer I will give this a try and send more details.
Just documenting behavior here…
If I attempt to ingest an object (PID) that already exists I get this, as expected.
vagrant@dgadmin:/archive/grinnell_bags$ php ~/islandora_rest_ingester/ingest.php -l mylog.log -m islandora:sp_pdf -p grinnell:test -n test:12345 -o "System Admin" -t "xxxxxxxxxxx" -u "System Admin" /archive/grinnell_bags/Bag-grinnell_99
HTTP/1.1 500 Internal Server Error
Date: Sat, 07 Oct 2017 14:50:21 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Content-Type-Options: nosniff, nosniff
X-Powered-By: PHP/5.5.9-1ubuntu4.22
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate
Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=-vSXMQ2HjWNMZlRHth9BKm5wZqQLCxFvhtivFvF4XLw; expires=Mon, 30-Oct-2017 18:23:41 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly
Content-Length: 14
Connection: close
Content-Type: application/json; utf-8
{"message":""}
…and the log contains…
[2017-10-07 09:50:21] Ingest via REST.INFO: ingest.php (endpoint http://localhost/islandora/rest/v1) started at October 7, 2017, 9:50 am [] []
[2017-10-07 09:50:21] Ingest via REST.INFO: ingest.php running in bagit-friendly (-b) mode. [] []
[2017-10-07 09:50:21] Ingest via REST.ERROR: POST /islandora/rest/v1/object HTTP/1.1 User-Agent: GuzzleHttp/6.2.1 curl/7.35.0 PHP/5.5.9-1ubuntu4.22 Content-Type: application/x-www-form-urlencoded Host: localhost Accept: application/json X-Authorization-User: System Admin:xxxxxxxxx namespace=test%3A12345&owner=System+Admin&label=Violence+Causes+Hunger+in+Guatemala%3A+From+Coups+to+CAFTA [] []
[2017-10-07 09:50:21] Ingest via REST.ERROR: HTTP/1.1 500 Internal Server Error Date: Sat, 07 Oct 2017 14:50:21 GMT Server: Apache/2.4.7 (Ubuntu) X-Content-Type-Options: nosniff, nosniff X-Powered-By: PHP/5.5.9-1ubuntu4.22 X-Drupal-Cache: MISS Expires: Sun, 19 Nov 1978 05:00:00 GMT Cache-Control: no-cache, must-revalidate Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=-vSXMQ2HjWNMZlRHth9BKm5wZqQLCxFvhtivFvF4XLw; expires=Mon, 30-Oct-2017 18:23:41 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly Content-Length: 14 Connection: close Content-Type: application/json; utf-8 {"message":""} [] []
If I attempt to ingest the same content to a new, non-existent object without the -b (Bagit-friendly) option I get this output and log...
vagrant@dgadmin:/archive/grinnell_bags$ php ~/islandora_rest_ingester/ingest.php -l mylog.log -m islandora:sp_pdf -p grinnell:test -n test -o "System Admin" -t "xxxxxxxx" -u "System Admin" /archive/grinnell_bags/Bag-grinnell_99
HTTP/1.1 409 Conflict
Date: Sat, 07 Oct 2017 15:02:35 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Content-Type-Options: nosniff, nosniff
X-Powered-By: PHP/5.5.9-1ubuntu4.22
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate
Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=6rxJ3PoM8ub_LHo4xyoCeLqXuAIURIXkKIxaVaD6pXc; expires=Mon, 30-Oct-2017 18:35:55 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly
Content-Length: 49
Content-Type: application/json; utf-8
{"message":"Conflict: Datastream already exists"}
[2017-10-07 10:02:32] Ingest via REST.WARNING: /archive/grinnell_bags/Bag-grinnell_99/bagit.txt appears to be empty, skipping. [] []
[2017-10-07 10:02:32] Ingest via REST.WARNING: /archive/grinnell_bags/Bag-grinnell_99/tagmanifest-sha1.txt appears to be empty, skipping. [] []
[2017-10-07 10:02:32] Ingest via REST.WARNING: /archive/grinnell_bags/Bag-grinnell_99/bag-info.txt appears to be empty, skipping. [] []
[2017-10-07 10:02:32] Ingest via REST.INFO: Object test:22580 ingested from /archive/grinnell_bags/Bag-grinnell_99/data [] []
[2017-10-07 10:02:34] Ingest via REST.INFO: Object test:22580 datastream ADMIN_COVERSHEET ingested from /archive/grinnell_bags/Bag-grinnell_99/data/ADMIN_COVERSHEET.html [] []
[2017-10-07 10:02:34] Ingest via REST.INFO: SHA-1 checksum for object test:22580 datastream ADMIN_COVERSHEET verified. [] []
[2017-10-07 10:02:35] Ingest via REST.INFO: Object test:22580 datastream COVERSHEET ingested from /archive/grinnell_bags/Bag-grinnell_99/data/COVERSHEET.html [] []
[2017-10-07 10:02:35] Ingest via REST.INFO: SHA-1 checksum for object test:22580 datastream COVERSHEET verified. [] []
[2017-10-07 10:02:35] Ingest via REST.ERROR: POST /islandora/rest/v1/object/test:22580/datastream HTTP/1.1 User-Agent: GuzzleHttp/6.2.1 curl/7.35.0 PHP/5.5.9-1ubuntu4.22 Content-Type: multipart/form-data; boundary=cc6ce0257e1d7c751046cbceb3172c6c6016575a Host: localhost Accept: application/json X-Authorization-User: System Admin:xxxxxxx --cc6ce0257e1d7c751046cbceb3172c6c6016575a Content-Disposition: form-data; name="file"; filename="DC.xml" Content-Length: 1663 Content-Type: application/xml
[2017-10-07 10:02:35] Ingest via REST.ERROR: HTTP/1.1 409 Conflict Date: Sat, 07 Oct 2017 15:02:35 GMT Server: Apache/2.4.7 (Ubuntu) X-Content-Type-Options: nosniff, nosniff X-Powered-By: PHP/5.5.9-1ubuntu4.22 X-Drupal-Cache: MISS Expires: Sun, 19 Nov 1978 05:00:00 GMT Cache-Control: no-cache, must-revalidate Set-Cookie: SESS2f91b0390f7e2978fb09504dad1dcaaa=6rxJ3PoM8ub_LHo4xyoCeLqXuAIURIXkKIxaVaD6pXc; expires=Mon, 30-Oct-2017 18:35:55 GMT; Max-Age=2000000; path=/; domain=dgadmin.grinnell.edu; HttpOnly Content-Length: 49 Content-Type: application/json; utf-8 {"message":"Conflict: Datastream already exists"} [] []
From: Mark Jordan
Date: Friday, October 6, 2017 at 5:58 PM
Subject: Re: [mjordan/islandora_rest_ingester] Add a "Bagit-friendly" option (#2)
@McFateM in your initial post above you wrote:
I found it necessary to "skip" some of the derivative datastreams to avoid duplication which generally kills the process prematurely. These currently include 'DC', 'RELS-EXT', 'PREVIEW', 'TN', 'foo', and 'foxml'
Can you elaborate on this? Which process were you referring to - the one running the REST ingester script, or the one on the web server side? I might want you to open a separate issue to address this behavior, but please let me know more before we decide on that.
@McFateM In REST conventions, replacing or updating an existing object would require an HTTP PUT, whereas the Ingester currently only supports POSTing new objects and datastreams. We'll need to do a bit of development to build in the ability to update/replace object properties and datastreams. This is outside the scope of this issue, so I'll open a new one where we can hash out use cases. How does that sound? Of course, restoring an object from its Bag would be an important use case.
Duh, #3 is that new issue. Sorry, I've only had one cup of coffee so far this morning.
In our configuration I've engaged all of the "standard" bagit plugins, except for plugin_object_archivematica_transfers. So I feel like this makes my use relatively "typical", but I want to verify that it is. The vast majority of my bags are structured like so:
...and...
Note above that MODS gets an extension of .bin, as do some other datastreams. Some XML datastreams consistently come through as .xml. Is that normal for Islandora Bagit with the standard plugins?
So I've taken baby-steps in the code thus far so that a -bf (bagit-friendly) flag will automatically look for MODS.bin instead of MODS.xml, and then rename it to MODS.xml so that it processes as it should. In addition, I found it necessary to "skip" some of the derivative datastreams to avoid duplication which generally kills the process prematurely. These currently include 'DC', 'RELS-EXT', 'PREVIEW', 'TN', 'foo', and 'foxml'.
If I take these steps I'm able to "restore" an object from its bag without issue. The question is... should I be taking steps like this with a REST ingester option, or would it be better to build a new Islandora Bagit plugin, or implement a post-processing hook there, to prepare a bag that can be REST ingested without significant modification?
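The two steps described above (treat MODS.bin as MODS.xml, and skip the listed datastreams) can be sketched as a small planning function. This is a Python illustration, not the ingester's PHP code. The function name `plan_datastreams` is hypothetical, the sample file names are partly invented for the example, and it assumes each file in the bag's data directory is named after its datastream ID.

```python
import os

# Datastream IDs skipped in the post above, on the assumption that ingest regenerates them.
SKIP_DSIDS = {"DC", "RELS-EXT", "PREVIEW", "TN", "foo", "foxml"}


def plan_datastreams(filenames):
    """Return (dsid, source_filename, target_filename) tuples for the files to ingest."""
    plan = []
    for name in sorted(filenames):
        dsid = os.path.splitext(name)[0]  # assumes the file name is <DSID>.<extension>
        if dsid in SKIP_DSIDS:
            continue
        # Treat MODS.bin as MODS.xml so it is processed as XML.
        target = "MODS.xml" if name == "MODS.bin" else name
        plan.append((dsid, name, target))
    return plan


# Illustrative file list; only some of these names appear in the logs above.
files = ["MODS.bin", "DC.xml", "OBJ.pdf", "TN.jpg", "RELS-EXT.rdf", "COVERSHEET.html"]
for dsid, source, target in plan_datastreams(files):
    print(dsid, source, "->", target)
```

Whether this logic belongs in the ingester or in a Bagit plugin is the open question; the sketch only shows that the transformation itself is small.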