VTUL / vtechworks

DSpace at Virginia Tech
http://vtechworks.lib.vt.edu

SWORD deposits from ACM troubleshooting #773

Closed pyc1 closed 2 hours ago

pyc1 commented 2 years ago

Tom Gibson at ACM is having trouble sending SWORD v2 submissions to us:

The submission I made was via SWORD v2. After doing so the response did seem favorable – a large block of text including “200 OK”. I’ve copied it all into the attached txt. The one thing that struck me as odd was the special characters found in some of the script tags. Do you need to whitelist some of our IP addresses? That’s something I had to do with another institution I was making SWORD submissions to. Below is the string I’m using to perform the bin/curl submission:

-i --data-binary ""@/var/www/cfapi/repositoryManagement/files/3442381.3450060/3442381.3450060.zip"" -H ""Content-Disposition: attachment; filename=3442381.3450060.zip"" -H ""Packaging: http://purl.org/net/sword/package/METSDSpaceSIP"" -u acmopen@hq.acm.org:password -X POST https://vtechworks.lib.vt.edu/handle/10919/105038

Please let me know if you have any thoughts – you’re the second institution I’ve had to set up a SWORD arrangement with, so this may take some trial and error. If you feel it will help, I’d be happy to hop onto something like Zoom so we can work together in real time.

pyc1 commented 2 years ago

swordResponse.txt

alawvt commented 2 years ago

Tom, It looks like what you received was the HTML for the collection page, https://vtechworks.lib.vt.edu/handle/10919/105038.

I am able to load art_4959256494595851995.zip, which contains PDF and mets.xml, with the following command using sword v1:

curl -i --data-binary "@art_4959256494595851995.zip" -H "Content-Disposition: filename=art_4959256494595851995.zip" -H "Content-Type: application/zip" -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP" -H "X-No-Op: false" -H "X-Verbose: true" -u alaw@vt.edu:password -X POST https://vtechworks.lib.vt.edu/sword/deposit/10919/105038

However, following SWORD 2.0 Profile - Creating a Resource with a Binary File Deposit, if I use:

curl -i --data-binary "@SV.2021.21548575.zip" -H "Content-Disposition: filename=SV.2021.2154857.zip" -H "Content-Type: application/zip" -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP" -H "X-No-Op: false" -H "X-Verbose: true" -u alaw@vt.edu:password -X POST https://vtechworks.lib.vt.edu/swordv2/collection/10919/105038

the file is deposited but the metadata is not parsed. I do not know why.

Of our other SWORD depositors, two use SWORD v1 and deposit a zip containing the PDF and mets.xml.

There is one SWORD v2 depositor who deposits the zip and an extra XML file containing metadata. I believe they are using the Atom Multipart Deposit, SWORD 2.0 Profile - Creating a Resource with a Multipart Deposit.

Since we only receive the documents, I do not know the details of their implementations.

alawvt commented 2 years ago

I sent a query to the DSpace tech listserv, SWORD v2 zip submission fails to parse mets.xml.

alawvt commented 2 years ago

I haven't found any difference in the METS between SWORD v1 or v2. You might try changing the packaging header directive to 'Packaging' instead of 'X-Packaging'. Apparently -H "Packaging:...." was required for v2. On 6.3 we've had success with:

/usr/bin/curl --basic --user myn...@mit.edu:$mypass -i -T "./PhysRevB.99.075430-mets.zip" -H "Content-Disposition:attachment; filename=PhysRevB.99.075430-mets.zip" -H "Content-Type:application/zip" -H "Packaging:http://purl.org/net/sword/package/METSDSpaceSIP" -H "X-No-Op:false" -vvv -X POST https://dspace.mit.edu/swordv2/collection/1721.1/121131

I also tested v2 on beta dspace 7.* a while back and that worked as well.

Hopefully that helps.

Carl

alawvt commented 2 years ago

Carl,

Thank you very much for your help, which resolved my issue. Indeed, -H "Packaging:http://purl.org/net/sword/package/METSDSpaceSIP" seems to be required for SWORD v2, whereas -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP" is required for SWORD v1. Both the header name and the packaging URL differ between the two versions.

So, to summarize:

curl -i --data-binary "@art_4959256494595851995.zip" -H "Content-Disposition: filename=art_4959256494595851995.zip" -H "Content-Type: application/zip" -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP" -H "X-No-Op: false" -H "X-Verbose: true" -u email@vt.edu:password -X POST https://vtechworks.lib.vt.edu/sword/deposit/10919/105038

Yields HTTP 202 and correctly parsed metadata.

curl -i --data-binary "@art_4959256494595851995.zip" -H "Content-Disposition: filename=art_4959256494595851995.zip" -H "Content-Type: application/zip" -H "Packaging:http://purl.org/net/sword/package/METSDSpaceSIP" -H "X-No-Op: false" -H "X-Verbose: true" -u email@vt.edu:password -X POST https://vtechworks.lib.vt.edu/swordv2/collection/10919/105038

Yields HTTP 201 and correctly parsed metadata.
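The difference between the two calls can be sketched in Python as follows. This is only an illustration of the summary above: the function name `sword_request` is hypothetical, the endpoint paths and header values are copied from the two curl commands, and nothing is sent over the network.

```python
from typing import Dict, Tuple

# Base URL taken from the curl commands in this thread.
BASE_URL = "https://vtechworks.lib.vt.edu"


def sword_request(version: int, handle: str, zip_name: str) -> Tuple[str, Dict[str, str]]:
    """Return (deposit URL, headers) for a SWORD v1 or v2 zip deposit.

    Hypothetical helper: it only assembles the values, it does not POST.
    """
    # Headers that were the same in both working curl commands.
    headers = {
        "Content-Disposition": f"filename={zip_name}",
        "Content-Type": "application/zip",
        "X-No-Op": "false",
        "X-Verbose": "true",
    }
    if version == 1:
        # SWORD v1: X-Packaging header with the sword-types URL.
        url = f"{BASE_URL}/sword/deposit/{handle}"
        headers["X-Packaging"] = "http://purl.org/net/sword-types/METSDSpaceSIP"
    elif version == 2:
        # SWORD v2: Packaging header with the sword/package URL.
        url = f"{BASE_URL}/swordv2/collection/{handle}"
        headers["Packaging"] = "http://purl.org/net/sword/package/METSDSpaceSIP"
    else:
        raise ValueError("version must be 1 or 2")
    return url, headers


if __name__ == "__main__":
    for v in (1, 2):
        url, headers = sword_request(v, "10919/105038", "art_4959256494595851995.zip")
        print(v, url)
        for name, value in headers.items():
            print("   ", name + ":", value)
```

Note that neither URL is the /handle/ page address; posting to the handle URL is what returned the collection page HTML earlier in this thread.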

alawvt commented 2 years ago

The mets.xml file is parsed by DSpace for metadata for both SWORD and SWORDv2. The extra XML file deposited by our SWORDv2 submitter is not parsed upon upload but is made available, e.g. https://vtechworks.lib.vt.edu/handle/10919/105028.

BioMed Central and MDPI use SWORD v1. Hindawi uses SWORD v2.

alawvt commented 2 years ago

We received it but only this parsed:

Democratizing Cellular Access with CellBricks

dc.description.provenance Submitted by ACM SWORD (acmopen@hq.acm.org) on 2021-10-05T20:32:16Z No. of bitstreams: 2 3452296.3473336.pdf: 1842050 bytes, checksum: 959bbeaedfb1e4248b705cfb21b0bf67 (MD5) 3452296.3473336.zip: 1796525 bytes, checksum: 93acadf9ddbfb2116caa2337b3a9f2ec (MD5) en
dc.title Democratizing Cellular Access with CellBricks  
dc.date.updated 2021-10-05T20:32:16Z

This one is declared as UTF-16 but is saved as UTF-8, so I guess it can work. But almost everything we produce and receive is UTF-8, so I recommend using that.
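A quick way to catch this kind of mismatch before sending might look like the sketch below. It is a rough check, not a full encoding detector: the function names are hypothetical, and it relies on the assumption that if the XML declaration can be read as plain ASCII bytes, the file cannot really be UTF-16 or UTF-32.

```python
import re
from typing import Optional

# Matches an XML declaration that is readable as ASCII bytes and captures
# the encoding name it declares.
_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the lower-cased declared encoding, or None if no
    ASCII-readable declaration is found at the start of the data."""
    match = _DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None


def encoding_mismatch(data: bytes) -> bool:
    """True when the declaration says UTF-16/UTF-32 but the declaration
    itself is stored as single-byte ASCII, as in the file described above."""
    declared = declared_encoding(data)
    return declared is not None and declared.startswith(("utf-16", "utf-32"))


if __name__ == "__main__":
    # Hypothetical sample bytes, not the actual ACM file.
    sample = b'<?xml version="1.0" encoding="UTF-16"?><mets/>'
    print(declared_encoding(sample), encoding_mismatch(sample))
```

To use it on a deposit, read mets.xml from the zip in binary mode and pass the bytes to `encoding_mismatch`.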

alawvt commented 2 years ago

I think you may have called it with this UTF encoding issue – I’ve made an adjustment to keep it at 8. I’ve also just submitted a paper successfully. How do things look? - Tom

alawvt commented 2 years ago

It seems like we are getting the items consistently now. The latest, 3409118.3475142.zip, is declared as UTF-8 and actually is.

Our crosswalk, sword-swap-ingest.xsl, maps only the title, id, and date, so those are the only fields in mets.xml that get parsed. I suggest using the fields in the crosswalk as much as possible. You can include other fields, and we may modify the crosswalk to use them, but that wouldn't happen immediately. If you develop a tag set that matches the crosswalk, it should work for all DSpace repositories, since this is the default crosswalk that comes with DSpace. I can give you feedback on the mets files you send. It might also be possible to test against the DSpace 6.3 demo site, https://demo.dspace.org/xmlui/. There, you could deposit to a collection and see the submission yourself. It might be instructive to see it from the DSpace side.
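To compare what a mets.xml file sends against what the crosswalk maps, something like the following sketch could list every epdcx statement in the file. It assumes the metadata uses epdcx:statement elements with an epdcx:propertyURI attribute and an epdcx:valueString child; the function name and the sample fragment are hypothetical, not an actual ACM deposit.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

EPDCX_NS = "http://purl.org/eprint/epdcx/2006-11-16/"


def list_statements(mets_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (propertyURI, valueString) for every epdcx:statement found.

    valueString is None when a statement has no epdcx:valueString child.
    """
    root = ET.fromstring(mets_xml)
    found = []
    for stmt in root.iter(f"{{{EPDCX_NS}}}statement"):
        uri = stmt.get(f"{{{EPDCX_NS}}}propertyURI", "")
        value = stmt.findtext(f"{{{EPDCX_NS}}}valueString")
        found.append((uri, value))
    return found


# Hypothetical, heavily trimmed fragment for illustration only.
SAMPLE = """<mets xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/">
  <epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/title">
    <epdcx:valueString>Democratizing Cellular Access with CellBricks</epdcx:valueString>
  </epdcx:statement>
  <epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/creator">
    <epdcx:valueString>Pancotto, Theresa E</epdcx:valueString>
  </epdcx:statement>
</mets>"""

if __name__ == "__main__":
    for uri, value in list_statements(SAMPLE):
        print(uri, "->", value)
```

Any propertyURI in the output that has no matching rule in sword-swap-ingest.xsl would be a field that is sent but not currently parsed.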

alawvt commented 2 years ago

Which fields among those you see in the XML do you want conveyed? I ask because there are a number of datapoints that I’ve tried sending that don’t have representation on the sword-swap-ingest page you linked me to. Data points such as DOI, which eRights form was selected, an array of author information, the paper’s publisher, I could go on.

Is this something you encounter often or am I missing something? - Tom

alawvt commented 2 years ago

In general, we want as much metadata as possible. If a metadata value can't be added with the tags in sword-swap-ingest.xsl, it is fine to add other tags, and we'll try to improve the crosswalk to map them later. It would be great if the extra tags matched those sent by other vendors, which are listed in issue #720.

I have attached an annotated mets.xml file for 3409118.3475142.zip with our suggestions for tagging.

Also, you can use the two zip files from the other vendors that I sent you and the third one attached as examples.

alawvt commented 2 years ago

When I referred to dc.title, I meant the destination field for the title. The BioMed Central sample attached, art_4959256494595851995.zip, was deposited in VTechWorks at https://vtechworks.lib.vt.edu/handle/10919/78663?show=full.

The mets.xml file references the epdcx (ePrints Dublin Core) metadata schema

xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"

which defines this field.

The mets.xml file is processed by sword-swap-ingest.xsl which also references

xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"

All the fields will need to be sent in the form used in the BioMed Central example file. I suggest adding just one field to it and making sure that works first, perhaps:

<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/creator">
<epdcx:valueString>Pancotto, Theresa E</epdcx:valueString>
</epdcx:statement>
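If the mets.xml is generated by a script, a statement like the one above could be built with a small helper rather than by string concatenation, which also takes care of escaping. This is a minimal sketch in Python; the helper name `epdcx_statement` is hypothetical, and only the namespace, the propertyURI, and the element names come from the example above.

```python
import xml.etree.ElementTree as ET

EPDCX_NS = "http://purl.org/eprint/epdcx/2006-11-16/"

# Serialize with the "epdcx" prefix instead of an auto-generated one.
ET.register_namespace("epdcx", EPDCX_NS)


def epdcx_statement(property_uri: str, value: str) -> ET.Element:
    """Build one epdcx:statement element holding a single valueString."""
    stmt = ET.Element(
        f"{{{EPDCX_NS}}}statement",
        {f"{{{EPDCX_NS}}}propertyURI": property_uri},
    )
    value_el = ET.SubElement(stmt, f"{{{EPDCX_NS}}}valueString")
    value_el.text = value
    return stmt


if __name__ == "__main__":
    creator = epdcx_statement(
        "http://purl.org/dc/elements/1.1/creator", "Pancotto, Theresa E"
    )
    print(ET.tostring(creator, encoding="unicode"))
```

One statement per author would be the natural way to send an array of author information, assuming the crosswalk is later extended to map the creator property.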