eprints / eprints3.4

EPrints 3.4 core and releases
http://www.eprints.org/uk/index.php/eprints-3-4/
GNU Lesser General Public License v3.0
31 stars 28 forks source link

Zero byte files from Pure Deposit API #379

Closed aalbinclark closed 4 months ago

aalbinclark commented 7 months ago

There is an issue with Pure (5.28.2) depositing files in EPrints (3.4.5). The file storage repeatedly fails for an eprint before eventually working. There is no obvious pattern.

EPrints shows the failed uploads as duplicate documents with zero byte files, eventually showing the successfully uploaded file: https://eprints.lancs.ac.uk/id/eprint/216341/

Pure audit log (reverse chronological order)

17/03/24 0:04
preservedContent:createFile
success

File ElectronicVersionFile(id=409574281, fileName=2024SultanPhD.PDF.pdf, sizelong=64496457) in content Thesis(id=409574279, title=Nonlinear and nonautonomous dynamical properties of the cardiovascular response to a range of ambient temperatures) successfully stored in eprints (store id 216341/123809)
Preserved content update as root
Content ID: 409574279
XML
<?xml version="1.0" encoding="UTF-8"?>

<storage>
  <operation>storeFile</operation>
  <storeFileId>216341/123809</storeFileId>
  <contentUuid>dc1728b3-d4e5-4836-bc37-f517fd0f26cd</contentUuid>
  <family>ResearchOutput</family>
</storage>

16/03/24 0:07
preservedContent:createFile
error

File ElectronicVersionFile(id=409574281, fileName=2024SultanPhD.PDF.pdf, sizelong=64496457) in content Thesis(id=409574279, title=Nonlinear and nonautonomous dynamical properties of the cardiovascular response to a range of ambient temperatures) storage in eprints failed. Error was: Unable to store file in repository
Preserved content update as root
Content ID: 409574279
XML
<?xml version="1.0" encoding="UTF-8"?>

<storage>
  <operation>storeFile</operation>
  <storeFileId>none</storeFileId>
  <reason>Unable to store file in repository</reason>
  <exception><![CDATA[dk.atira.pure.api.shared.services.file.permanentstore.exception.PermanentStoreException: Unable to access https://eprints.lancs.ac.uk/pda/handler/eprint/216341/documents/2024SultanPhD.PDF.pdf : eprints.lancs.ac.uk:443 failed to respond
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.execute(EPrintsClientPDAImpl.java:371)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.execute(EPrintsClientPDAImpl.java:327)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.executePost(EPrintsClientPDAImpl.java:314)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.addFile(EPrintsClientPDAImpl.java:178)
    at dk.atira.pure.modules.storage.internal.eprints.repository.EprintsRepositoryFileStoreAdapter.createFile(EprintsRepositoryFileStoreAdapter.java:203)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.createFiles(ContentStorageSynchronizationHandlerImpl.java:196)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.updateContent(ContentStorageSynchronizationHandlerImpl.java:179)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.synchronize(ContentStorageSynchronizationHandlerImpl.java:83)
    at dk.atira.pure.server.services.impl.file.InternalFileServiceImpl.synchronizeWithStores(InternalFileServiceImpl.java:756)
    ...
]]></exception>
  <operationAttempts>3</operationAttempts>
  <contentUuid>dc1728b3-d4e5-4836-bc37-f517fd0f26cd</contentUuid>
  <family>ResearchOutput</family>
</storage>

15/03/24 0:07
preservedContent:createFile
error

File ElectronicVersionFile(id=409574281, fileName=2024SultanPhD.PDF.pdf, sizelong=64496457) in content Thesis(id=409574279, title=Nonlinear and nonautonomous dynamical properties of the cardiovascular response to a range of ambient temperatures) storage in eprints failed. Error was: Unable to store file in repository
Preserved content update as root
Content ID: 409574279
XML
<?xml version="1.0" encoding="UTF-8"?>

<storage>
  <operation>storeFile</operation>
  <storeFileId>none</storeFileId>
  <reason>Unable to store file in repository</reason>
  <exception><![CDATA[dk.atira.pure.api.shared.services.file.permanentstore.exception.PermanentStoreException: Unable to access https://eprints.lancs.ac.uk/pda/handler/eprint/216341/documents/2024SultanPhD.PDF.pdf : eprints.lancs.ac.uk:443 failed to respond
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.execute(EPrintsClientPDAImpl.java:371)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.execute(EPrintsClientPDAImpl.java:327)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.executePost(EPrintsClientPDAImpl.java:314)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.addFile(EPrintsClientPDAImpl.java:178)
    at dk.atira.pure.modules.storage.internal.eprints.repository.EprintsRepositoryFileStoreAdapter.createFile(EprintsRepositoryFileStoreAdapter.java:203)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.createFiles(ContentStorageSynchronizationHandlerImpl.java:196)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.updateContent(ContentStorageSynchronizationHandlerImpl.java:179)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.synchronize(ContentStorageSynchronizationHandlerImpl.java:83)
    at dk.atira.pure.server.services.impl.file.InternalFileServiceImpl.synchronizeWithStores(InternalFileServiceImpl.java:756)
    ...
]]></exception>
  <operationAttempts>2</operationAttempts>
  <contentUuid>dc1728b3-d4e5-4836-bc37-f517fd0f26cd</contentUuid>
  <family>ResearchOutput</family>
</storage>

14/03/24 2:30
preservedContent:createFile
error

File ElectronicVersionFile(id=409574281, fileName=2024SultanPhD.PDF.pdf, sizelong=64496457) in content Thesis(id=409574279, title=Nonlinear and nonautonomous dynamical properties of the cardiovascular response to a range of ambient temperatures) storage in eprints failed. Error was: Unable to store file in repository
Preserved content update as root
Content ID: 409574279
XML
<?xml version="1.0" encoding="UTF-8"?>

<storage>
  <operation>storeFile</operation>
  <storeFileId>none</storeFileId>
  <reason>Unable to store file in repository</reason>
  <exception><![CDATA[dk.atira.pure.api.shared.services.file.permanentstore.exception.PermanentStoreException: Unable to access https://eprints.lancs.ac.uk/pda/handler/eprint/216341/documents/2024SultanPhD.PDF.pdf : eprints.lancs.ac.uk:443 failed to respond
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.execute(EPrintsClientPDAImpl.java:371)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.execute(EPrintsClientPDAImpl.java:327)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.executePost(EPrintsClientPDAImpl.java:314)
    at dk.atira.pure.modules.storage.internal.eprints.client.EPrintsClientPDAImpl.addFile(EPrintsClientPDAImpl.java:178)
    at dk.atira.pure.modules.storage.internal.eprints.repository.EprintsRepositoryFileStoreAdapter.createFile(EprintsRepositoryFileStoreAdapter.java:203)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.createFiles(ContentStorageSynchronizationHandlerImpl.java:196)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.createContent(ContentStorageSynchronizationHandlerImpl.java:381)
    at dk.atira.pure.server.services.impl.file.ContentStorageSynchronizationHandlerImpl.synchronize(ContentStorageSynchronizationHandlerImpl.java:80)
    at dk.atira.pure.server.services.impl.file.InternalFileServiceImpl.synchronizeWithStores(InternalFileServiceImpl.java:756)
    ...
]]></exception>
  <operationAttempts>1</operationAttempts>
  <contentUuid>dc1728b3-d4e5-4836-bc37-f517fd0f26cd</contentUuid>
  <family>ResearchOutput</family>
</storage>
jesusbagpuss commented 7 months ago

@aalbinclark does this relate to the filesize in any way? The example is a thesis, which could be a large document. It would be useful to get the corresponding lines from the EPrints/Apache logs to see whether the EPrints server is returning an error response, or if e.g. the transfer is timing-out.

[Edit: in the logs provided, there is a 'sizelong=64496457' - which looks like 64M. Not that big in the grand scheme of things, but possibly big enough to cause timeouts]

drn05r commented 7 months ago

@aalbinclark EPrints Apache error log messages would be really useful. (If you have separate error logs for HTTP and HTTPS be sure to check both). Also, it would be useful if you could confirm whether you have only just setup Pure to deposit to EPrints or whether the issue has only occurred since upgrading either Pure or EPrints. If the issue has only occurred since an upgrade to one of the systems, please let us know what version you were running prior to the upgrade.

aalbinclark commented 6 months ago

EPrints has been running for over fifteen years and has been updated via the Pure Deposit API for many years. We installed a new instance of EPrints (3.4.5) mid-November 2023.

Pure EPrints Issue?
5.27.1-2 3.4.2 No
5.27.3 3.4.2 No
5.27.3 3.4.5 Yes
5.28.2 3.4.5 Yes

I've compiled a fuller set of logs for another eprint https://eprints.lancs.ac.uk/id/eprint/217096/. File size is only 6MB.

Pure audit logs

EPrints Apache ssl_access_log

148.88.22.127 - - [27/Mar/2024:15:40:05 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [28/Mar/2024:01:20:24 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [28/Mar/2024:01:20:24 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [28/Mar/2024:01:20:24 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [29/Mar/2024:01:46:01 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [29/Mar/2024:01:46:01 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [29/Mar/2024:01:46:02 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [30/Mar/2024:01:21:56 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [30/Mar/2024:01:21:56 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [30/Mar/2024:01:21:57 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [31/Mar/2024:01:34:13 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [31/Mar/2024:01:34:13 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [31/Mar/2024:01:34:13 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [02/Apr/2024:01:03:37 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [02/Apr/2024:01:03:37 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [02/Apr/2024:01:03:37 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [03/Apr/2024:00:36:24 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [03/Apr/2024:00:36:24 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [03/Apr/2024:00:36:25 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [04/Apr/2024:00:27:32 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [04/Apr/2024:00:27:32 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [04/Apr/2024:00:27:33 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [05/Apr/2024:00:33:33 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [05/Apr/2024:00:33:33 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [05/Apr/2024:00:33:34 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [06/Apr/2024:00:18:10 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [06/Apr/2024:00:18:10 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [06/Apr/2024:00:18:10 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [07/Apr/2024:00:02:35 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [07/Apr/2024:00:02:35 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [07/Apr/2024:00:02:36 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [07/Apr/2024:23:57:35 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [07/Apr/2024:23:57:35 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [07/Apr/2024:23:57:36 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 401 -
148.88.22.127 - - [07/Apr/2024:23:57:36 +0000] "POST /pda/handler/eprint/217096/documents/2024DolmorMScRes.pdf?license=creative_commons_attribution_noncommercial_4_0_international_license&visibility=public&description=2024DolmorMScRes&documentVersion=published HTTP/1.1" 200 92
10.56.49.214 - - [09/Apr/2024:09:39:45 +0000] "GET /id/eprint/217096/12/2024DolmorMScRes.pdf HTTP/1.1" 200 6052873
148.88.22.127 - - [09/Apr/2024:09:40:38 +0000] "GET /pda/handler/eprint/217096/documents/124480 HTTP/1.1" 401 -
148.88.22.127 - - [09/Apr/2024:09:40:38 +0000] "GET /pda/handler/eprint/217096/documents/124480 HTTP/1.1" 200 6052873
148.88.22.127 - - [09/Apr/2024:09:45:05 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 401 -
148.88.22.127 - - [09/Apr/2024:09:45:05 +0000] "PUT /pda/handler/eprint/217096 HTTP/1.1" 200 129
148.88.22.127 - - [09/Apr/2024:09:45:06 +0000] "PUT /pda/handler/eprint/217096/documents/124480 HTTP/1.1" 401 -
148.88.22.127 - - [09/Apr/2024:09:45:06 +0000] "PUT /pda/handler/eprint/217096/documents/124480 HTTP/1.1" 200 127
jesusbagpuss commented 6 months ago

@aalbinclark - for the Pure connector, I normally expect to see a 401 (unauthorised) followed by a 200 response. For the failing POST requests, EPrints responds with the normal 401, but there isn't a subsequent request (with the required auth, that would then get the 200 response).

This could mean:

The POST on 7th April did have the follow-up request - which got a 200 response, and looks like it was successful. The GET on the 9th resulted in a 6M response.

What changed between the 6th and 7th - either in EPrints, Pure, or channels between them?

One thing from the logs is the timing of the deposit - shortly before midnight, to shortly after midnight. Do any jobs start on the EPrints server at midnight that might impact things e.g. database or storage backups?


In case it's useful, I've got a modified version of the PDA modules that allows you keep the raw files as sent across from Pure - both the EPrints XML and the Document payloads.

Search this gist: https://gist.github.com/jesusbagpuss/757138e884eba03a9ad30d7a1df52bc6 for 'pda_keep_tmp_files' to see what's needed. The files will be in /tmp/ (possibly under a systemd...httpd directory) with a filename that starts pdahandler.

It might be useful to see exactly what's being sent. If that all looks normal, then the files can be used to put together some tests using e.g. Postman.

Happy to describe the modified PDA code in more detail if you want. Will - drop me a line J.Salter@leeds.ac.uk if you want.

drn05r commented 6 months ago

@aalbinclark It is important to appreciate that there are three components here not just two:

  1. Pure
  2. EPrints
  3. PDA Handler (a.k.a. Pure Connector)

The PDA Handler was developed by Pure rather than using EPrints REST API (like I think @jesusbagpuss's implementation uses). This means in it is not officially supported by EPrints. It was only quite recently that I agreed with Pure that we could make a reference version of the PDA Handler available on EPrints GitHub Organisation.

As @jesusbagpuss has already said you would expect 401s followed by 200s so at an HTTP interface level everything seems to be working as expected. The issue looks like it might lie with how the PDA Handler calls whatever functions it uses to store a file to disk. I am not aware of any changes off hand that would have affected calls to EPrints functions that store files to disk, especially not any which would cause an intermittent issue like this. As @jesusbagpuss previously queried, I could believe an issue were smaller files uploading successfully but larger ones always failing but the same file uploading sometimes and not others feels unlikely to be a coding issue and more likely an enviromental one.

I would advise that you compare the reference version of the PDA Handler against what you are currently running on your repository to see if there are any differences. This will at least allow us to rule out any bug that may have been fixed. Unfortunately the latest version I am aware of that uses this reference version of the PDA Handler is only on version 3.4.3. To my knowledge there has not been any intermittent issues with zero byte files for this version.

It would also be useful that you could confirm that the only thing that has changed between the tests you did with EPrints 3.4.2 and EPrints 3.4.5. Did you tests run on the same version of operating system with similar configuration (e.g. SELinux state)? Were the underlying applications (e.g. Apache, mod_perl, MySQL, etc.) the same versions and running much the same configuration?

Finally, although the access logs look at expected, some warnings may still be being recorded in the Apache error logs, please can you check if there is you can find anything.

P.S. Just looking at the access log more closely, it looks like PUT commands always have a 401 followed by a 200 but the POST commands to upload a file almost always have a 401 but then no further 200, except at 07/Apr/2024:23:57:36. From what I have experienced from how Pure uses the PDA Handler, it never send credentials until it is explicitly told to do so with a 401 response. So what is odd here is what changes Pure's behaviour at 07/Apr/2024:23:57:36. I cannot think of any reason for this, unless something has changed about EPrint record 217096. It may be worth checking the history tab for this eprint to see if there are any changes between 07/Apr/2024:00:02:36 and 07/Apr/2024:23:57:36. Alternatively maybe something about the equivalent record in Pure has changed which means it is will to try uploading the file with credentials, when before it was not.

drn05r commented 4 months ago

Closing as this is not a development task or a bug directly related to the EPrints codebase (i.e. it involves third-party application/plugins)