gdcc / python-dvuploader

✈️ - Python package for parallel direct upload to Dataverse

Message 'Not Implemented' #7

Closed: DonRichards closed this issue 6 months ago

DonRichards commented 7 months ago

Not sure what this error indicates.

I'm trying to upload FITS files to a DOI. I can use the UI and it uploads without an issue.

JR-1991 commented 7 months ago

@DonRichards, thank you for submitting the issue! Based on the error message you provided, it seems to originate from the AWS store. Unfortunately, the error message "Not implemented" is difficult to interpret. Could you please provide me with the version of Dataverse you are using?

I ran some local tests using Dataverse 6.0 and LocalStack, which acts as a simulation of AWS. However, I was unable to replicate the error: both direct uploads to the S3 store and the native upload path worked. I plan to conduct further testing on an actual AWS store and hopefully identify the bug causing the issue.

> I assume it does since I can use the UI for the same files.

As far as I know, the UI does not support direct uploads to an S3 store; its uploads therefore go through the standard HTTP method available in DV's native API. This would explain why the UI works properly and suggests that the issue might lie with the AWS store.

DonRichards commented 7 months ago

I found something odd: when I changed a variable name within my code, from DVUploader(files=files) to DVUploader(files=upload_files), I got a different error. Not sure what this indicates.

When I examined the files being passed to the uploader, they look like this. Do these values look correct? I would expect fileName and file_id to have something in them.

 File(
    filepath='/mnt/FitsFiles/Platinum_2416.fits',
    description='Posterior distributions of the stellar parameters for the star with ID from the Gaia DR3 catalog Platinum_2416.',
    directoryLabel='',
    mimeType='image/fits',
    categories=['DATA'],
    restrict=False,
    checksum_type=<ChecksumTypes.MD5: ('MD5',<built-in function openssl_md5>)>,
    storageIdentifier=None,
    fileName=None,
    checksum=None,
    to_replace=False,
    file_id=None
 ),
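
For context, a File list like the one above is typically handed off roughly like this (a minimal sketch, not the asker's actual code; the DOI, server URL, and token are placeholders, and the upload() parameter names are assumptions based on the library's documented usage):

    from dvuploader import DVUploader, File

    files = [
        File(
            filepath="/mnt/FitsFiles/Platinum_2416.fits",
            description="Posterior distributions of the stellar parameters ...",
            mimeType="image/fits",
            categories=["DATA"],
        ),
        # ... more File objects ...
    ]

    uploader = DVUploader(files=files)
    uploader.upload(
        persistent_id="doi:10.XXXX/XXXXX",              # placeholder DOI
        dataverse_url="https://dataverse.example.org",  # placeholder server
        api_token="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX",   # placeholder token
        n_parallel_uploads=20,
    )
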
JR-1991 commented 7 months ago

@DonRichards this is expected, since fileName and file_id are populated during upload, when the hashes are calculated. Do you think this is confusing? I am happy to change it so the filename is extracted at initialization.
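
For illustration only (a hypothetical sketch, not the library's actual class), deriving the filename at initialization would look something like this:

    import os
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FileSketch:
        # Hypothetical stand-in for the File model: fill fileName from the
        # filepath at construction time instead of leaving it None until upload.
        filepath: str
        fileName: Optional[str] = None

        def __post_init__(self):
            if self.fileName is None:
                self.fileName = os.path.basename(self.filepath)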

Can you share the error message you have received upon changing variable names?

DonRichards commented 7 months ago

It starts to upload, but then it throws this and exits: "An error occurred with uploading: Cannot write to closing transport"

[Screenshot attached: 2024-02-09 14-14-34]

JR-1991 commented 7 months ago

I came across this issue on Stack Overflow and found a solution provided by another user. I will implement the fix and open a pull request to see if it resolves the problem.

May I ask what your file sizes are, so I can test this on another server?

DonRichards commented 7 months ago

Each of the 401,000 files I'm attempting to upload to a single DOI is approximately 1.6 MB in size. I have a script that breaks them up into batches of 20 at a time, so the uploader should only be given a list of 20 files at once. Any idea what I can do from here to get this to work? I'd create a PR if I could, but I don't know this app well enough.
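
For context, the batching described here could look roughly like this (a sketch under the same placeholder assumptions as the earlier snippet, not the actual script):

    from pathlib import Path
    from dvuploader import DVUploader, File

    # Walk the FITS directory and hand the uploader 20 files at a time.
    fits_paths = sorted(Path("/mnt/FitsFiles").glob("*.fits"))

    for i in range(0, len(fits_paths), 20):
        batch = [File(filepath=str(p), mimeType="image/fits") for p in fits_paths[i : i + 20]]
        DVUploader(files=batch).upload(
            persistent_id="doi:10.XXXX/XXXXX",              # placeholder DOI
            dataverse_url="https://dataverse.example.org",  # placeholder server
            api_token="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX",   # placeholder token
        )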

JR-1991 commented 7 months ago

Great, thanks for the info! The PR is almost ready for submission. I'll run some tests on Demo Dataverse to check for any issues. Once I'm done, I'll let you know and you can test the updated version. Hope this will fix it 😊

DonRichards commented 7 months ago

Great! Thanks for that!

JR-1991 commented 7 months ago

@DonRichards, I have created a pull request #8 that fixes the issue. Unfortunately, the issue is related to streaming files to the S3 backend. AWS is not capable of handling async streams, which is a pity.
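
For background, this is a generic illustration of the class of workaround (not necessarily the exact change in PR #8): with an aiohttp-based client, one way to avoid streaming an async body to a pre-signed S3 URL is to buffer the file and send plain bytes:

    import aiohttp

    async def put_file_buffered(session: aiohttp.ClientSession, presigned_url: str, path: str) -> None:
        # Read the whole file into memory and send it as a plain body instead of
        # an async stream; some S3 endpoints reject chunked/streamed bodies, which
        # can surface client-side as "Cannot write to closing transport".
        with open(path, "rb") as fh:
            body = fh.read()
        async with session.put(presigned_url, data=body) as response:
            response.raise_for_status()

With files around 1.6 MB each, buffering per file is cheap; much larger files would call for a multipart approach instead.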

To test this, I downloaded a sample FITS file and replicated it 2000 times to simulate a case similar to yours. The error has not been raised on our test server, and the upload works. The upload to S3 itself is quite fast if you set n_parallel_uploads to 30, but the only downside is that registering the uploaded files at Dataverse takes considerable time. DVUploader has no influence on the time it takes, unfortunately.

Can you test and verify that it works on your side?

Regarding the bulk upload in general, would it be an option to use Dataverse's native upload instead? When direct upload is not enabled, this library automatically zips files into batches of at most 2 GB, which are unzipped on Dataverse's side. That way, you may avoid the extra time spent registering files with the direct upload path.
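
For reference, a single zipped batch can also be pushed through Dataverse's native add-file endpoint directly (a rough sketch with placeholder URL, token, and DOI; Dataverse unpacks the zip on its side when direct upload is not enabled):

    import json
    import requests

    DATAVERSE_URL = "https://dataverse.example.org"   # placeholder
    API_TOKEN = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX"    # placeholder
    PERSISTENT_ID = "doi:10.XXXX/XXXXX"               # placeholder

    with open("batch_001.zip", "rb") as fh:
        response = requests.post(
            f"{DATAVERSE_URL}/api/datasets/:persistentId/add",
            params={"persistentId": PERSISTENT_ID},
            headers={"X-Dataverse-key": API_TOKEN},
            files={"file": ("batch_001.zip", fh, "application/zip")},
            data={"jsonData": json.dumps({"categories": ["DATA"]})},
        )
    response.raise_for_status()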

DonRichards commented 7 months ago

Dumb question: how do I test the PR? Should I clone this repo and change something in my code so it uses the clone instead of the installed Python library? Googled it.
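
One general way to try an unmerged GitHub pull request (a standard pip feature, not project-specific instructions from this thread) is to install it straight from the PR's head ref into a scratch environment, for example:

    pip install "git+https://github.com/gdcc/python-dvuploader.git@refs/pull/8/head"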