GlobalDataverseCommunityConsortium / dataverse-uploader

Upload local folder/directory trees to Dataverse or Clowder repositories.
https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader/wiki/DVUploader,-a-Command-line-Bulk-Uploader-for-Dataverse
Apache License 2.0

Uploading over 2 GB #4

Open jmjamison opened 4 years ago

jmjamison commented 4 years ago

I'm trying to use DVUploader for large geodatabases (gdb files). This works fine with files around 2 GB, but not with anything larger, for example 2.87 GB.

Over that size I get:

Jun 08, 2020 1:32:54 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://dataverse.ucla.edu:443: Software caused connection abort: socket write error
Jun 08, 2020 1:32:54 PM org.apache.http.impl.execchain.RetryExec execute

This is an AWS S3 bucket, and I've raised :MaxFileUploadSizeInBytes to 8 GB, but that doesn't seem to help.
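
(For reference, that limit is the :MaxFileUploadSizeInBytes database setting, which is normally changed through the Dataverse admin settings API. A minimal sketch, with the localhost URL and the 8 GB value as placeholders for whatever your installation uses:)

# Raise the upload limit to 8 GB (8589934592 bytes)
curl -X PUT -d 8589934592 http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes

# Check the value that is currently set
curl http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes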

qqmyers commented 4 years ago

FWIW: The current DVUploader is limited to < 5 GB on AWS S3 buckets when using direct upload (AWS doesn't allow a single upload above that size; anything larger has to be split into multiple parts). I'm currently testing code to use multipart uploads, which will remove that limit.
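
(To illustrate what that split looks like at the S3 level, independent of DVUploader and Dataverse, here is a rough sketch using the AWS CLI; the bucket, object key, part names, UploadId, and ETag are placeholders:)

# Split the file into parts below the 5 GB single-upload limit
split -b 2G bigfile.gdb.zip part_

# Start a multipart upload; S3 returns an UploadId
aws s3api create-multipart-upload --bucket my-bucket --key bigfile.gdb.zip

# Upload each part, keeping the ETag S3 returns for it
aws s3api upload-part --bucket my-bucket --key bigfile.gdb.zip \
  --part-number 1 --body part_aa --upload-id <UploadId>

# Tell S3 to assemble the parts into the final object
aws s3api complete-multipart-upload --bucket my-bucket --key bigfile.gdb.zip \
  --upload-id <UploadId> \
  --multipart-upload '{"Parts":[{"PartNumber":1,"ETag":"<ETag>"}]}'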

That said, 2.87 GB should work. Are you using direct upload? If not, my guess would be that some software in your setup has a timeout that is cutting off the upload - either the web server, the AJP connection to Glassfish, a load balancer, etc. Or it could be that you're running out of space in the temp directory used by Dataverse (it keeps two temporary copies somewhere on disk). If you are using direct upload, I'm not sure what could be timing out - possibly a proxy server if you use one. One way to get some information on which software is timing out: the response headers, which you can see in the browser console (when uploading via the Dataverse UI rather than DVUploader) or with curl's -v option, usually include a Server: entry. If something times out, that Server: entry shows which piece of software responded. For example, I think we saw the AWS load balancer responding when QDR had timeouts (versus 'Server: Apache' for successful calls).

I'm not sure that the DVUploader reports that information. It does write more information to its log file than it prints to the console, so there may be a clue there. If not, you may want to try the Dataverse UI or curl as a way to debug (and we may want to add more debug info to DVUploader). If it turns out not to be a timeout issue, I can certainly go into DVUploader and see what other information we might be able to print out when a failure happens.
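
(As a concrete sketch of the curl route: the request below uses the native add-file API - the same endpoint that shows up in the server log later in this thread - with the server URL, DOI, API token, and file name as placeholders. Watch the verbose output for the Server: response header.)

curl -v -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@bigfile.gdb.zip" \
  "https://dataverse.example.edu/api/datasets/:persistentId/add?persistentId=doi:10.5072/FK2/EXAMPLE"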

jmjamison commented 4 years ago

At your suggestion I tried uploading from the Dataverse UI and I got a size error. So, that gives me someplace to start looking. Thank you for the suggestions. If/when I can track this down I'll post the answer here - in case someone else runs into this.

pkiraly commented 3 years ago

I have a similar problem. There is no :MaxFileUploadSizeInBytes setting, so according to the Dataverse manual: "If the MaxFileUploadSizeInBytes is NOT set, uploads, including SWORD may be of unlimited size." We use a normal file system, not S3. When I try uploading an 8 GB file, I get the following error on the client side:

PROCESSING(F): oa_status_by_doi.csv.gz
               Does not yet exist on server.
May 19, 2021 6:01:44 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://test.data.gro.uni-goettingen.de:443: Broken pipe (Write failed)
May 19, 2021 6:01:44 PM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {s}->https://test.data.gro.uni-goettingen.de:443
May 19, 2021 6:17:15 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://test.data.gro.uni-goettingen.de:443: Broken pipe (Write failed)
May 19, 2021 6:17:15 PM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {s}->https://test.data.gro.uni-goettingen.de:443
...

In the server log I found these errors:

[2021-05-19T18:16:45.246+0200] [Payara 5.2021.1] [SEVERE] [] [] [tid: _ThreadID=99 
      _ThreadName=http-thread-pool::jk-connector(1)] [timeMillis: 1621441005246] [levelValue: 1000] [[
  java.io.IOException: java.lang.InterruptedException
    at org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleRead(TCPNIOTransportFilter.java:68)
        ....
    at edu.harvard.iq.dataverse.api.ApiBlockingFilter.doFilter(ApiBlockingFilter.java:168)
        ....

then

[2021-05-19T18:16:45.249+0200] [Payara 5.2021.1] [SEVERE] [] [edu.harvard.iq.dataverse.api.errorhandlers.ThrowableHandler]
 [tid: _ThreadID=99 _ThreadName=http-thread-pool::jk-connector(1)]
 [timeMillis: 1621441005249] [levelValue: 1000] [[
  _status="ERROR";
  _code=500;
  _message="Internal server error. More details available at the server logs.";
  _incidentId="65718191-522f-4ef0-be10-df3b471d0534";
  _interalError="IOException";
  _internalCause="InterruptedException";
  _requestUrl="https://test.data.gro.uni-goettingen.de/api/v1/datasets/:persistentId/add?persistentId=...&key=...";
  _requestMethod="POST"|]]

(I removed identifiers from this snippet and added some formatting.)

jmjamison commented 3 years ago

Keep in mind that I'm a user, not a developer. That said, I was able to manage large uploads by setting up an S3 store with direct upload.
As I understand the problem - using the web interface, uploads go through temporary storage on the way to the S3 store, and that temporary storage runs out of space. [Here is where a developer can jump in and correct my description.] Hope this helps some.
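
(For anyone trying the same route, here is a rough sketch of the JVM options involved, based on the Dataverse installation guide for S3 storage; the store id "s3", the bucket name, and the values are placeholders, so check the guide for your Dataverse version:)

# Define an S3 store ("s3" is the store id, the bucket name is a placeholder)
./asadmin create-jvm-options "-Ddataverse.files.s3.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.label=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.bucket-name=my-dataverse-bucket"

# Let clients upload directly to S3 instead of streaming through temp storage on the application server
./asadmin create-jvm-options "-Ddataverse.files.s3.upload-redirect=true"

If I remember right, DVUploader also has a direct-upload option (described in the wiki linked at the top) that has to be turned on to use this path.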