gdcc / sword2-server

(Forked) Java Server Library for supporting integration with SWORDv2
Apache License 2.0

upload size limit #43

Open pdurbin opened 2 years ago

pdurbin commented 2 years ago

For now this is a placeholder issue. Today @landreev @scolapasta and I were talking about how in practical terms, files are limited to 2 GB when uploading to Dataverse via SWORD because getMaxUploadSize returns an int.

Below is the code from https://github.com/IQSS/dataverse/blob/v5.9/src/main/java/edu/harvard/iq/dataverse/api/datadeposit/SwordConfigurationImpl.java#L123-L146

To summarize:


@Override
public int getMaxUploadSize() {

    int unlimited = -1;
    /* It doesn't look like we can determine which store will be used here, so we'll go with the default
     * (It looks like the collection or study involved is available where this method is called, but the SwordConfiguration.getMaxUploadSize()
     * doesn't allow a parameter)
     */ 
    Long maxUploadInBytes = systemConfig.getMaxFileUploadSizeForStore("default");

    if (maxUploadInBytes == null){
        // (a) No setting, return unlimited           
        return unlimited;      

    }else if (maxUploadInBytes > Integer.MAX_VALUE){
        // (b) setting returns the limit of int, return max int value  (BUG)
        return Integer.MAX_VALUE;

    }else{            
        // (c) Return the setting as an int
        return maxUploadInBytes.intValue();

    }
}
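
For illustration, the arithmetic behind that cap (a standalone sketch, not Dataverse code): Integer.MAX_VALUE is 2,147,483,647, so once a byte count is squeezed into an int, any limit above roughly 2 GiB collapses to that value.

public class MaxIntCap {
    public static void main(String[] args) {
        long maxUploadInBytes = 5L * 1024 * 1024 * 1024; // hypothetical 5 GiB store limit
        // Case (b) above: anything over Integer.MAX_VALUE bytes is clamped.
        int reported = maxUploadInBytes > Integer.MAX_VALUE
                ? Integer.MAX_VALUE
                : (int) maxUploadInBytes;
        System.out.println(reported); // 2147483647, i.e. just under 2 GiB
    }
}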

landreev commented 2 years ago

About the "if you don't set the limit explicitly, it is unlimited" part: do you know offhand whether anyone has actually tested it? As in, run it without the limit set and try to upload, say, a 5 GB file? What I'm getting at is that this limit being an int instead of a long may not be the only bottleneck inside their code; the library is quite old and was clearly not written for depositing large volumes of data, so I would not be surprised if there were another 2 GB or 4 GB limit elsewhere. But if this were the only such bottleneck, the fix would be quite trivial.

pdurbin commented 2 years ago

@landreev when SWORD was new in DVN 3.x I remember trying a large file locally and it working, but I can't remember exactly how many gigabytes; I want to say 4 or 5 GB. Obviously, this was a long time ago, and I haven't done any recent testing.

poikilotherm commented 2 years ago

Aside from the limit not being associated with a Dataverse collection, there seems to be a misunderstanding.

The reported maximum upload size is in kilobytes, not bytes:

The SWORD server MAY specify the sword:maxUploadSize (in kB) of content that can be uploaded in one request [SWORD003] as a child of the app:service element. If provided this MUST contain an integer.

(See 6.1. Retrieving a Service Document)

The unit is handled incorrectly in the code:

https://github.com/gdcc/sword2-server/blob/fca2c8a06640d81fe57b107216befcb0061d2044/src/main/java/org/swordapp/server/SwordAPIEndpoint.java#L185

This means a max int value, interpreted as kilobytes, can express about 2 TB.
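
A self-contained sketch of the unit handling the spec calls for (names here are illustrative, not the library's actual internals): treat the configured limit as kilobytes and convert to bytes before comparing against a deposit size.

// Illustrative only: interpret a SWORD maxUploadSize (kB per the spec)
// when checking a deposit measured in bytes; -1 means unlimited.
public final class UploadLimitCheck {
    static boolean exceedsLimit(long depositSizeBytes, int maxUploadSizeKb) {
        if (maxUploadSizeKb == -1) {
            return false; // no limit configured
        }
        long maxBytes = maxUploadSizeKb * 1024L; // widen to long to avoid int overflow
        return depositSizeBytes > maxBytes;
    }
}

For example, exceedsLimit(5L * 1024 * 1024 * 1024, Integer.MAX_VALUE) returns false: a 5 GiB deposit is well under the roughly 2 TiB that max int kilobytes works out to.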

poikilotherm commented 2 years ago

Reporting the gist of our Slack tech discussion yesterday:

sequenceDiagram
  actor up as ZIP Upload
  participant servlet as dv.SWORDv2MediaResourceServlet
  participant mapi as sw2s.MediaResourceApi
  participant sae  as sw2s.SwordApiEndpoint
  participant mrmi as dv.MediaResourceManagerImpl
  participant fu as dv.FileUtil

  up ->> servlet: doPost()
  servlet ->> mapi: post()
  mapi ->> sae: addDepositPropertiesFromBinary()
  sae ->> sae: storeAndCheckBinary()
  note right of sae: Store ZIP upload as SWORD temp file

  mapi ->> mrmi: addResource()
  mrmi ->> mrmi:  replaceOrAddFiles()
  mrmi ->> mrmi: Identify target dataset
  mrmi ->> fu: createDataFiles()
  fu ->> fu: Retrieve target store size limit
  fu ->> fu: Copy SWORD temp file to another temp file
  fu ->> fu: Unzip
  fu ->> fu: For each file check limit and store
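
In plain code, the tail of that flow amounts to something like the following sketch (illustrative names, not Dataverse's actual FileUtil): unzip the SWORD temp file and check each extracted file against the store limit.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative only: mirrors the diagram's last steps (unzip, then a
// per-file limit check); method and parameter names are hypothetical.
final class UnzipWithLimit {
    static void unpack(Path swordTempFile, Path unpackDir, Long storeLimitBytes) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(swordTempFile))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                Path out = unpackDir.resolve(entry.getName()).normalize();
                if (!out.startsWith(unpackDir)) continue; // zip-slip guard
                Files.createDirectories(out.getParent());
                long copied = Files.copy(zip, out);
                if (storeLimitBytes != null && copied > storeLimitBytes) {
                    Files.delete(out); // this file exceeds the store limit
                    throw new IOException("Too large: " + entry.getName());
                }
            }
        }
    }
}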

poikilotherm commented 2 years ago

@landreev @pdurbin is this still a thing for you?

May I propose a rather simple change?

public interface SwordConfiguration {
...
-     int getMaxUploadSize();
+     long getMaxUploadSize();
...
}

The comparison happens in bytes. Changing the return type raises the ceiling from max int (2,147,483,647) to max long (9,223,372,036,854,775,807), with only minimal code changes; a simple workaround for now.
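
On the Dataverse side, the override quoted at the top might then shrink to something like this (a sketch assuming the same systemConfig call; the Integer.MAX_VALUE clamp disappears):

@Override
public long getMaxUploadSize() {
    long unlimited = -1;
    Long maxUploadInBytes = systemConfig.getMaxFileUploadSizeForStore("default");
    if (maxUploadInBytes == null) {
        return unlimited; // (a) no setting, unlimited
    }
    // (b) is gone: the setting already fits in a long, so return it directly
    return maxUploadInBytes;
}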

pdurbin commented 2 years ago

@poikilotherm yes! I think this is exactly what we want. Obviously, we'll need to make the change on the Dataverse side as well. When we're ready, we should re-open this issue (or create a new one, no strong preference) and get it put into a sprint.

poikilotherm commented 2 years ago

Thx @pdurbin

Do you want this on the main branch right away, or only on the jakarta branch (which will become the new main once we need it for DV)?

@pdurbin said on 2022-05-16 via Matrix: let's do this in a future branch, neither main nor jakarta.