Open pdurbin opened 2 years ago
The part about "if you don't set the limit explicitly, it is unlimited" - do you know offhand if anyone has actually tested it? As in, try running it without the limit set and uploading, say, a 5 GB file? What I'm getting at is that this limit being an int instead of a long may not be the only bottleneck inside their code; the library is quite old and was clearly not written for depositing large volumes of data, so I would not be surprised if there were another 2 GB or 4 GB limit elsewhere. But if this were the only such bottleneck, the fix would be quite trivial.
@landreev when SWORD was new in DVN 3.x I remember trying a large file locally and it working but I can't remember exactly how many gigabytes. I want to say 4 or 5 GB. Obviously, this was a long time ago. I haven't done any recent testing.
Aside from the limit not being associated with a Dataverse collection, there seems to be a misunderstanding.
The reported maximum upload size is in kilobytes, not bytes:
The SWORD server MAY specify the sword:maxUploadSize (in kB) of content that can be uploaded in one request [SWORD003] as a child of the app:service element. If provided this MUST contain an integer.
(See 6.1. Retrieving a Service Document)
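For reference, this is roughly how the element appears in a service document per that section of the spec (the surrounding attributes and the value here are illustrative, not taken from a real Dataverse response):

```xml
<app:service xmlns:app="http://www.w3.org/2007/app"
             xmlns:sword="http://purl.org/net/sword/terms/">
  <sword:version>2.0</sword:version>
  <!-- 2,097,152 kB = 2 GB; note the unit is kB, not bytes -->
  <sword:maxUploadSize>2097152</sword:maxUploadSize>
  <!-- app:workspace elements with collections follow -->
</app:service>
```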
It's wrong in the code!
This means a max int value (2,147,483,647 kB) can already represent roughly 2 TB.
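A quick sanity check of that arithmetic (plain Java; the class name is mine):

```java
public class MaxUploadSizeCheck {
    public static void main(String[] args) {
        long maxKb = Integer.MAX_VALUE;   // 2,147,483,647 kB, the largest reportable value
        long maxBytes = maxKb * 1024L;    // widen to long before multiplying to avoid overflow
        System.out.println(maxBytes);             // 2199023254528 bytes
        System.out.printf("%.1f TB%n", maxBytes / 1e12); // about 2.2 TB
    }
}
```

Note the `1024L`: multiplying two ints here would silently overflow, which is exactly the class of bug this issue is about.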
Reporting the gist of our Slack tech discussion yesterday:
```mermaid
sequenceDiagram
    actor up as ZIP Upload
    participant servlet as dv.SWORDv2MediaResourceServlet
    participant mapi as sw2s.MediaResourceApi
    participant sae as sw2s.SwordApiEndpoint
    participant mrmi as dv.MediaResourceManagerImpl
    participant fu as dv.FileUtil
    up ->> servlet: doPost()
    servlet ->> mapi: post()
    mapi ->> sae: addDepositPropertiesFromBinary()
    sae ->> sae: storeAndCheckBinary()
    note right of sae: Store ZIP upload as SWORD temp file
    mapi ->> mrmi: addResource()
    mrmi ->> mrmi: replaceOrAddFiles()
    mrmi ->> mrmi: Identify target dataset
    mrmi ->> fu: createDataFiles()
    fu ->> fu: Retrieve target store size limit
    fu ->> fu: Copy SWORD temp file to another temp file
    fu ->> fu: Unzip
    fu ->> fu: For each file check limit and store
```
@landreev @pdurbin is this still a thing for you?
May I propose a rather simple change?
```diff
 public interface SwordConfiguration {
     ...
-    int getMaxUploadSize();
+    long getMaxUploadSize();
     ...
 }
```
The comparison happens in bytes. Instead of being capped at max int (2,147,483,647 bytes, about 2 GB), we would be limited only by max long (9,223,372,036,854,775,807 bytes). That requires only minimal code changes and gives us a simple workaround for now.
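A minimal sketch of why the `int` return type caps uploads at 2 GB and what the `long` variant buys. The method names mirror the diff above, but this standalone demo and its values are mine, not Dataverse code:

```java
public class UploadLimitDemo {
    // Old signature: the largest limit an int can express is
    // Integer.MAX_VALUE bytes, i.e. about 2 GiB.
    static int getMaxUploadSizeInt() { return Integer.MAX_VALUE; }

    // New signature: room for limits far beyond 2 GiB,
    // e.g. a hypothetical 5 GiB limit.
    static long getMaxUploadSizeLong() { return 5L * 1024 * 1024 * 1024; }

    public static void main(String[] args) {
        long fileSize = 5L * 1024 * 1024 * 1024; // a 5 GiB upload
        // With the int limit, a 5 GiB file can never pass the check:
        System.out.println(fileSize <= getMaxUploadSizeInt());  // false
        // With the long limit, it can:
        System.out.println(fileSize <= getMaxUploadSizeLong()); // true
    }
}
```

The key point is that no configuration value can help while the return type is `int`; the limit itself is unrepresentable above 2,147,483,647 bytes.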
@poikilotherm yes! I think this is exactly what we want. Obviously, we'll need to make the change on the Dataverse side as well. When we're ready, we should re-open this issue (or create a new one, no strong preference), and get it put into a sprint:
Thx @pdurbin
Do you want this on the `main` branch right away or on the `jakarta` branch only? (Which will become the new `main` once we need it for DV.)
@pdurbin said on 2022-05-16 via Matrix: let's do this in a future branch, neither main nor jakarta.
For now this is a placeholder issue. Today @landreev @scolapasta and I were talking about how, in practical terms, files are limited to 2 GB when uploading to Dataverse via SWORD because getMaxUploadSize returns an `int`. Below is the code from https://github.com/IQSS/dataverse/blob/v5.9/src/main/java/edu/harvard/iq/dataverse/api/datadeposit/SwordConfigurationImpl.java#L123-L146
To summarize: