jarulsamy opened this issue 2 months ago
I think this makes sense - I don't think we had any explicit reason to still allow uploads through the Dataverse server. In fact, I'm not sure we need a new configuration option; we could perhaps just disable normal upload when direct upload is true. (If anyone really needs to do a normal upload, they could configure two stores pointing to the same bucket, one with direct upload true and the other false, and flip to the direct=false one if/when needed.)
FWIW, the latest versions of DVUploader make direct upload the default (I'm not sure about pyDVUploader). There is also dvwebloader, which can be added to Dataverse and is direct upload only. These don't stop anyone from using the API to upload through the server, but they are ways to guide users away from doing that.
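For illustration, the two-store workaround mentioned above might look roughly like this (the store ids "s3direct"/"s3proxy", labels, and bucket name are placeholders; the option names follow the documented dataverse.files.<id>.* JVM options):

```
-Ddataverse.files.s3direct.type=s3
-Ddataverse.files.s3direct.label=s3direct
-Ddataverse.files.s3direct.bucket-name=my-bucket
-Ddataverse.files.s3direct.upload-redirect=true

-Ddataverse.files.s3proxy.type=s3
-Ddataverse.files.s3proxy.label=s3proxy
-Ddataverse.files.s3proxy.bucket-name=my-bucket
-Ddataverse.files.s3proxy.upload-redirect=false
```

If I remember correctly, a collection can then be pointed at either store via the admin storage driver API, so flipping between them if a non-direct upload is ever needed shouldn't require a server config change.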
Overview of the Feature Request
Be able to restrict API uploads to require direct uploads for some Dataverses.
(Apologies if I somehow missed this already existing; I could not find docs for it.)
What kind of user is the feature intended for? (Example user roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)
What inspired the request?
By default, if a Dataverse is using S3 backend storage with `upload-redirect=true`, uploads via the web UI automatically use direct upload. However, when a user uses the file upload APIs, they are able to circumvent direct upload and upload through the Dataverse instance. Other than being slower, this can cause issues if the user is uploading extremely large datasets that may fill temp space on the Dataverse instance.
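For reference, the kind of upload I'd like to be able to block is just a normal native API add, which streams the whole file through the Dataverse server before it is moved to S3. A rough sketch with Python requests (SERVER, API_TOKEN, DOI, and the file name are placeholders):

```python
# Rough sketch of a normal (non-direct) native API upload; this is the path that
# stages the file in the Dataverse server's temp space before it reaches S3.
import requests

SERVER = "https://dataverse.example.edu"            # placeholder installation URL
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder API token
DOI = "doi:10.5072/FK2/EXAMPLE"                     # placeholder dataset PID

with open("large_file.bin", "rb") as f:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": DOI},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": f},
    )
print(r.status_code, r.text)
```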
What existing behavior do you want changed?
I would like the addition of a per-Dataverse setting which can optionally require direct upload for API file uploads. Ideally, this would present the user with an error immediately if they try a normal (non-direct) upload via the native API.
Currently, for our instance, users are able to attempt uploading a large file with the API and have it fail partway through with an internal server error. The admins would see an error like this within the logs:
By adding this feature, we could catch this issue before the user spends time and bandwidth partially uploading their dataset.
Alternatively, if there is some other feature or method to prevent this scenario from happening, I'm more than happy to explore that too. Currently, we are relying on user education, pushing users to use the direct upload APIs whenever possible for datasets over 8 GB.
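For context, the direct upload flow we point users to looks roughly like the sketch below when done with plain Python requests. It is only a hedged outline of the documented uploadurls / S3 PUT / add sequence: single-part uploads only, no error handling, and SERVER, API_TOKEN, DOI, and the file name are placeholders, so real use should go through DVUploader or pyDVUploader.

```python
import hashlib
import json
import os

import requests

SERVER = "https://dataverse.example.edu"            # placeholder installation URL
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder API token
DOI = "doi:10.5072/FK2/EXAMPLE"                     # placeholder dataset PID
PATH = "large_file.bin"

headers = {"X-Dataverse-key": API_TOKEN}
size = os.path.getsize(PATH)

# 1) Ask Dataverse for a presigned S3 URL and a storageIdentifier for this upload.
#    (Very large files get a multipart response under data["urls"]; not handled here.)
resp = requests.get(
    f"{SERVER}/api/datasets/:persistentId/uploadurls",
    params={"persistentId": DOI, "size": size},
    headers=headers,
)
resp.raise_for_status()
data = resp.json()["data"]

# 2) PUT the bytes straight to S3; the Dataverse server never handles the payload.
with open(PATH, "rb") as f:
    requests.put(
        data["url"], data=f, headers={"x-amz-tagging": "dv-state=temp"}
    ).raise_for_status()

# 3) Compute a checksum and register the uploaded object as a file in the dataset.
md5 = hashlib.md5()
with open(PATH, "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        md5.update(chunk)

resp = requests.post(
    f"{SERVER}/api/datasets/:persistentId/add",
    params={"persistentId": DOI},
    headers=headers,
    files={
        "jsonData": (
            None,
            json.dumps(
                {
                    "storageIdentifier": data["storageIdentifier"],
                    "fileName": os.path.basename(PATH),
                    "mimeType": "application/octet-stream",
                    "md5Hash": md5.hexdigest(),
                }
            ),
        )
    },
)
resp.raise_for_status()
print(resp.json()["status"])
```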
Any open or closed issues related to this feature request?
Not that I could find.
Are you thinking about creating a pull request for this feature?
Ideally I would like to; however, it may not be in the near future (probably 3+ months).