IQSS / dataverse

Open source research data repository software
http://dataverse.org

isDirectUploadEnabled checks only if upload-redirect property is enabled for a storage driver #9002

Closed ErykKul closed 11 months ago

ErykKul commented 1 year ago

What steps does it take to reproduce the issue? When the property dataverse.files.&lt;driverId&gt;.upload-redirect is enabled, the usual HTTP file upload in the UI no longer works:

[screenshot: the file upload failing in the UI]

On the other hand, when that property is not enabled, adding a file that was uploaded directly (out of band) via the API fails with an error:

Bad Request:Dataset store configuration does not allow provided storageIdentifier.
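
For reference, the call that fails here is the native API "add file to dataset" request carrying a jsonData block that references a file already placed in storage out of band. A minimal sketch in Go is below; the endpoint and jsonData fields follow the direct upload API documentation as I understand it, while the server URL, PID, token, storageIdentifier, and checksum values are placeholders.

```go
// Hedged sketch: registering an out-of-band uploaded file with the native API.
// All concrete values (server, PID, token, storageIdentifier, checksum) are
// placeholders, not taken from a real installation.
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

func main() {
	server := "https://dataverse.example.org"       // placeholder
	pid := "doi:10.5072/FK2/ABCDEF"                 // placeholder
	token := "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" // placeholder API token

	// jsonData references a file the tool already wrote to the store.
	jsonData := `{
	  "storageIdentifier": "file://18e2b1e4c2a-f1d2d2f924e9",
	  "fileName": "data.csv",
	  "mimeType": "text/csv",
	  "checksum": {"@type": "MD5", "@value": "d41d8cd98f00b204e9800998ecf8427e"}
	}`

	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	_ = w.WriteField("jsonData", jsonData)
	w.Close()

	url := fmt.Sprintf("%s/api/datasets/:persistentId/add?persistentId=%s", server, pid)
	req, _ := http.NewRequest(http.MethodPost, url, &body)
	req.Header.Set("X-Dataverse-key", token)
	req.Header.Set("Content-Type", w.FormDataContentType())

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// When the store does not allow the provided storageIdentifier, this is
	// where the "Bad Request" response above comes back.
	fmt.Println(resp.Status)
}
```

With the store configured to allow direct/out-of-band upload, the same request should register the file instead of returning the 400 above.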

What did you expect to happen? I would assume that the upload-redirect option is not the same thing as allowing directly uploaded files. It makes sense that direct upload is allowed when dataverse.files.&lt;driverId&gt;.upload-redirect is enabled. Nevertheless, I think this should be extended so that direct uploading of files through the API is also allowed when upload-redirect is not enabled, e.g., via a DB setting. This is useful for integration tools that have access to the storage when that storage is not reachable from outside the network (whether "file" or "s3" storage, etc.).

Which version of Dataverse are you using? Development

Server log:

[#|2022-09-28T10:08:44.829+0200|WARNING|Payara 5.2022.1|com.amazonaws.util.EC2MetadataUtils|_ThreadID=102;_ThreadName=http-thread-pool::jk-connector(1);_TimeMillis=1664352524829;_LevelValue=900;|
Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). Failed to connect to service endpoint:
com.amazonaws.SdkClientException: Failed to connect to service endpoint:
    at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:100)
    at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:70)
    at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.readResource(InstanceMetadataServiceResourceFetcher.java:75)
    at com.amazonaws.internal.EC2ResourceFetcher.readResource(EC2ResourceFetcher.java:66)
    at com.amazonaws.util.EC2MetadataUtils.getItems(EC2MetadataUtils.java:407)
    at com.amazonaws.util.EC2MetadataUtils.getData(EC2MetadataUtils.java:376)
    at com.amazonaws.util.EC2MetadataUtils.getData(EC2MetadataUtils.java:372)
    at com.amazonaws.util.EC2MetadataUtils.getEC2InstanceRegion(EC2MetadataUtils.java:287)
    at com.amazonaws.regions.InstanceMetadataRegionProvider.tryDetectRegion(InstanceMetadataRegionProvider.java:59)
    at com.amazonaws.regions.InstanceMetadataRegionProvider.getRegion(InstanceMetadataRegionProvider.java:50)
    at com.amazonaws.regions.AwsRegionProviderChain.getRegion(AwsRegionProviderChain.java:46)
    at com.amazonaws.client.builder.AwsClientBuilder.determineRegionFromRegionProvider(AwsClientBuilder.java:475)
    at com.amazonaws.client.builder.AwsClientBuilder.setRegion(AwsClientBuilder.java:458)
    at com.amazonaws.client.builder.AwsClientBuilder.configureMutableProperties(AwsClientBuilder.java:424)
    at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
    at edu.harvard.iq.dataverse.dataaccess.S3AccessIO.getClient(S3AccessIO.java:1200)
    at edu.harvard.iq.dataverse.dataaccess.S3AccessIO.<init>(S3AccessIO.java:100)
    at edu.harvard.iq.dataverse.dataaccess.S3AccessIO.<init>(S3AccessIO.java:123)
    at edu.harvard.iq.dataverse.util.FileUtil.getS3AccessForDirectUpload(FileUtil.java:1764)
    at edu.harvard.iq.dataverse.EditDatafilesPage.requestDirectUploadUrls(EditDatafilesPage.java:1692)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at javax.el.ELUtil.invokeMethod(ELUtil.java:239)
    at javax.el.BeanELResolver.invoke(BeanELResolver.java:440)
    at javax.el.CompositeELResolver.invoke(CompositeELResolver.java:198)
    at com.sun.el.parser.AstValue.invoke(AstValue.java:257)
    at com.sun.el.MethodExpressionImpl.invoke(MethodExpressionImpl.java:237)
    at org.jboss.weld.module.web.util.el.ForwardingMethodExpression.invoke(ForwardingMethodExpression.java:40)
    at org.jboss.weld.module.web.el.WeldMethodExpression.invoke(WeldMethodExpression.java:50)
    at com.sun.faces.facelets.el.TagMethodExpression.invoke(TagMethodExpression.java:65)
    at com.sun.faces.application.MethodBindingMethodExpressionAdapter.invoke(MethodBindingMethodExpressionAdapter.java:66)
    at com.sun.faces.application.ActionListenerImpl.getNavigationOutcome(ActionListenerImpl.java:82)
    at com.sun.faces.application.ActionListenerImpl.processAction(ActionListenerImpl.java:71)
    at javax.faces.component.UICommand.broadcast(UICommand.java:222)
    at javax.faces.component.UIViewRoot.broadcastEvents(UIViewRoot.java:847)
    at javax.faces.component.UIViewRoot.processApplication(UIViewRoot.java:1396)
    at com.sun.faces.lifecycle.InvokeApplicationPhase.execute(InvokeApplicationPhase.java:58)
    at com.sun.faces.lifecycle.Phase.doPhase(Phase.java:76)
    at com.sun.faces.lifecycle.LifecycleImpl.execute(LifecycleImpl.java:177)
    at javax.faces.webapp.FacesServlet.executeLifecyle(FacesServlet.java:707)
    at javax.faces.webapp.FacesServlet.service(FacesServlet.java:451)
    at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1637)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:331)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:211)
    at org.glassfish.tyrus.servlet.TyrusServletFilter.doFilter(TyrusServletFilter.java:282)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:253)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:211)
    at org.ocpsoft.rewrite.servlet.RewriteFilter.doFilter(RewriteFilter.java:226)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:253)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:211)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:257)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:160)
    at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:757)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:577)
    at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:99)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:158)
    at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:371)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:238)
    at com.sun.enterprise.v3.services.impl.ContainerMapper$HttpHandlerCallable.call(ContainerMapper.java:520)
    at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:217)
    at org.glassfish.grizzly.http.server.HttpHandler.runService(HttpHandler.java:182)
    at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:156)
    at org.glassfish.grizzly.http.server.HttpServerFilter.handleRead(HttpServerFilter.java:218)
    at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:95)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:260)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:177)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:109)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:88)
    at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:53)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:524)
    at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:89)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:94)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:33)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:114)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
    at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
    at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
    at java.base/java.net.Socket.connect(Socket.java:609)
    at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
    at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
    at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1232)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
    at com.amazonaws.internal.ConnectionUtils.connectToEndpoint(ConnectionUtils.java:95)
    at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:80)
    ... 78 more
|#]

ErykKul commented 1 year ago

@qqmyers The method causing this problem is isDirectUploadEnabled in StorageIO.java. Would it be OK if I worked something out with a new DB setting?

qqmyers commented 1 year ago

@ErykKul - direct upload via the UI on develop seems to be working for me. There was a bug, introduced between the PR that added the isDirectUploadEnabled check and the Globus PR, that broke this. Is it possible you are not up to date on develop?

FWIW: The current functionality is that, when direct upload is enabled, the UI automatically switches to using it, while the API allows both direct and normal uploads. So far that hasn't seemed to cause issues, but finer-grained control to configure the API and UI separately, or to disallow normal upload via the API when direct upload is enabled, could be useful.

ErykKul commented 1 year ago

@qqmyers Direct upload from the UI did not work for me because we do not have S3 storage yet. Maybe I am doing something wrong, or at least something that is a bad idea: I am working on an integration tool (currently only for GitHub; other systems will follow) that will be hosted on the same network as the Dataverse server, with access to the same file system. The tool can upload files directly to S3 or to the file system, depending on the selected storage. Since we do not have S3 storage yet, I am testing it with the file system, where the tool writes the files (and hashes them in a single pass; see the sketch below) directly into the Dataverse file store. Once a file is in place, I use the API to add it to the corresponding dataset. I have planned for S3 from the start, so that the tool will remain usable when we switch to the new storage.
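
As a minimal illustration of that single-pass write-and-hash step, the sketch below streams a source file into the store while feeding the same bytes to a hash, so nothing is read twice. The paths, file names, and the choice of MD5 are placeholders, not taken from the actual tool.

```go
// Hedged sketch of writing a file to the store while computing its checksum
// in a single pass. The real tool lives at github.com/libis/rdm-integration;
// this is only an illustration of the idea.
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// writeAndHash streams src into destPath and returns the hex digest computed
// on the same pass, so the file is never read twice.
func writeAndHash(src io.Reader, destPath string) (string, error) {
	out, err := os.Create(destPath)
	if err != nil {
		return "", err
	}
	defer out.Close()

	h := md5.New()
	// Every byte written to the store file is also fed to the hash.
	if _, err := io.Copy(io.MultiWriter(out, h), src); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	in, err := os.Open("data.csv") // placeholder source file
	if err != nil {
		panic(err)
	}
	defer in.Close()

	// Placeholder destination inside the Dataverse files directory.
	sum, err := writeAndHash(in, "/usr/local/dvn/data/10.5072/FK2/ABCDEF/18e2b1e4c2a-f1d2d2f924e9")
	if err != nil {
		panic(err)
	}
	fmt.Println("md5:", sum)
}
```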

The thing that looks dangerous to me is that I generate the storage identifier in the tool and upload the file directly to that location, without using the temp folder. I will implement a cleanup that removes incomplete files (e.g., after a hard restart or a crash of the tool). However, if the storage identifier scheme in Dataverse were to change, this could break the tool. The tool itself is on GitHub (I am writing it in Go): https://github.com/libis/rdm-integration
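
For illustration only, generating the identifier in the tool might look like the sketch below. The pattern (hex timestamp, dash, random hex, prefixed with the driver scheme) is an assumption loosely modeled on example values in the direct upload documentation; the real scheme is internal to Dataverse, and a change to it is exactly the coupling risk described above.

```go
// Purely illustrative: generate a storage identifier in the external tool.
// The format shown here is an assumption, not Dataverse's defined scheme.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

func newStorageIdentifier(driverID string) (string, error) {
	suffix := make([]byte, 6)
	if _, err := rand.Read(suffix); err != nil {
		return "", err
	}
	// Hex timestamp plus random hex, e.g. "18e2b1e4c2a-f1d2d2f924e9" (assumed).
	base := fmt.Sprintf("%x-%s", time.Now().UnixMilli(), hex.EncodeToString(suffix))
	// Prefix with the driver scheme, e.g. "file://..."; an S3 store would also
	// carry a bucket name (assumption).
	return driverID + "://" + base, nil
}

func main() {
	id, err := newStorageIdentifier("file")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```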

I have created a pull request linked to this issue that solves my problem with direct upload via the API. All comments are welcome.

qqmyers commented 1 year ago

@ErykKul - ah. Direct upload wasn't designed for file stores, since there's no easy way for Dataverse to control access. That said, you're not the only one who has found it convenient to leverage just the /addFiles calls coupled with a trusted tool doing out-of-band storage (DANS did this for migration). The updates in 5.12 constrain that call so that only a file with a valid name and location can be registered (so if that location scheme changed and you were putting a file in the wrong place, you'd know, because the call would fail).

TL;DR: Making a separate API-only JVM option, as you've done, seems like a reasonable way to support out-of-band tools for now (needed in 5.12 due to the stricter constraint that /addFiles is only allowed when direct upload is enabled).

That said, I wonder if it should be further restricted, i.e. the API-only option is only valid for superusers, or it is renamed/updated to apply only to the /addFiles call itself (and a /replaceFiles call in development), since it is mainly aimed at out-of-band trusted tools (and won't magically allow direct upload to file stores via the API without an out-of-band transfer). With the increased checking of the format of storage identifiers, I don't think there's a risk for Dataverse so much as a possibility for users to be frustrated by thinking that such a call is allowed. (The guides could probably use some documentation about /addFiles to indicate that it only works in concert with direct upload or other out-of-band means of transferring the files; I'll look at that when I finish /replaceFiles.)

FWIW, the new Globus functionality leverages the /addFiles call as well, although in that case Dataverse manages temporary write access at the dataset level for the transfer itself. It also has to generate a storage identifier, since it does not use the direct upload calls that return signed S3 URLs (which include a Dataverse-generated storage identifier). A new (superuser-only?) API call to request valid storage identifiers might be a reasonable addition that could be used in these cases (yours, Globus).

W.r.t. temp files - we have this issue in general (i.e., if a user does a UI direct upload and leaves the page without hitting Cancel or Save, Dataverse doesn't know and doesn't clean up the files; similarly, in the direct upload API, if the final /addFiles call isn't made, the files are abandoned). On S3 systems that support it, we add a tag so these temp files can easily be found and deleted (e.g., via a cron job), but it would be nicer if there were an API call to remove files unrelated to a dataset on S3 (or the file system, etc.). Such an API would have to deal with files in different versions, keep all of the auxiliary files, thumbnails, and metadata exports that should remain, and delete any files with storage identifiers that aren't in the dataset. I've thought about writing that but haven't gotten to it. If you are going to manage cleanup for your tool, please consider adding that functionality to Dataverse instead, as I think its utility would be greater if it were embedded. A rough sketch of the idea follows below.
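
As a very rough, hypothetical sketch of that cleanup idea, from the external tool's perspective rather than embedded in Dataverse: delete stored files whose identifiers are not registered for the dataset, and keep anything that looks like an auxiliary file or export. The directory layout, suffix conventions, and the source of the registered-identifier set are all assumptions; a real implementation inside Dataverse would derive them from the dataset and its DataFiles.

```go
// Hypothetical cleanup sketch: remove abandoned files for one dataset.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cleanupAbandoned walks the dataset's storage directory and deletes files
// whose base name does not match any registered identifier. Names with a dot
// suffix are treated as auxiliary files and judged by their base identifier;
// dataset-level exports are kept outright (assumed naming convention).
func cleanupAbandoned(datasetDir string, registered map[string]bool, dryRun bool) error {
	entries, err := os.ReadDir(datasetDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		name := e.Name()
		base := strings.SplitN(name, ".", 2)[0] // assumed aux-file convention
		if registered[base] || strings.HasPrefix(name, "export_") {
			continue // registered file or dataset-level export: keep
		}
		path := filepath.Join(datasetDir, name)
		if dryRun {
			fmt.Println("would delete:", path)
			continue
		}
		if err := os.Remove(path); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Placeholder inputs: identifiers would come from the dataset's file list.
	registered := map[string]bool{"18e2b1e4c2a-f1d2d2f924e9": true}
	if err := cleanupAbandoned("/usr/local/dvn/data/10.5072/FK2/ABCDEF", registered, true); err != nil {
		panic(err)
	}
}
```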

ErykKul commented 1 year ago

@qqmyers Thank you! I will keep it in mind while working on the tool.

ErykKul commented 1 year ago

I have tested adding files with the SWORD API, and it works as a workaround. It is only the native API that throws "Dataset store configuration does not allow provided storageIdentifier". A sketch of such a SWORD deposit is below.
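
For reference, the SWORD workaround pushes the file bytes through Dataverse itself rather than out of band, which is why the storageIdentifier check never applies. The sketch below follows my reading of the SWORD API guide for the edit-media endpoint; the server, PID, token, and file name are placeholders.

```go
// Hedged sketch of adding a zip of files to a dataset via the SWORD v2 API.
// Endpoint path, headers, and packaging value are assumptions based on the
// SWORD API guide; all concrete values are placeholders.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	server := "https://dataverse.example.org"       // placeholder
	pid := "doi:10.5072/FK2/ABCDEF"                 // placeholder
	token := "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" // placeholder API token

	f, err := os.Open("example.zip") // placeholder zip containing the files
	if err != nil {
		panic(err)
	}
	defer f.Close()

	url := fmt.Sprintf("%s/dvn/api/data-deposit/v1.1/swordv2/edit-media/study/%s", server, pid)
	req, _ := http.NewRequest(http.MethodPost, url, f)
	req.SetBasicAuth(token, "") // SWORD uses the API token as the username
	req.Header.Set("Content-Type", "application/zip")
	req.Header.Set("Content-Disposition", "filename=example.zip")
	req.Header.Set("Packaging", "http://purl.org/net/sword/package/SimpleZip")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```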