IQSS / dataverse

Open source research data repository software
http://dataverse.org

Download Package File from S3 #4949

Closed: matthew-a-dunlap closed this issue 5 years ago

matthew-a-dunlap commented 5 years ago

With #4703 we are supporting storage of data with DCM (rsync) on S3. There is more work needed to allow downloading of the data stored in this manner. Whether this is extending RSAL or providing some other download method (e.g. direct S3 link) needs discussion.

djbrooke commented 5 years ago

Had some discussion about this in the backlog grooming 8/15 and we want to discuss a few technical approaches in more detail before estimating.

djbrooke commented 5 years ago

At request of @pameyer - this will initially be unauthenticated

pdurbin commented 5 years ago

One technical approach I think we should at least consider is the "sync" command from AWS CLI.

Unfortunately, Dataverse users wanting to download files would need to install the AWS CLI, so it would be trickier to support than rsync, which comes standard on Mac and Linux and, I presume, can be installed without too much trouble on Windows (but I don't know). I have no idea how much config it requires for unauthenticated downloads, which is all we said above we want to support. For rsync there is no config to do for unauthenticated downloads, which is why our docs at http://guides.dataverse.org/en/4.9.2/user/find-use-data.html#downloading-a-dataverse-package-via-rsync are relatively straightforward.

The docs for "sync" can be found at https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html and it sounds like a hierarchical directory structure is supported: "Recursively copies new and updated files from the source directory to the destination."

An example specific to downloading is provided:

The following sync command syncs files in a local directory to objects under a specified prefix and bucket by downloading s3 objects....

aws s3 sync s3://mybucket .

I found out about "sync" from https://serverfault.com/questions/73959/using-rsync-with-amazon-s3

pameyer commented 5 years ago

When I've tested it, AWS S3 sync does support directory hierarchies. I haven't investigated if or how it supports unauthenticated access to public S3 objects, and I'm not fully up to speed on whether a package file stored in S3 corresponds to an S3 bucket, an S3 object, or something else.
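
One data point on the unauthenticated question: a publicly readable S3 object can be fetched with a plain, unsigned HTTPS GET, with no AWS CLI or credentials involved. A minimal stdlib-Python sketch (the bucket and key names here are hypothetical, for illustration only):

```python
from urllib.parse import quote

def public_s3_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Build the virtual-hosted-style HTTPS URL for a public S3 object."""
    host = (f"{bucket}.s3.amazonaws.com" if region == "us-east-1"
            else f"{bucket}.s3.{region}.amazonaws.com")
    return f"https://{host}/{quote(key)}"

# Hypothetical bucket/key; fetching it is then just
# urllib.request.urlopen(url) -- no signing, no client config.
url = public_s3_url("my-dataverse-bucket", "10.5072/FK2/ABCDEF/data.zip")
```

This only works for objects whose bucket policy or ACL allows anonymous reads, which ties into the concern below about keeping unpublished and restricted files out of any publicly readable bucket.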

matthew-a-dunlap commented 5 years ago

s3 sync thoughts: I looked into the AWS S3 throttling options and there seems to be no way to configure throttling of S3 access on the server side. The user can put parameters into their ~/.aws/config, but that has to be voluntary: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#configuration-values .
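
For context, the client-side knobs from the linked s3-config page live in the `s3` section of the user's `~/.aws/config`. Something like the following (the values are illustrative, not recommendations) is what a cooperative user could set; nothing server-side can enforce it:

```ini
[default]
s3 =
  max_concurrent_requests = 4
  max_bandwidth = 50MB/s
```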

With our current implementation we'd also have to separate the unpublished and restricted files from the s3 bucket.

My vibe is that we shouldn't expose the S3 bucket directly in this way (though we do already expose short-term direct downloads). I wonder if there are simple existing tools (web applications?) that would let us add a layer in between for access control and other needs.

@pameyer a package stored in s3 becomes a folder with subfolders

matthew-a-dunlap commented 5 years ago

We may be able to create a simple "API Gateway" as a layer between S3 and the world (https://aws.amazon.com/api-gateway/, acting as an S3 gateway). We may want to steer away from more AWS services, though.

matthew-a-dunlap commented 5 years ago

Discussed solutions for S3 rsync-uploaded files:

- Direct access via aws commands

- Something leveraging time-sensitive download URLs

- Rsync pointing to S3 mounted (FUSE) on a separate box/server

- Dataverse API Gateway (a layer in front of S3)

Extra notes from conversation after deciding on zip file download as the best solution

scolapasta commented 5 years ago

As @matthew-a-dunlap mentioned in the last comment, after meeting during tech hours, we decided on zip file download as the best solution:

(*) need to see here if we can / should support bagit

pameyer commented 5 years ago

One thing that may (or may not) be worth checking: I've been using "zip" in this discussion as a stand-in for "generic archive format" ("tar", compressed or uncompressed, might be another alternative). Is there a format- or implementation-imposed limit on the size of a zip file (or tar file)?

djbrooke commented 5 years ago

@scolapasta I moved this back to the design column because in the sprint planning meeting you mentioned some UI/UX impact. Can you elaborate on what you see as potentially impacting the UI?

scolapasta commented 5 years ago

Sure. Basically, when packages are large (as will often be the case), it may be best not to have a download button that automatically redirects your browser to the time-limited URL, but rather to display that time-limited URL (via a popup?) to the user and ask them to use their preferred download manager, or to click it and let the browser download.

We may want to consider this to be the logic for any download? (so as to be consistent?)

pameyer commented 5 years ago

According to https://en.wikipedia.org/wiki/ZIP_(file_format)#Limits, standard zip has a max of ~4 GiB for archives (and for individual files within the archive); ZIP64 has a max of ~16 EiB. Info-ZIP (standard for CentOS 7 / OS X?) supports ZIP64; I haven't checked support on other platforms.
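
To make the ZIP64 distinction concrete, here is a small stdlib-Python sketch. In Python's zipfile module, `allowZip64` (on by default in current Python) is what permits archives past the classic limits; setting it to False makes the writer fail rather than silently produce a broken archive:

```python
import io
import zipfile

buf = io.BytesIO()
# allowZip64=True (the default) writes ZIP64 extension records when needed,
# lifting the classic limits of ~4 GiB (archive and per-file) and 65535 entries.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
    zf.writestr("package/data.txt", b"hello")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    payload = zf.read("package/data.txt")
```

So as long as both the writer (Dataverse/DCM side) and the reader (whatever unzip the user has) support ZIP64, the 4 GiB ceiling shouldn't apply.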

matthew-a-dunlap commented 5 years ago

Looking into .tar, it seems the format itself has no size limit, though for some implementations the max may be 8 GB? https://www.linuxquestions.org/questions/red-hat-31/tar-file-size-limit-542690/ https://lists.gnu.org/archive/html/help-tar/2015-04/msg00001.html
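
For what it's worth, the 8 GB figure comes from the classic ustar header, which stores each member's size in an 11-digit octal field (max 8**11 bytes = 8 GiB); the GNU and PAX extensions both lift that limit. Python's tarfile module makes the three formats explicit, as a small sketch:

```python
import io
import tarfile

# tarfile.USTAR_FORMAT: classic header; the size field is 11 octal digits,
# so each member maxes out at 8**11 bytes (8 GiB).
# tarfile.GNU_FORMAT / tarfile.PAX_FORMAT: extensions without that limit.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="package/data.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    member_names = tar.getnames()
```

So the "8 GB limit" only bites if an implementation is restricted to plain ustar; GNU tar and anything PAX-aware are fine.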

TaniaSchlatter commented 5 years ago

Design in progress: https://docs.google.com/document/d/1zcOt4Xwz3kxbJITM1HuLDK2tUaMpV7QAnDUWjmDV3io/edit?usp=sharing

scolapasta commented 5 years ago

One open question is what URL to display:

I had originally assumed it would be the time-limited one, but people were asking whether that would count as separate downloads if used more than once (it would not, since this goes completely outside of Dataverse). In practice, since the URL is time limited, that would/should not happen, so I'd say we should be OK with counting just one download.

Alternatively, the URL from the API could work, but it would have the same issue and would likely be worse:

When you go through the UI, the guestbook is filled out and stored in the database as a "download". Then the browser is redirected to the API with a special flag that says "when you download, don't add a new row". Since we currently don't expose this URL, it's not much of an issue. But if we now choose to display it, someone could copy and reuse it. The difference between this and the S3 URL is that this URL is not time limited.

So, I suggest we use the S3 time limited URL.

In the long run, we do want to separate guestbook creation from download and connect them as needed when the download happens (see the work ADA is doing for request access). When we do that, we could stop counting the filling out of the guestbook as the download and instead attach the guestbook id to the download. This would allow the URL to be used only once (the download would fail if no guestbook were attached), so it would be the best of all solutions. But this feels very out of scope for this issue.
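
To illustrate the "time limited URL" idea in miniature: the server signs the path plus an expiry timestamp with a server-side secret, and later rejects the URL if the signature doesn't match or the expiry has passed. This is not S3's actual SigV4 presigning algorithm, just a toy sketch of an expiring, tamper-evident link; all names are hypothetical:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical; held only by the server

def sign_url(path: str, expires_at: int) -> str:
    """Return path?expires=...&sig=..., signed with the server secret."""
    msg = f"{path}|{expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={sig}"

def verify(path: str, expires_at: int, sig: str, now: int) -> bool:
    """Accept only an unexpired URL whose signature matches."""
    msg = f"{path}|{expires_at}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and now < expires_at

# A link valid for one hour from now:
url = sign_url("/packages/data.zip", int(time.time()) + 3600)
```

Note that, as discussed above, nothing in a scheme like this prevents reuse *within* the validity window; true single-use would need server-side state (e.g. a guestbook id consumed on download).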

poikilotherm commented 5 years ago

@scolapasta pretty please use a URL based on the Dataverse instance. We are planning to use S3 (see #4690), but it will not be publicly accessible.

Maybe you can add a check at the URL's endpoint for whether the "direct download from S3" feature is enabled, and redirect the browser via 302 to the temporary S3 URL?
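
The flow described here (a stable instance URL that checks the feature flag and 302-redirects to the short-lived S3 URL) might look roughly like this minimal WSGI sketch; the `direct_s3_enabled` flag and `presign` helper are hypothetical stand-ins for a Dataverse setting and a presigned-URL generator:

```python
def make_redirect_app(direct_s3_enabled, presign):
    """WSGI app factory. `presign` maps a request path to a temporary S3 URL."""
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if direct_s3_enabled:
            # Feature on: bounce the browser to the short-lived S3 URL.
            start_response("302 Found", [("Location", presign(path))])
            return [b""]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Direct S3 download is not enabled on this installation.\n"]
    return app
```

Because users only ever see (and bookmark) the stable instance URL, the S3 URL behind it can stay short-lived, and installations without public S3 simply never issue the redirect.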

dlmurphy commented 5 years ago

Mockup for this feature:

[Mockup image: largefiledownloadv3]

Note: We'll need a new user guide section called "Downloading Package Files" on the "Finding and Using Data" page.

pdurbin commented 5 years ago

@dlmurphy sure but it should be an iteration on the existing "Downloading a Dataverse Package via rsync" section: http://guides.dataverse.org/en/4.9.4/user/find-use-data.html#downloading-a-dataverse-package-via-rsync

matthew-a-dunlap commented 5 years ago

The DCM side of this work can be followed at https://github.com/sbgrid/data-capture-module/tree/s3_package_zip

matthew-a-dunlap commented 5 years ago

Question: The expectation for this story is to switch how we show the package file from this...

[Screenshot from 2018-11-05 at 3:16:29 PM]

...back to the "normal" style file representation? And then for the download button on that file, have it launch the popup? Thanks!

@mheppler @dlmurphy

EDIT: My understanding is this should only be switched when the file is stored on S3

mheppler commented 5 years ago

@matthew-a-dunlap Correct. For a "package file on S3", we will need the download button returned to the file table, in place of the rsync instructions. The download button will open the Dataverse Package Download popup with the S3 URL.

matthew-a-dunlap commented 5 years ago

I'll be out tomorrow, so here's a status update on this story: The wiring for the file page and after guestbook is mostly done. A few of the values being passed to the popup need to be generalized to work across pages. After that there are only a few minor dcm / dataverse fixes to do. That and improving the styling of the popup.

mheppler commented 5 years ago

Cleaned up the UI of the popup, added a link to the User Guide, as well as a new placeholder section for "Downloading a Dataverse Package via URL" on the Finding and Using Data page of the User Guide.

[Screenshot from 2018-11-30 at 9:36:25 AM]

matthew-a-dunlap commented 5 years ago

Note: There are still two changes incoming for this PR.

  1. Documentation around the :DownloadMethod changes and a .rst page on how to set up dcm s3.
  2. A bump to the dcm version number once this pr is merged https://github.com/sbgrid/data-capture-module/pull/35

I am moving this into code review to have the code itself looked over in parallel while the doc/config changes get wrapped up.

mheppler commented 5 years ago

Commented in issue "Support The Ability To Resume Disrupted File Downloads" #2960, suggesting that we add a similar help message regarding wget and download managers to the Download URL metadata on the file page. I had hoped adding a similar message would be sufficient to close that issue.

kcondon commented 5 years ago

[2018-12-10T13:34:24.210-0500] [glassfish 4.1] [WARNING] [javax.enterprise.web] [tid: _ThreadID=51 _ThreadName=jk-connector(2)] [timeMillis: 1544466864210] [levelValue: 900]
StandardWrapperValve[Faces Servlet]: Servlet.service() for servlet Faces Servlet threw exception
javax.faces.view.facelets.TagAttributeException: /package-download-popup-fragment.xhtml @23,56 Invalid path : file-info-fragment.xhtml
    at com.sun.faces.facelets.tag.ui.IncludeHandler.apply(IncludeHandler.java:129)
    at javax.faces.view.facelets.CompositeFaceletHandler.apply(CompositeFaceletHandler.java:95)
    at javax.faces.view.facelets.DelegatingMetaTagHandler.applyNextHandler(DelegatingMetaTagHandler.java:137)
    at com.sun.faces.facelets.tag.jsf.ComponentTagHandlerDelegateImpl.apply(ComponentTagHandlerDelegateImpl.java:203)
    at javax.faces.view.facelets.DelegatingMetaTagHandler.apply(DelegatingMetaTagHandler.java:120)
    at com.sun.faces.facelets.tag.ui.CompositionHandler.apply(CompositionHandler.java:194)
    at com.sun.faces.facelets.compiler.NamespaceHandler.apply(NamespaceHandler.java:93)
    at com.sun.faces.facelets.compiler.EncodingHandler.apply(EncodingHandler.java:87)
    at com.sun.faces.facelets.impl.DefaultFacelet.include(DefaultFacelet.java:312)
    ... (remaining Facelets / Catalina / Grizzly frames trimmed) ...
    at java.lang.Thread.run(Thread.java:748)

matthew-a-dunlap commented 5 years ago

I've fixed and committed changes for the issues in the above list. Let me know if there is anything else that's needed, thanks!