matthew-a-dunlap closed this issue 5 years ago.
Had some discussion about this in the backlog grooming 8/15 and we want to discuss a few technical approaches in more detail before estimating.
At request of @pameyer - this will initially be unauthenticated
One technical approach I think we should at least consider is the "sync" command from AWS CLI.
Unfortunately, Dataverse users wanting to download files would need to install the AWS CLI, so it would be trickier to support than rsync, which comes standard on Mac and Linux and can presumably be installed without too much trouble on Windows (though I don't know for sure). I have no idea how much configuration it requires for unauthenticated downloads, which is all we said above we want to support. For rsync there is no configuration to do for unauthenticated downloads, which is why our docs at http://guides.dataverse.org/en/4.9.2/user/find-use-data.html#downloading-a-dataverse-package-via-rsync are relatively straightforward.
The docs for "sync" can be found at https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html and it sounds like a hierarchical directory structure is supported: "Recursively copies new and updated files from the source directory to the destination."
An example specific to downloading is provided:
The following sync command syncs files in a local directory to objects under a specified prefix and bucket by downloading s3 objects....
aws s3 sync s3://mybucket .
I found out about "sync" from https://serverfault.com/questions/73959/using-rsync-with-amazon-s3
When I've tested it, AWS S3 sync does support directory hierarchy. I haven't investigated whether (or how) it supports unauthenticated access to public S3 objects, and I'm not fully up to speed on whether a package file stored in S3 corresponds to an S3 bucket, an S3 object, or something else.
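On the unauthenticated-access question: the AWS CLI can talk to public buckets with no credentials at all via `aws s3 sync --no-sign-request s3://mybucket .`, and a public S3 object is also reachable as a plain HTTPS URL that any client (curl, wget, a browser) can fetch. A minimal sketch of that URL mapping; the bucket and key names are made-up placeholders:

```python
from urllib.parse import quote

def public_s3_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Virtual-hosted-style URL for a *public* S3 object.

    Public objects need no signing, so unauthenticated download is just
    an HTTPS GET against this URL. Bucket/key here are hypothetical.
    """
    host = f"{bucket}.s3.amazonaws.com" if region == "us-east-1" \
        else f"{bucket}.s3.{region}.amazonaws.com"
    return f"https://{host}/{quote(key)}"

print(public_s3_url("mybucket", "package/data file.txt"))
# -> https://mybucket.s3.amazonaws.com/package/data%20file.txt
```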
s3 sync thoughts: I looked into the AWS S3 throttling options and there seems to be no way to configure s3 access limits on the server side. The user can put parameters into their ~/.aws/config, but that has to be voluntary: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#configuration-values .
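For reference, those knobs live in each user's own `~/.aws/config` (client side only; nothing we run on the server can enforce them). A sketch using a couple of the values documented on that page:

```ini
# Voluntary, per-user throttling -- Dataverse cannot impose this.
[default]
s3 =
  max_concurrent_requests = 4
  max_bandwidth = 50MB/s
```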
With our current implementation we'd also have to separate the unpublished and restricted files from the s3 bucket.
My vibe is that we shouldn't expose the s3 bucket directly in this way (though we do already expose short-term direct downloads). I wonder if there are simple existing tools (web applications?) that would allow us to add a layer in between for access control and other needs.
@pameyer a package stored in s3 becomes a folder with subfolders
We may be able to create a simple "API Gateway" as a layer between S3 and the world (https://aws.amazon.com/api-gateway/, or an S3 gateway), though we may want to steer away from more AWS services.
Discussed solutions for S3 rsync-uploaded files:
- Direct access via aws commands
- Something leveraging time-sensitive download URLs
- Rsync pointing to S3 mounted (via FUSE) on a box (separate server)
- Dataverse API Gateway (layer in front of S3)
Extra notes from conversation after deciding on zip file download as the best solution
As @matthew-a-dunlap mentioned in the last comment, after meeting during tech hours, we decided on zip file download as the best solution:
(*) need to see here if we can / should support bagit
One thing that may (or may not) be worth checking: I've been using "zip" in this discussion as a stand-in for "generic archive format" ("tar", compressed or uncompressed, might be another alternative). Is there a format- or implementation-imposed limit to the size of a zip file (or tar file)?
@scolapasta I moved this back to the design column because in the sprint planning meeting you mentioned some UI/UX impact. Can you elaborate on what you see as potentially impacting the UI?
Sure. Basically, when packages are large (as will often be the case), it may be best not to have a download button that automatically redirects your browser to the time limited URL, but rather to display that time limited URL (via a popup?) to the user and ask them to use their preferred download manager, or click and let browser download.
We may want to consider this to be the logic for any download? (so as to be consistent?)
According to https://en.wikipedia.org/wiki/ZIP_(file_format)#Limits, standard zip has a max of ~4 GiB for archives (and for individual files within the archive); ZIP64 has a max of ~16 EiB. Info-ZIP (the standard on CentOS 7 / OS X?) supports ZIP64; I haven't checked support on other platforms.
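For what it's worth, Python's zipfile module handles the ZIP64 side transparently, which is a reasonable stand-in for how other modern implementations behave. A quick sketch:

```python
import io
import zipfile

buf = io.BytesIO()
# allowZip64=True (the default in current Python) makes the writer emit
# ZIP64 records whenever the archive or a member crosses the classic
# ~4 GiB boundary; below that, a plain zip is written.
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
    zf.writestr("package/data.txt", "hello from a package file")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    data = zf.read("package/data.txt")
```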
Looking into .tar, it seems like the limit is effectively unlimited, though for some implementations the max file size is 8 GiB: https://www.linuxquestions.org/questions/red-hat-31/tar-file-size-limit-542690/ https://lists.gnu.org/archive/html/help-tar/2015-04/msg00001.html
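The 8 GB figure comes from the plain POSIX ustar header, whose size field is 11 octal digits (so anything >= 8 GiB cannot be encoded); GNU tar's base-256 extension and PAX headers lift that limit. Python's tarfile makes this easy to see:

```python
import tarfile

info = tarfile.TarInfo("big_package.bin")
info.size = 9 * 1024**3  # 9 GiB: over the 8 GiB ustar ceiling

# Plain ustar stores the size in an 11-digit octal field, so encoding
# this header fails outright.
try:
    info.tobuf(tarfile.USTAR_FORMAT)
    ustar_fits = True
except ValueError:
    ustar_fits = False

# The GNU extension encodes large sizes in base-256 and copes fine.
gnu_header = info.tobuf(tarfile.GNU_FORMAT)
```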
One open question is what URL to display:
I had originally assumed it would be the time-limited one, but people were asking whether that would count as separate downloads if used more than once (it would not, since this goes completely outside of Dataverse). In practice, since the URL is time limited, that would/should not happen, so I'd say we should be OK with counting just one download.
Alternatively, the one from the API could work, but it would have the same issue and would likely be worse:
When you go through the UI, the guestbook is filled out and stored in the database as a "download". Then the browser is redirected to the API with a special flag to say when you download don't add a new row. Since we currently don't expose this url it's not so much an issue. But if we now choose to display it, someone could copy and reuse. The difference between this and the S3 url is that this url is not time limited.
So, I suggest we use the S3 time limited URL.
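To make the "time limited" behavior concrete: a SigV4 presigned S3 URL carries its own signing time (X-Amz-Date) and lifetime in seconds (X-Amz-Expires) right in the query string, so expiry can be checked without calling AWS at all. A small stdlib sketch; the URL below is a made-up example:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

def presigned_url_expired(url, now=None):
    """True once a SigV4 presigned S3 URL is past X-Amz-Date + X-Amz-Expires."""
    qs = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(
        qs["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(qs["X-Amz-Expires"][0]))
    return (now or datetime.now(timezone.utc)) > signed_at + lifetime

# Hypothetical presigned URL, signed 2018-12-10 13:34:24 UTC, valid 1 hour.
url = ("https://mybucket.s3.amazonaws.com/package.zip"
       "?X-Amz-Date=20181210T133424Z&X-Amz-Expires=3600&X-Amz-Signature=abc")
```

This also illustrates the reuse point above: within the hour the same URL works any number of times, and afterwards it fails regardless of how Dataverse counts downloads.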
In the long run, we do want to separate guestbook creation from download, and connect them as needed when the download happens (see the work ADA is doing for request access). When we do that, we could stop counting the filling out of the guestbook as the download and instead attach the guestbook id to the download. This would in fact allow the URL to be used only once (the download would fail if no guestbook were attached), so it would be the best of all solutions. But this feels very out of scope for this issue.
@scolapasta pretty please use a URL based on the Dataverse instance. We are planning to use S3 (see #4690), but it will not be publicly accessible.
Maybe you can add a check to the endpoint of the URL if feature "direct download from S3" is enabled and redirect the browser via 302 to the temporary S3 URL?
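That redirect idea is straightforward to prototype. A toy sketch with Python's stdlib http.server; the endpoint path, file id, and S3 URL are all hypothetical, and a real implementation would mint a fresh presigned URL per request rather than use a static table:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping: Dataverse file path -> freshly minted,
# time-limited S3 URL.
PRESIGNED = {
    "/api/access/datafile/42":
        "https://s3.example.org/bucket/package.zip?X-Amz-Expires=3600",
}

class PackageRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = PRESIGNED.get(self.path)
        if target is None:
            self.send_response(404)
            self.end_headers()
            return
        # 302 the browser to the temporary S3 URL only when the
        # "direct download from S3" feature is enabled; otherwise this
        # endpoint could stream the bytes itself.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass
```

Pointing clients at the instance's own URL this way keeps the S3 details (and whether S3 is public at all) entirely server-side.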
Mockup for this feature:
Note: We'll need a new user guide section called "Downloading Package Files" on the "Finding and Using Data" page.
@dlmurphy sure but it should be an iteration on the existing "Downloading a Dataverse Package via rsync" section: http://guides.dataverse.org/en/4.9.4/user/find-use-data.html#downloading-a-dataverse-package-via-rsync
The DCM side of this work can be followed at https://github.com/sbgrid/data-capture-module/tree/s3_package_zip
Question: The expectation for this story is to switch how we show the package file from this...
...back to the "normal" style file representation? And then for the download button on that file, have it launch the popup? Thanks!
@mheppler @dlmurphy
EDIT: My understanding is this should only be switched when the file is stored on S3
@matthew-a-dunlap Correct. For a "package file on S3", we will need the download button returned to the file table, in place of the rsync instructions. The download button will open the Dataverse Package Download popup with the S3 URL.
I'll be out tomorrow, so here's a status update on this story: The wiring for the file page and after guestbook is mostly done. A few of the values being passed to the popup need to be generalized to work across pages. After that there are only a few minor dcm / dataverse fixes to do. That and improving the styling of the popup.
Cleaned up the UI of the popup, added a link to the User Guide, as well as a new placeholder section for "Downloading a Dataverse Package via URL" on the Finding and Using Data pg of the User Guide.
Note: There are still two changes incoming for this PR: the :DownloadMethod changes and a .rst page on how to set up DCM S3. I am moving this into code review so the code itself can be looked over in parallel while the doc/config changes get wrapped up.
Commented in issue Support The Ability To Resume Disrupted File Downloads #2960 suggesting that we add a similar help message regarding wget and download managers to the Download URL metadata on the file page. I had hoped adding a similar message would be sufficient to close that issue.
[x] Cannot mark package file on S3 restricted, not avail in UI
[x] Building pdf version of docs in this branch fails with too many levels.
[x] Installing DCM section under Big Data needs updating since it still refers to this ticket and says downloading not available.
[x] Docker-aio script has incorrect path in ./0prep.sh should ref 0.5 release but refs 0.3: https://github.com/sbgrid/data-capture-module/releases/download/0.3/dcm-0.5-0.noarch.rpm HTTP request sent, awaiting response... 404 Not Found
[x] In dev doc remove line, "Move the S3 variant of the rsync processing script into docker:"
[x] In dev doc remove line, "Install AWS on dcmsrv and symlink it" under optional instructions for DCM S3 variant
[x] In dev doc under "Dataverse configuration (on dvsrv)" remove install AWS instructions and add yum install aws to set up for container.
[x] In dev doc under "Using DCM Docker Containers" update line, "Manually run post_upload.bash on dcmsrv" to include s3 or non s3 post_update scripts.
[x] Question/issue: when S3 enabled, package file is called package_FK26AKIGX.zip in ui rather than the usual package naming convention. Was this intentional/ design decision?
[x] Clicking on add dataset with existing config/upgrade from 4.9.4, no s3 results in exception error:
[2018-12-10T13:34:24.210-0500] [glassfish 4.1] [WARNING] [] [javax.enterprise.web] [tid: _ThreadID=51 _ThreadName=jk-connector(2)] [timeMillis: 1544466864210] [levelValue: 900] [[
StandardWrapperValve[Faces Servlet]: Servlet.service() for servlet Faces Servlet threw exception
javax.faces.view.facelets.TagAttributeException: /package-download-popup-fragment.xhtml @23,56
I've fixed and committed changes for the issues in the above list. Let me know if there is anything else that's needed, thanks!
With #4703 we are supporting storage of data with DCM (rsync) on S3. There is more work needed to allow downloading of the data stored in this manner. Whether this is extending RSAL or providing some other download method (e.g. direct S3 link) needs discussion.