Download Bundle - Githubissues

bbengfort commented 8 years ago

Add a bundle (as in a scikit-learn bundle) download mechanism to the interface. This mechanism should export:

cleaned data files
readme.md
license.txt
citation.bib

looselycoupled commented 8 years ago

I'm on this one.

looselycoupled commented 8 years ago

@bbengfort and @rebeccabilbro, please check out my proposal below and let me know if anything is counter to project requirements.

Proposal:

There are a number of long-term issues with the proposal below, but it gets us closer to what we want. Resolving some of them would require answering outstanding questions or addressing issues that are not yet assigned or identified.

S3 Buckets / Storage: Prepackaged download bundles are stored in the S3 bucket that already holds the base files. The bucket is open to the world, but browsing is not allowed. According to the current code, each dataset has its own folder at datasets/<account>/<dataset>, and each data file is stored there. We will create a bundles directory with a sub-directory for each version, as in datasets/<account>/<dataset>/bundles/12. The bundle filename will always be <dataset>-bundle-v<version>.zip, e.g. floompa-bundle-v12.zip.
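As a minimal sketch of the naming scheme above: the helper name `bundle_key` and the account name `ddl` are hypothetical and only illustrate how the key would be assembled. Nothing here is taken from the project code.

```python
def bundle_key(account: str, dataset: str, version: int) -> str:
    """Build the S3 key for a dataset bundle at a given version.

    Follows the proposed layout:
    datasets/<account>/<dataset>/bundles/<version>/<dataset>-bundle-v<version>.zip
    """
    filename = f"{dataset}-bundle-v{version}.zip"
    return f"datasets/{account}/{dataset}/bundles/{version}/{filename}"


# "ddl" is a made-up account name used only for illustration.
print(bundle_key("ddl", "floompa", 12))
# datasets/ddl/floompa/bundles/12/floompa-bundle-v12.zip
```

Keeping the key construction in one function would make it easy to switch to the alternative bundles/<account>/<dataset>/<version> layout later.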

Question: Alternatively we could create bundles/<account>/<dataset>/<version> and keep them somewhat separate. Thoughts? Is that even needed long term for security or other reasons?

Question: Should we keep bundles in a different bucket and use a UUID as the folder name, so that only those who have the link for a private dataset can download it? Should we just use a UUID as the bundle name, or is there a requirement that it be a friendly filename in some way?

Security: For the moment, anyone who has a link can download the bundle, even if it's a private dataset.

Bundle generation: Whenever an update is needed, a new Celery task is enqueued to replace (or initially add) a bundle. Presumably one could trigger a bunch of updates relatively quickly. There is a timing problem here in that only the latest bundle is ever generated. I'd like to punt on this problem until I have a better idea of how we are versioning the individual files (it seems easily solvable in the future).

User Interface: Users can use the download link on the project page. If no bundle is available yet, a pop-up message is displayed (I can also color-code the download button yellow until it is ready). Otherwise the download link points directly to the S3 HTTP download. I'll likely add a new dataset field to indicate whether the bundle is ready, either a simple boolean or perhaps something more informative. What might be best is a DatasetVersion model to map DataFiles to Datasets. That would be a natural place for status and would give us more flexibility in the future.
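A rough sketch of the "something more informative than a boolean" idea, written as plain Python rather than as a Django model. The names `BundleStatus` and `download_action`, the `FAILED` state, and the red button color are all assumptions made for illustration. Only the yellow "not ready" state and the direct download link come from the proposal.

```python
from enum import Enum
from typing import Optional


class BundleStatus(Enum):
    """Hypothetical status values for a dataset version's bundle."""
    PENDING = "pending"  # bundle task enqueued or still running
    READY = "ready"      # bundle uploaded and downloadable
    FAILED = "failed"    # assumed extra state, not in the proposal


def download_action(status: BundleStatus, bundle_url: Optional[str]) -> dict:
    """Decide what the download button should do for a given status."""
    if status is BundleStatus.READY and bundle_url:
        # Bundle exists: link straight to the stored zip file.
        return {"action": "redirect", "url": bundle_url, "button": "default"}
    if status is BundleStatus.FAILED:
        return {"action": "popup", "message": "Bundle generation failed.", "button": "red"}
    # Not ready yet: show a pop-up and color the button yellow.
    return {"action": "popup", "message": "Bundle is not available yet.", "button": "yellow"}
```

An enum like this could later become a `choices` field on a DatasetVersion model, which would keep the status next to the version it describes.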

looselycoupled commented 8 years ago

[x] Develop new DatasetVersion model
[ ] Develop migration file for existing data?
[x] Modify upload code to increment dataset version
[x] Develop Celery task to bundle content and update the version record
[x] Color code Download link
[ ] Provide popup with download links for available versions

bbengfort commented 8 years ago

Point on security: at the moment (I believe) the bucket requires a token to give up the goods, and that token is generated via boto through the Django Storages app. The token grants the user a download, and the link only lasts for 6 hours or something, meaning that a link is never created for a user who doesn't have permission.

If this is not the case, then I must have manually edited the bucket for development reasons, and we should go back to the token method above.

bbengfort commented 8 years ago

Also, I'm happy to store the bundles on S3 if that's what you think we should do. However, I was planning to generate the zip file on demand from the things that are in the database, via the zipfile library and StringIO objects, sort of like "Use compressed data directly – from ZIP files or gzip http response".

Maybe you're thinking this doesn't scale, which is fair, so bundles/account/dataset-version.zip seems fine to me. All the rest of your proposal looks good to me.
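For reference, the on-demand approach could look roughly like the sketch below. It uses `io.BytesIO` instead of StringIO, since zip data is binary on Python 3. The function name `build_bundle` and the sample file names and contents are placeholders, not project code.

```python
import io
import zipfile


def build_bundle(files: dict) -> bytes:
    """Pack a mapping of archive name -> content into an in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            # writestr accepts str or bytes; str is encoded as UTF-8.
            archive.writestr(name, content)
    return buffer.getvalue()


# Placeholder contents standing in for what would come from the database.
bundle = build_bundle({
    "floompa.csv": "a,b\n1,2\n",
    "readme.md": "# Floompa\n",
    "license.txt": "placeholder license text\n",
    "citation.bib": "placeholder citation\n",
})
```

The resulting bytes could either be returned directly in an HTTP response or uploaded to S3 by a background task, so the same function would serve both approaches discussed here.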

looselycoupled commented 8 years ago

Current status: A new bundle is created whenever a file is added and the download link works correctly.

Todo: The only major item left is to create a new many-to-many relationship so that we can keep track of which files go with which versions. Right now everything maps to the latest version, which is the only download provided. The goal is to keep track of the dataset at every version and offer downloads for each.
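To illustrate the version-to-files mapping, here is a small in-memory sketch. It assumes that each uploaded file creates a new version containing all earlier files plus the new one. The class `VersionHistory` and its methods are hypothetical. In the actual app this would presumably be a many-to-many field between the version and data file models.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class VersionHistory:
    """Tracks which files belong to each dataset version (illustrative only)."""
    versions: Dict[int, Tuple[str, ...]] = field(default_factory=dict)

    @property
    def latest(self) -> int:
        # 0 means no version has been created yet.
        return max(self.versions, default=0)

    def add_file(self, filename: str) -> int:
        """Create a new version: the previous version's files plus the new one."""
        current = self.versions.get(self.latest, ())
        new_version = self.latest + 1
        self.versions[new_version] = current + (filename,)
        return new_version

    def files_at(self, version: int) -> Tuple[str, ...]:
        """Return the files that made up the dataset at a given version."""
        return self.versions[version]


history = VersionHistory()
history.add_file("floompa-2015.csv")
history.add_file("floompa-2016.csv")
print(history.files_at(1))  # ('floompa-2015.csv',)
print(history.files_at(2))  # ('floompa-2015.csv', 'floompa-2016.csv')
```

Because every version keeps its own file list, a bundle for any earlier version can be rebuilt or linked without touching the latest one.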

bbengfort commented 8 years ago

I like the idea of being able to download a dataset at previous versions - that will help with estimator reproducibility and a host of other items.

DistrictDataLabs / cultivar

Download Bundle #59