mankoff opened this issue 3 years ago
ADA has the same requirement - we make extensive use of zips for various reasons including the cases above.
@mankoff @stevenmce (and others reading this) how should it look in the UI? Please feel free to draw on a cocktail napkin. 😄
Do you agree with "Add a checkbox to disable unzipping in order to push zipped files" which is how #3439 is worded?
That way, Dataverse is discouraging (as @mankoff said) the uploading of zips (unzipping would be the default), but you can opt out of unzipping by checking a box.
A cool new feature in 5.12 is support for a Zip Previewer and file extractor, an external tool. Basically, it allows you to navigate within zip files uploaded to Dataverse and download this or that file from within the zip. Pretty neat! But would it encourage more zips? 😄 Either way, I think we should let people more easily decide if they want zips or not. Give them that checkbox (or whatever UI), I say, so they don't have to resort to double zipping.
Finally, should we "small chunk" this by offering the ability to opt out of unzipping per upload via the API? Is that of value? We recently started allowing people to opt out of ingest via the API, thanks to @lubitchv in PR #8532.
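For reference, the existing ingest opt-out looks roughly like this via the native API (a sketch only; the server URL, DOI, API token, and file name below are placeholders):

export API_TOKEN=xxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

# Upload a tabular file but ask Dataverse to skip ingest (tabIngest=false)
curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@data.tsv" \
  -F 'jsonData={"description":"Raw data","tabIngest":"false"}' \
  "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"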
Hello, just to add that there is also a restriction on the number of files in a zip, which causes trouble, especially when importing a large number of datasets from other sources. In such cases we would prefer to keep the zip as is by default.
I'd just like to note that @mankoff let us know that he's personally less involved with a Dataverse installation these days so perhaps a fresh issue with a new champion would be good. Or I'd be happy to create a Google doc if the new champions would prefer to refine some ideas there first. Please get in touch!
My suggestion would be an API-first approach. There's a new ability to skip ingest of tabular files when uploading via the API (tabIngest=false), so perhaps we could have a similar unzip=false option when uploading via the API. Perhaps the new dvwebloader could use such an unzip=false API option, if we added it.
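If the proposed unzip=false option were added and behaved like tabIngest, the upload call might simply gain one more key in jsonData. This is purely hypothetical, since no such parameter exists today (same placeholder token, server, and DOI as in the earlier example):

# Hypothetical: keep the uploaded zip as a single file instead of extracting it
curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@images.zip" \
  -F 'jsonData={"unzip":"false"}' \
  "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"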
My use case for zip archives is a large number of ultrasound images that I am archiving by type. I uploaded one of these zip files (11.zip, containing about 650 images) to a Dataverse dataset today (6/1/2023) via the API. I was not expecting the zip to auto-unzip and leave me without a zip file reference within the dataset, but that is what happened. I do not foresee any use cases where researchers will want to pick out individual files for analysis, so creating an individual DOI for each of the subfiles in my zip archives seems unnecessary and wasteful. Having MD5 or some other checksum seems sufficient, and if the files are described sufficiently in the metadata, then having Dataverse auto-extract the contents of the zip file and pollute the files list in the web interface is not desirable and makes it less user friendly.

If the API would allow for quick extraction of files based on the directoryLabel, then the auto-extraction of archives in Dataverse is slightly more palatable. As it stands, I cannot simply make an API query for 11.zip, extract the dataset ID, and download the data from that one archive; instead I would need to query an API endpoint containing the directoryLabel of the individual files from the 11.zip file, parse out those dataset IDs, and then build a POST request with those IDs to retrieve the files.
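Concretely, that multi-step workflow might look something like the sketch below (assuming jq is available, that the extracted images were given a directoryLabel starting with "11", and that API_TOKEN, SERVER_URL, and PERSISTENT_ID are set to placeholder values as in the earlier example):

# List files in the latest dataset version and keep only the IDs of those
# whose directoryLabel starts with the folder extracted from 11.zip
FILE_IDS=$(curl -s -H "X-Dataverse-key:$API_TOKEN" \
  "$SERVER_URL/api/datasets/:persistentId/versions/:latest/files?persistentId=$PERSISTENT_ID" \
  | jq -r '[.data[] | select(.directoryLabel != null and (.directoryLabel | startswith("11"))) | .dataFile.id | tostring] | join(",")')

# Download those files as a single zip bundle via the Data Access API
curl -L -O -J -H "X-Dataverse-key:$API_TOKEN" \
  "$SERVER_URL/api/access/datafiles/$FILE_IDS"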
With that said, there are many good reasons to keep zip files intact, as unextracted resources within a dataset. Some sets of files should only exist as a group and not separately, and minting a DOI for each individual file of a large archive can be an unnecessary resource drain and can make it appear that the upload failed due to the long processing times (which is what I am experiencing currently).
"If the API would allow for quick extraction of files based on the directoryLabel, then the auto-extraction of archives in Dataverse is slightly more palatable."
Yes, this is already supported. You can find some screenshots here:
For now, the workaround to keep the zip a zip is to double zip. I do it myself: https://github.com/pdurbin/open-source-at-harvard-primary-data/commit/2092b752c5251d7690afec8adf9996e0dde0ab8f
Here's a copy/paste of what I do (I'm on a Mac):
# Zip the data folder, excluding dotfiles and macOS metadata
zip -r primary-data.zip primary-data -x '**/.*' -x '**/__MACOSX'
# Zip the zip: Dataverse will unzip the outer layer and keep the inner zip intact
zip -r outer.zip primary-data.zip -x '**/.*' -x '**/__MACOSX'
In Dataverse, outer.zip is unzipped, leaving primary-data.zip: https://dataverse.harvard.edu/file.xhtml?fileId=6867328&version=4.0
Following up on #3439, because that issue is closed.
There are many good reasons to discourage uploading archived files, summarized as:
However, there are many use cases where archives (e.g. ZIP files) are necessary, or at least a major improvement over unarchived uploads. Dataverse should discourage archive uploads but still support them.
One simple use case is sharing a shapefile, which itself is already not very FAIR. This file format is actually a folder of n files (6 or 7?), one of which has the extension 'shp' and is also called a 'shapefile'. However, if users only download that file, it does not work; they need the entire folder. There is no benefit to exposing these 6 or 7 binary files individually; they should be distributed as a package, and the most common and supported package is a ZIP file.
Another use case: 100,000 small text files that might make up a data set, all working together. One could argue it is un-FAIR to mint 100,000 new DOIs and require the user to deal with each of these files individually, especially given existing issues with Dataverse bulk download.