IQSS / dataverse

Open source research data repository software
http://dataverse.org

Add a checkbox to disable unzipping in order to push zipped files #3439

Closed bjonnh closed 5 years ago

bjonnh commented 7 years ago

I have a use case (NMR datasets) where the data are (or can be, depending on the format) composed of multiple files in a directory hierarchy.

The way we handle this now is double zipping.
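A minimal sketch of that workaround, assuming a hypothetical dataset directory named nmr_dataset/ (the paths are placeholders, not from this thread): the outer zip is the one Dataverse unpacks on upload, so the inner zip survives as a single file.

```python
import zipfile
from pathlib import Path

dataset_dir = Path("nmr_dataset")    # placeholder directory
inner_zip = Path("nmr_dataset.zip")

# First zip: capture the full directory hierarchy of the dataset.
with zipfile.ZipFile(inner_zip, "w", zipfile.ZIP_DEFLATED) as zf:
    for path in dataset_dir.rglob("*"):
        zf.write(path, path.relative_to(dataset_dir))

# Second zip: wrap the first one. Dataverse unpacks only this outer
# layer, so the inner zip lands in the dataset as a single file.
with zipfile.ZipFile("nmr_dataset_wrapped.zip", "w") as outer:
    outer.write(inner_zip, inner_zip.name)
```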

People in my group had trouble the first time they tried using DV because of this issue.

Would it be possible to add a checkbox saying "do not unzip" to the upload system? (pdurbin told me this used to be an option.)

J.

pdurbin commented 7 years ago

Right, @bjonnh and I were talking about this at http://irclog.iq.harvard.edu/dataverse/2016-10-30#i_44159 and I can't find anything about this feature at http://guides.dataverse.org/en/3.6.2/dataverse-user-main.html but if memory serves there was a checkbox to prevent unzipping on upload.

While we're adding the checkbox, we should also make sure that an equivalent boolean is added to the new "native add" API being developed in #1612. Otherwise, this will be a GUI-only feature.
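Purely as an illustration of what that boolean could look like on the API side, here is a hedged sketch; the endpoint shape and especially the dontUnzip flag are assumptions for discussion, not an existing parameter.

```python
import requests

# All of these values are placeholders.
base_url = "https://demo.dataverse.org"
api_token = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
persistent_id = "doi:10.5072/FK2/EXAMPLE"

with open("nmr_dataset.zip", "rb") as f:
    resp = requests.post(
        f"{base_url}/api/datasets/:persistentId/add",
        params={"persistentId": persistent_id},
        headers={"X-Dataverse-key": api_token},
        files={"file": f},
        # Hypothetical flag telling the server not to unpack the zip.
        data={"jsonData": '{"dontUnzip": true}'},
    )
resp.raise_for_status()
```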

I wonder if this will be trickier now that there's a "drag and drop" component for uploading files. Hmm. Maybe you'd have to tick the "do not unzip" checkbox and then drag the files over.

To be clear, all this "double zip" business is really a workaround for the following shortcomings:

lmaylein commented 7 years ago

Heidelberg University Library would appreciate this feature.

asconrad commented 7 years ago

One of the use cases in our Dataverse pilot was astrophysical datasets, each organised with different kinds of data around one particular star. We built the datasets in BagIt format for long-term preservation and zipped them to preserve the structure and save space. For this use case it would be good to keep the zip intact. A "don't unzip" flag would, however, need to be accessible via the API as well to be really valuable to us.

shlake commented 7 years ago

UVa would like a way to turn off auto-unzipping of zip files, and yes, this needs to be offered via the API as well (keeping zips zipped).

nmedeiro commented 7 years ago

Haverford (and Project TIER) would also appreciate having .zip retention/extraction an option at the time of submission.

Best, Norm



pdurbin commented 5 years ago

Related: #5396

pdurbin commented 5 years ago

I'm just copying and pasting the feedback from @amberleahey at https://github.com/IQSS/dataverse/issues/2107#issuecomment-467976350

"Hi folks! I know this post is old but I wanted to chime in and ask if there are plans to add the option to upload a .zip and NOT unpack? This would allow authors to choose to upload and retain zip if they wanted, maybe it would be default, but at least present the option. I think the inability to do so at the moment means users are double zipping and creating tar zip packages to get around it. We noticed this in our instance anyway. Any thoughts on reviving this convo?"

djbrooke commented 5 years ago

We have file hierarchy support in Dataverse now (hooray!) so I'm going to adjust the title here and close out a similar issue in #2107.

pdurbin commented 5 years ago

Just to remind everyone, @rmo-cdsp already made pull request #5396 to address this issue, to add a checkbox in the UI that says "Unzip zip files". He also implemented an API equivalent. So we can warm up that pull request (merge the latest from develop) if we'd like to test it. I'd be happy to spin up a branch so people can take a look. I haven't seen the UI myself.

djbrooke commented 5 years ago

I don't think we should implement this in the codebase, as keeping the files zipped means users miss out on a lot of file-level options (ingest, exploration, UNFs), and I think we'd want individual files rather than a zip for preservation purposes. By implementing file hierarchy we picked off one use case for keeping things zipped, and I'm interested in the other use cases so that we can discuss ways of addressing them without encouraging people to keep things zipped. Maybe an expansion of package files can help here.

pdurbin commented 5 years ago

@djbrooke I absolutely agree that understanding the use cases, the reasons why people feel the need to have zipped files in Dataverse, is crucial. I agree that zip is a suboptimal preservation format.

From the comments above, here is my take on why people who have written in this issue want zipped files in Dataverse:

If anyone reading this could elaborate on why you want to store zipped files in Dataverse, please leave a comment. Thanks! 😄 🙏

nmedeiro commented 5 years ago

I wanted to preserve the file hierarchy so code referencing data in different folders wouldn't fail.



shlake commented 5 years ago

My original want was to preserve the file hierarchy (DONE!!! Thanks). But recently I got a request to upload "lots" of files, and the researcher did not want "lots" of individual files in their dataset. They wanted to zip the files (and keep them zipped) and upload them as one.

I am curious what @djbrooke means by "expansion of package files". Does Dataverse have "package files"?

scolapasta commented 5 years ago

@shlake "Package" files are what we use to track uploads via rsync. In the orginal use case, individual files didn't matter, an end user would only want to download them as a "package".

When I first suggested the concept of "package files", I had in mind that this would be the way Dataverse could deal with any set of files that fits these criteria, not just uploads via rsync.

On the back end we could unzip and store individually (for preservation), as well as keep a copy of the zip for easy download.

I believe this is what we do now with package files to allow download via S3.

jggautier commented 5 years ago

Some depositors wanted files kept zipped to avoid Dataverse's duplication checks.

shlake commented 5 years ago

@jggautier YES! I forgot about that - to avoid Dataverse's duplication check!

scolapasta commented 5 years ago

@jggautier @shlake Sure, but in that case it's a workaround for the real desire, which is to allow Dataverse to accept the same file (same checksum) multiple times. We have discussed this separately and may change the rule (at least allowing it if the files are in different directories, or making it a warning instead of an error).

shlake commented 5 years ago

@scolapasta - now I remember that one UVa dataset had a duplicate file.

Here are my comments from a Google Group discussion: https://groups.google.com/forum/?hl=en#!topic/dataverse-community/FLnm8-60sOs

I am not in the business of questioning why a researcher has "duplicate" files with different file names in their dataset. So is there any workaround for Dataverse to accept these files?

If a researcher has two files that just happen to contain the same information (the same checksum), I don't think that should stop the files from being uploaded; maybe they could be flagged instead. There may be a reason for different filenames w/ same content (for example, files used in a script as part of the analysis, where the filename matters to the script and is thus important for transparency and understanding of the methodology).

pdurbin commented 5 years ago

> There may be a reason for different filenames w/ same content

I just gave an example of how it's common in the Python world to see empty __init__.py files scattered throughout a project over in the "As a researcher, I need to publish a dataset that contains files with the same content, which are handled differently" issue at https://github.com/IQSS/dataverse/issues/4813#issuecomment-512912799
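To make the duplicate-check point concrete, a quick sketch showing why such files collide: empty files (like those __init__.py files) have identical content and therefore identical checksums, regardless of their names or locations.

```python
import hashlib

# Two differently named but byte-identical (empty) files.
init_a = b""  # e.g. pkg_a/__init__.py
init_b = b""  # e.g. pkg_b/__init__.py

print(hashlib.md5(init_a).hexdigest())  # d41d8cd98f00b204e9800998ecf8427e
print(hashlib.md5(init_b).hexdigest())  # same digest, so a strict
                                        # duplicate check rejects one of them
```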

Thanks to all for all the feedback so far! Much appreciated! I'm always interested in the "why". 😄

Another thought: what if there were a checkbox in the UI to disable unzipping, but it could be turned off at the installation level? (Or the opposite: installations would have to explicitly turn the checkbox on.) That way installations would have some choice.

bjonnh commented 5 years ago

I believe this all solves the issue I raised, as the feature is now present and working. Thanks to all of you who have been involved.

mankoff commented 3 years ago

Why is this issue closed? Was something implemented for this feature?

djbrooke commented 3 years ago

Hi @mankoff - the original reporter closed it.

The strategy around this has been to better handle the specific cases about "why" people want to keep their files zipped, instead of providing the option to keep the files zipped during upload. The reasoning is that zips are not as FAIR, there is a significant file-level feature set missed out on with zips, and zips are not as preservation-friendly. We've added support for lots of use cases (file hierarchy, duplicate MD5s, duplicate file names in different folders, etc.) that were brought up as reasons for keeping things zipped, but I'm sure there are others as well.

mankoff commented 3 years ago

Thanks for the explanation. Our use cases: Shapefiles. Or a group of 10 little MATLAB functions.

foobarbecue commented 3 years ago

We would also really like to have this checkbox. We have people who like to upload zip files containing tens of thousands of images and text files (e.g. training sets or engineering datasets). Dataverse chokes on these .zips trying to unzip them. Maybe I could talk them into using tarballs or something... For now, we're using the zip-in-a-zip workaround.

lmaylein commented 3 years ago

@djbrooke

> The strategy around this has been to better handle the specific cases about "why" people want to keep their files zipped, instead of providing the option to keep the files zipped during upload.

I would have fewer problems importing data with more complex folder structures if the tree view were the default. I'm afraid many users overlook the option to switch to the tree view.

mankoff commented 3 years ago

@lmaylein Tree view isn't even an option if there are no sub-folders, so it shouldn't (currently) be the default, because making it the default only when sub-folders exist would lead to two very different views depending on the dataset. However, I agree with you. Tree view as a default should be a DV-level option, and datasets without folders should present as ./ or /.

I believe it is Dataverse / IQSS policy not to re-open closed issues, so this discussion is probably occurring in the wrong place. I suggest someone here open a new ticket. I agree with @djbrooke and his comment above about all the issues with ZIP files. But Dataverse policy here may be letting perfection be the enemy of the good. Perfect FAIR-ness and no ZIP support will drive some users to other solutions. I think the correct behavior is to allow ZIP files but discourage them. Make us jump through some hoops to do it, but allow it. Archives of some type are a requirement for certain types of data.

djbrooke commented 3 years ago

@lmaylein @mankoff @foobarbecue - thanks for the discussion here. I do think a new issue would be a good way to restart the discussion. I still have reservations about this, but it would be good to reset on the remaining (or new!) use cases for why files should remain zipped.

Regarding the tree view, we may revisit it in the future but it was initially implemented as just a view without any of the usual file-level options. I'd be hesitant to make it the default view.

mankoff commented 3 years ago

See #8029.