IQSS / dataverse

Open source research data repository software
http://dataverse.org

File Hierarchy: I want to be able to preserve my dataset's files' directory structure, for easy import, computation, and navigation. #2249

Closed eaquigley closed 5 years ago

eaquigley commented 9 years ago

User community request to be able to organize files in a dataset in a hierarchical manner so when a user exports files from Google Drive, Dropbox, or OSF (or any other location), the file structure is maintained without causing the user extra work on the Dataverse side.

dpwrussell commented 9 years ago

One highly related practice, extremely prevalent in microscopy and, I would guess, in other fields: in addition to encoding metadata into the directory hierarchy, people have also encoded it into the filenames, usually underscore-separated.

E.g. /usr/people/bioc0759/data/EB1-posterior-polarity/EB1-Colcemid-UV-inactivation/RMP_20090228_colcemid_UV_inactivation/rmp_20090228_colcemid-6hrs_EB1EB1_stg9_Az_18_R3D.dv

Some of this will be important metadata, some of it may not be. The ability to automatically or semi-automatically import some of this metadata (but hopefully not the junk) in the form of tag annotations sounds useful so that search/filtering can make use of them.

dpwrussell commented 9 years ago

It might be useful to have a look at this tool, built for the Open Microscopy Environment. The UI is not beautiful, but it does this kind of metadata extraction into Tag Annotations: https://www.openmicroscopy.org/site/products/partner/omero.webtagging/

There is also a "search" tool which should really be called "navigation" because it allows the user to browse the graph of tags from any origin point. This resembled filesystem navigation somewhat and seemed to satisfy some users.

Caveat emptor: because the tags are stored in a relational DB, the queries that drive this navigation can get very slow when there are large numbers of tags and/or large amounts of data tagged with them. Ideally, a graph DB would be kept in sync for this functionality to make these queries performant.

eaquigley commented 9 years ago

FRD for this feature (work in progress): https://docs.google.com/document/d/1PqL6EljP-N51rt3puy3HedStrnV5DOJ3Gf7H_zPHcA0/edit?usp=sharing

pdurbin commented 8 years ago

Feedback from @pameyer: "preserving file naming and directory structure (with the exception of files.sha which holds the checksums) is important for users downloading the dataset, and doing computation locally on it".

Mostly I just want to make it clear that download is a use case. (We probably need a separate issue to talk about running computation on files.) In the FRD above this is currently a question ("Do these carry over into a folder structure when downloaded as a zip?") and the answer for many users, I think, is that they want/expect to be able to upload a zip and later download a zip that has the same directory structure inside it. Some months ago @cchoirat was talking about the importance of this for her (though she may not have been talking about zip files specifically). It's a common expectation. Right now Dataverse flattens your files into a single namespace/directory on upload.

leeper commented 8 years ago

I think this would be really valuable. It was how things worked in versions < 4.0, as I recall, and the current behavior when uploading a project (e.g., via the API) is somewhat unpredictable.

One possibility might be to do what S3 does with object keys that can have slashes in them:

Note that the Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using keyname prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders.

The examples they give of object keys are:

Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf

This would allow a "flat" Dataset to contain files that can be batch downloaded into a hierarchical structure. Of course, I don't know if that works on the backend.
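The S3 approach above can be sketched in a few lines. This is illustrative only (not Dataverse or S3 code): it shows how a flat namespace of slash-delimited keys can be rendered as a folder hierarchy, the way the S3 console infers folders.

```python
# Illustrative sketch: inferring a folder hierarchy from flat,
# slash-delimited object keys, as the S3 console does.

def build_tree(keys):
    """Nest flat keys into a dict-of-dicts; files map to None."""
    tree = {}
    for key in keys:
        node = tree
        parts = key.split("/")
        for folder in parts[:-1]:
            node = node.setdefault(folder, {})  # create folder level
        node[parts[-1]] = None  # leaf = file
    return tree

keys = [
    "Development/Projects1.xls",
    "Finance/statement1.pdf",
    "Private/taxdocument.pdf",
    "s3-dg.pdf",
]
tree = build_tree(keys)
print(tree)
```

The storage layer stays flat; only the display layer interprets the delimiters.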

pdurbin commented 8 years ago

This issue was raised yesterday by @pameyer and others from @sbgrid . @bmckinney if you want you could assign yourself to this issue to at least think about it. I remember @dpwrussell of OMERO fame talking about it during the 2015 Dataverse Community Meeting.

@leeper you're right. From what I've heard from @landreev in the DVN 3.x days a zip download would sort of reconstruct the file system hierarchy. I'm obviously fuzzy on the details.

wddabc commented 8 years ago

I'm wondering whether there is a way to upload an entire directory (for example, by dragging a folder; currently only dragging a file is supported) so that the structure is maintained. The user could then browse the directories and files simply by clicking through, as on Dropbox or GitHub, without explicitly downloading and unzipping the data.

jeisner commented 8 years ago

:+1: on this request. The directory structure is often important. Sometimes there are even multiple subdirectories that contain identically named files, e.g., for different experimental subjects or different versions of an experiment.

@pdurbin is correct that download is a use case. So is online browsing of the dataset to get a feel for what's there -- the directory structure provides very useful organization.

pdurbin commented 8 years ago

My biggest concern with this issue is how versioning of files would be supported. Imagine if you could just rsync some files up to Dataverse. Then you publish your dataset as 1.0. You add some more files and publish the 2.0 version of your dataset. How do you toggle back and forth between the files in 1.0 and 2.0? When you uploaded your files the first time did they go into a directory called "1"? Does that directory get copied to "2" and then you can start uploading additional files there? That sounds potentially expensive on a file system that doesn't do any de-duplication. Do we use ZFS and snapshot the directory whenever a dataset version is published? How do other systems handle this?

jeisner commented 8 years ago

@pdurbin I'm confused why hierarchical directory structure is related to versioning. If the directory is flat, don't you still have the problem that version 1.0 and version 2.0 might have some identical or similar files?

That said: The standard way to store identical files is symlinks. And the standard way to store similar files is to store only the latest version, together with diffs that make it possible to reconstruct the earlier versions on demand (as version-control systems from CVS to git have always done). Ideally, these storage details would be invisible to the user, who can just decide which version they want to grab (latest by default).
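The dedup idea above can be made concrete with a toy sketch. This is an assumption for illustration, not how Dataverse actually stores data: content-addressed storage, where identical files across dataset versions are stored once and each version is just a manifest mapping paths to content hashes.

```python
# Toy sketch (illustration only, not the Dataverse storage model):
# content-addressed storage deduplicates identical files across versions.
import hashlib

store = {}     # sha256 hex digest -> bytes (each blob stored once)
versions = {}  # version label -> {path: sha256 hex digest}

def add_version(version, files):
    """Record a dataset version as a manifest of path -> content hash."""
    manifest = {}
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # dedup: identical content stored once
        manifest[path] = digest
    versions[version] = manifest

add_version("1.0", {"data/input.csv": b"a,b\n1,2\n", "README": b"v1"})
add_version("2.0", {"data/input.csv": b"a,b\n1,2\n", "README": b"v2"})

# The unchanged file is shared between versions, not copied:
assert versions["1.0"]["data/input.csv"] == versions["2.0"]["data/input.csv"]
print(len(store))  # 3 blobs back 4 logical files
```

This is essentially what git does under the hood, and it sidesteps the copy-the-directory-per-version cost raised above.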

djbrooke commented 8 years ago

@pameyer brought this up on the community call 8/16 and asked if it would be included in 4.6. There are currently no plans to work on this for 4.6.

djbrooke commented 8 years ago

@cchoirat stopped by and asked about this today and reiterated its usefulness.

jeisner commented 8 years ago

Yes, please. We are about to upload a directory hierarchy. At present, we have to do it as a single tarball, which means that browsing / replacement / versioning of single files is not possible.

mheppler commented 8 years ago

Related to #3247

nmedeiro commented 8 years ago

Retention of directory structure in a .zip file is a critical component of our Project TIER protocol, where we teach students how to create reproducible empirical research. The inability to upload and download .zip files that retain this structure is a real impediment to our application of Dataverse as a platform to showcase these efforts. It seems to me that users upload .zip files either for convenience (many files at one time) or necessity (folder structure is important to retain). Providing an option -- to unzip or leave intact -- on upload might satisfy both communities.

pdurbin commented 8 years ago

@nmedeiro as a workaround, your students can "double zip" their files as discussed in #2107.

@jeisner we still owe you a response to your question at https://github.com/IQSS/dataverse/issues/2249#issuecomment-222241350 about how hierarchical directory structure is related to versioning. I'm not an authority on this but if you look at http://phoenix.dataverse.org/schemaspy/latest/tables/filemetadata.html (screenshot below), you'll see how rows in the datafile table are associated with rows in the filemetadata table, which are associated with rows in the datasetversion table. The filesystemname field in datafile is a random UUID which is what the file is renamed to on disk after it is uploaded. All these files that have been renamed to UUIDs on disk are stored in a single directory for each dataset (and different versions of the dataset can include these files per the associations above). I can keep going but I hope this gives a flavor of how the system works now.

screen shot 2016-09-02 at 9 00 06 am
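The storage model pdurbin describes can be sketched as a toy in-memory model. The names here are hypothetical simplifications, not the real schema: each uploaded file is renamed to a UUID on disk, and each dataset version keeps metadata rows pointing at those immutable blobs, so versions can share files without copying them.

```python
# Toy model (hypothetical names, not the real Dataverse schema) of the
# layout described above: files become UUID-named blobs on disk, and
# per-version file metadata points at them.
import uuid

datafiles = {}         # storage UUID -> original filename at upload
dataset_versions = {}  # version label -> list of (display label, storage UUID)

def upload(version, filename):
    storage_id = str(uuid.uuid4())  # the "filesystemname" on disk
    datafiles[storage_id] = filename
    dataset_versions.setdefault(version, []).append((filename, storage_id))
    return storage_id

sid = upload("1.0", "survey.dta")
# A later version can reference the same blob without copying it:
dataset_versions["2.0"] = list(dataset_versions["1.0"])
assert dataset_versions["1.0"][0][1] == dataset_versions["2.0"][0][1]
```

Note that in this model the on-disk name carries no hierarchy at all, which is why directory structure would have to live in the metadata layer.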

@bmckinney is giving a demo next week of some potential changes in this area as part of #3145 which will help guide future direction.

mheppler commented 8 years ago

From @pameyer in #3247

For data files transferred through the Data Capture Module, datasets with data files that have directory structure available should display that structure.

Related to #2249, but distinct in that this doesn't require supporting re-organization of uploaded files (and possibly should support disabling such an option in the UI).

pdurbin commented 8 years ago

@nmedeiro see also this issue opened by @bjonnh about automating the "double zip" workaround: #3439

christophergandrud commented 7 years ago

Let me pile on some more for hierarchical file structure support :)

pdeffebach commented 6 years ago

Just want to point out that this issue is important for the mission of Dataverse. A key part of reproducibility is good data hygiene. And good data hygiene means nested folders with input, output, code etc. Having this feature makes Dataverse more consistent!

pdurbin commented 6 years ago

@pdeffebach thanks for the comment and for chatting over at http://irclog.iq.harvard.edu/dataverse/2018-01-11#i_62119 ! I'm glad the double zip workaround is working for you.

While I'm leaving a comment here, I thought I'd mention that there's also interest in the feature over at https://twitter.com/bshor/status/949417291132887041 which reads:

"@dataverseorg Is there any way to maintain the folder structure of studies in Dataverse? Seems mine was melted away."

I feel like there was another recent tweet but I can't find it. Suffice it to say there is broad interest in this feature.

setgree commented 6 years ago

To build on @pdeffebach's comment, here are some folks making the same point in general: http://kbroman.org/steps2rr/pages/organize.html "Perhaps the most important step to take towards ease of reproducibility is to be organized...Separate the data from the code. I prefer to put code and data in separate subdirectories."

http://www.fragilefamilieschallenge.org/author/matt-salganik/ "...we think it will be helpful to organize your input files, intermediate files, and output files into a standard directory (i.e., folder) structure. We think that this structure would help you create a modular research pipeline; see Peng and Eckel (2014) for more on the modularity of a research pipeline. This modular pipeline will make it easier for us to ensure computational reproducibility, and it will make it easy for other researchers to understand, re-use, and improve your code.

Here’s a basic structure that we think might work well for this project and others:

data/
code/
output/
README
LICENSE 

"

I really appreciate your openness to community feedback on this and related issues. (It took me a few tries to format this correctly, my apologies.)

pdurbin commented 6 years ago

@setgree this is a very useful comment. Thanks. When I have a minute I'll try to make a screenshot from @leeper 's talk at the 2017 Dataverse Community Meeting that shows the file hierarchy expected by his academic discipline (political science).

Actually, I'll make the screenshot from https://osf.io/xfj5h/ now. Here it is:

screen shot 2018-03-08 at 12 54 05 pm

I think concrete examples like this help explain the need for this feature.

setgree commented 6 years ago

That looks great! My only comment is that I do not think that makefiles are data; even one that preprocesses/cleans data is code, I think. (I was at that conference, BTW.) The ambiguity of such things is a good reason to allow readers a lot of flexibility in how they choose to subdivide. Anyway, this is just to say that I am looking forward to seeing nested folders on Dataverse.

pdurbin commented 6 years ago

Oh! I certainly don't get a chance to meet everyone at the community meeting.

I guess one other thing I'll mention is that over in https://github.com/IQSS/dataverse-client-r/issues/18 I made a little noise about a feature we had in DVN 3 (the predecessor to Dataverse 4) that allowed users to upload a zip file that gets expanded by DVN and then have other users download a zip of the files, but it didn't work quite as well as I had hoped. Folders were renamed, from folder1/sub1 to folder1-sub1, for example. The zip files are not quite identical. Anyway, I thought I'd mention that I looked into this at least. To me, supporting zip upload and download would be a way to get some sort of support for file hierarchy, if it's implemented so that the uploaded and downloaded zip files are as close as possible to identical.

nmedeiro commented 6 years ago

Folder name and hierarchy are critical to reproducibility, so I'd love to see the option to retain zipped folders on upload. It may be that some users are uploading zips as a convenience, whereby extracting these in DVN is useful to them. For others, and especially for the work we do at Project TIER, retaining the zipped folder intact is essential. We've been using the double-zip hack to achieve this, but would love to see a zip retention option in future versions of the software.

Best, Norm


Norm Medeiros Associate Librarian of the College Coordinator for Collection Management and Metadata Services Haverford College 370 Lancaster Ave., Haverford, PA 19041 (610) 896-1173


pameyer commented 6 years ago

Somehow I've managed to miss commenting on this issue until now:

mdehollander commented 6 years ago

I would be interested in using folders for a Dataset, and for example retaining the folder structure when dragging a folder with sub-folders to the upload box.

mheppler commented 6 years ago

More discussion on maintaining folders in these IRC logs. http://irclog.iq.harvard.edu/dataverse/2018-04-20

I wonder if it is possible to retain the folder structure? ... I think Nextcloud, Owncloud, Google Drive, Dropbox, almost all cloud based storage systems can handle a hierarchical folder structure.

But when you have hundreds of files in many folders this is still not going to reduce the number of files in the root folder I guess...

mdehollander commented 6 years ago

But when you have hundreds of files in many folders this is still not going to reduce the number of files in the root folder I guess...

This comment refers to the case where the full path to a file is given, e.g.

"/Example/Hierarchy/Structure/README.docx".

mheppler commented 6 years ago

Thanks for the additional info, @mdehollander. Please feel free to add any other details of this feature specific to your use case here in this GitHub issue. We're still in the early stages of researching and designing potential solutions, and your feedback will help in that process.

TaniaSchlatter commented 5 years ago

In the short term, we are considering storing the file hierarchy information as metadata in the database, rather than keeping the files in a hierarchy on disk. This would allow users to view the hierarchy in a preview, alongside the flat file display in the table (adding filtering and sorting capabilities).

This doesn't address everything desired; however, we are interested in getting comments on this proposal. Here is a more detailed description:

Depositor drags a zip (not a double zip) into the dataset. The file will unzip and preserve the directory structure (see #3448). Individual files will be ingested (if necessary) and displayed just like any other file in a dataset – flat. Individual files can be downloaded. If all or any files are downloaded, the hierarchy will be re-created in a zip, matching the structure of the file that was uploaded in the first place.

A user wanting to access data selects “Download all” and downloads the original zip hierarchy. The system behavior is transparent to depositors.
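The round trip described above can be sketched with Python's zipfile module. This is a minimal illustration of the proposal, not Dataverse code: paths inside an uploaded zip are kept as metadata alongside flat files, and "Download all" rebuilds a zip with the same internal directory structure.

```python
# Minimal sketch (illustration only) of the proposed round trip:
# unzip on upload, keep paths as metadata, rebuild the zip on download.
import io
import zipfile

def unpack(zip_bytes):
    """Return {path_inside_zip: file_bytes}, preserving directory paths."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: zf.read(info)
                for info in zf.infolist() if not info.is_dir()}

def repack(files):
    """Rebuild a zip whose entries carry the stored paths."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path, data in files.items():
            zf.writestr(path, data)
    return buf.getvalue()

# upload -> flat storage with path metadata -> download
original = repack({"data/raw.csv": b"x\n", "code/run.py": b"print(1)\n"})
stored = unpack(original)                 # flat dict, paths preserved
assert unpack(repack(stored)) == stored   # download matches upload
```

The key property is that the path metadata alone is enough to make the downloaded zip structurally identical to the uploaded one.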

Add files

Move files: similar function to the above; provide a way to edit the file path.

Versioning: consider a move or addition as a metadata change and display a new version in the version table. File removed: same as any other file.

View hierarchy: show a "preview" of the hierarchical contents of the dataset.

Replace/unzip existing .zip: for existing double .zips, users can delete the original .zip and then upload a single .zip.

Download: for Stata files, add a toggle for original or ingested? Decide to show one? What about the download limit? How might that affect download? Can we leverage the S3/large/package file download UI?

pdurbin commented 5 years ago

@dpwrussell @pameyer @leeper @wddabc @jeisner @nmedeiro @christophergandrud @pdeffebach @setgree @mdehollander (and anyone else who is following this issue) good news! Dataverse 4.12 has support for organizing files into folders!

Can you all please try it out at https://demo.dataverse.org and give us feedback? Here are some screenshots that show how to introduce a folder hierarchy to your dataset's files:

Screen Shot 2019-04-05 at 7 26 47 AM Screen Shot 2019-04-05 at 7 27 03 AM

This feature is documented as "File Path" at http://guides.dataverse.org/en/4.12/user/dataset-management.html#file-path and here's a screenshot of the docs:

Screen Shot 2019-04-05 at 7 32 27 AM

Please just leave a comment below! Thanks!

mdehollander commented 5 years ago

@pdurbin Good that this works now with a zip file. Ideally I would like to see this also work with drag & drop, and to be able to browse through folders in the interface instead of seeing the folder name listed for each file. But hey, thanks for making this possible already!

pdurbin commented 5 years ago

@mdehollander great suggestion! Please feel free to open a new issue for this.

Everyone, while I'm writing I'll mention that I also wrote about the progress so far in this "Control over dataset file hierarchy + directory structure (new feature in Dataverse 4.12)" thread and feedback is welcome there as well: https://groups.google.com/d/msg/dataverse-community/8gn5pq0cVc0/MCMQAQHRAQAJ

If anyone wants to reply via Twitter, I would suggest piling on to one of these tweets:

We're currently working on "Enable the display of file hierarchy metadata on the dataset page" in #5572.

nmedeiro commented 5 years ago

Phil, this is great! It worked perfectly with the test dataset I uploaded to the demo site. Thanks very much to you and your team for getting this much-needed functionality into Dataverse. It's critical to the computational reproducibility we're teaching.

All the best, Norm



pdurbin commented 5 years ago

@nmedeiro fantastic! If you have a public sample zip file with a folder hierarchy that you give to your students that we can also use in our own testing, please let us know where to download it. 😄

Yes, I've been thinking that this is an important step toward more automated reproducibility. Code Ocean, for example, wants a "data" folder and "code" folder, as I wrote about at https://github.com/IQSS/dataverse/issues/4714#issuecomment-443344987 . Here's a screenshot:

49315649-70e6c100-f4bc-11e8-9c04-9034186e1571

nmedeiro commented 5 years ago

I loaded one to the demo site

https://doi.org/10.5072/FK2/86JG25

Feel free to use for testing.



pdurbin commented 5 years ago

@nmedeiro thanks! It's only 6.5 MB. Can I make it public by attaching it to this issue?

nmedeiro commented 5 years ago

Sure.



pdurbin commented 5 years ago

@nmedeiro thanks! Here it is: dataverse_files.zip

Inside the "Replication Documentation for Midlife Crisis Paper" directory are the following files:

Original-Data/importable-pew.dta
Original-Data/original-pew.sav
Original-Data/original-wdi.xlsx
Command-Files/5-analysis.do
Command-Files/4-data-appendix.do
Analysis-Data/country-analysis.dta
Analysis-Data/individual-analysis.dta
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.pdf
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.docx
djbrooke commented 5 years ago

Thanks all for the feedback as we evaluated and implemented this in Dataverse. Very exciting to see this feature added.

#5572 (view hierarchy) has been merged and will be included in the next release. Retaining file hierarchy for zips and editing hierarchy have been added in previous releases, so I'm closing this issue.

pdurbin commented 5 years ago

@nmedeiro here's how the files and folders look in the "tree" view we shipped in Dataverse 4.13:

Screen Shot 2019-05-10 at 5 57 45 AM

Thanks again!

nmedeiro commented 5 years ago

Beautiful! Thanks for your efforts with this. It's very important to our work with students.



setgree commented 8 months ago

Hi, so the canonical solution to this problem is to upload a zip file? I was trying to upload some files and folders recently -- which I've organized carefully in order to ensure reproducibility -- and I was unable to figure out how to upload the files in a nested way.

pdurbin commented 8 months ago

@setgree you can also use https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader or https://github.com/gdcc/python-dvuploader

Or if you're in control of the installation, you can install https://github.com/gdcc/dvwebloader

setgree commented 8 months ago

Thank you, I appreciate your quick response. This answer surprised me. IMHO:

1) reproducibility is a core goal/function of Dataverse;

2) good organizational hygiene is essential for computational reproducibility;

3) The tools you have shared are all, IMO, workarounds: they are not integrated into the default way that a person would use Dataverse -- which I understand to be A) uploading files via the browser interface, B) minting a DOI, and C) putting that DOI in the accompanying paper -- nor are they surfaced to a user trying to upload files.

Does the Dataverse team intend to integrate folder preservation into the default flow, or is the team happy with the way things stand?

(Perhaps this has been discussed elsewhere, my apologies if I missed it)

qqmyers commented 8 months ago

FWIW: there are potential security issues in allowing web apps to scan your disk for files. DVWebloader uses a de facto standard, supported by most browsers, to allow you to upload a whole directory after clicking OK in a browser-mandated popup. (Conversely, when the user specifies the exact files involved, as in our normal upload, no popup is required, but an app doesn't get to know the local path.) I'm sure that with the work on creating a new React front end for Dataverse, we'll be looking at supporting directories more cleanly, as possible. (Also note, w.r.t. surfacing: when DVWebloader is installed, the upload page shows an 'Upload a Folder' option, so it is visible.)

pdurbin commented 8 months ago

As @qqmyers says, DVWebloader already integrates folder preservation into the default flow, but it's an optional component that needs to be installed (see https://guides.dataverse.org/en/6.1/user/dataset-management.html#folder-upload ). If you're curious what it looks like, there are some screenshots in this pull request:

And yes, I agree that when we get to implementing file upload in the new frontend ( https://github.com/IQSS/dataverse-frontend ), we should strongly consider folder upload. Better reproducibility without workarounds. 100%.

@setgree all this is to say, yes, we are fully supportive of your ideas! 😄

As far as things being discussed elsewhere, a good place for discussion is https://groups.google.com/g/dataverse-community or https://chat.dataverse.org . You are very welcome to join and post!

jeisner commented 8 months ago

Just a remark that if Dataverse were being built today, it would undoubtedly be built on top of git. Obviously git already handles all of the concerns above, including directory structure and avoiding duplicate storage between similar versions, so reinventing all the functionality may be unnecessary.

To use git-Dataverse, a project would need to host its own git repo anywhere else. It could be a private repo. That repo would tag a small number of revisions as releases. git-Dataverse would then host a public, read-only "sparse mirror" that contained only the release revisions (and only the public parts of them) but was guaranteed to be archival, which is the point of Dataverse, I think? So a user of the sparse mirror could download a snapshot -- or could download the whole sparse mirror and see the diffs between releases.

I am not sure how to construct such a sparse mirror, which collapses the intermediate history between releases and removes private material from each release. However, https://github.com/newren/git-filter-repo looks like a possible starting point.

BTW, this is a feature that I could imagine github providing -- a kind of compromise between public and private repos -- but maybe they don't do this because they want to encourage open-source development, with fully public repos. Even if they did provide it, Dataverse may support bigger datasets and may have other features I don't know about.