IQSS / dataverse

Open source research data repository software
http://dataverse.org

Capsulization and Packaging of Replication Objects in Dataverse #6085

Closed djbrooke closed 1 month ago

djbrooke commented 5 years ago

We'll need to talk about the specific steps with @atrisovic when she gets here, but I'm putting in this placeholder for now. We'd like to evaluate how we can better support/display capsules in Dataverse, such as those used by:

TaniaSchlatter commented 5 years ago

I'm interested in learning what is different about this type of object compared with others, to help get at UI details and possibilities. What are the content items related to this type of file/object?

Also, what do users expect (if anything) about this type of object? Do they expect to see it as a unit like a package file, or like a container (folder) with contents?

djbrooke commented 5 years ago

Hi @TaniaSchlatter - thanks for talking about this briefly earlier today.

As an example, take this replication dataset:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/Y3XHB6

We'd want to provide the ability to deposit a capsule (data, code, prov, compute environment) of this dataset as one object so that replication tools such as Code Ocean can run it. At the same time, we'd also want it unzipped and displayed as it is now so that the individual files can take advantage of our external tool infrastructure and so that users that are perhaps just interested in some data files but not the analysis (or vice versa) can pick and choose. I liked your idea of adding the capsule view as a third view here and having some appropriate view once it's selected:

Screen Shot 2019-10-08 at 10 20 35 PM

To answer your question about what I'd expect users to be able to do with it, I'd expect it to be downloaded by a user through the UI/API or some tool using our API.
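As a rough illustration of the "some tool using our API" case, the sketch below builds the URL a client could request to download every file in a dataset as one zip. It assumes the Data Access API endpoint `/api/access/dataset/:persistentId/` described in later Dataverse documentation; the function name is hypothetical, and nothing here performs the request itself.

```python
from urllib.parse import urlencode


def build_dataset_download_url(base_url, persistent_id):
    """Return the URL for downloading all files of a dataset as a zip.

    Assumes the Data Access API endpoint /api/access/dataset/:persistentId/;
    this helper is illustrative and not part of Dataverse.
    """
    # urlencode percent-encodes the ':' and '/' characters inside the DOI.
    query = urlencode({"persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/access/dataset/:persistentId/?{query}"


url = build_dataset_download_url(
    "https://dataverse.harvard.edu", "doi:10.7910/DVN/Y3XHB6"
)
print(url)
```

A client would then issue a GET against that URL, adding an API token header if the dataset is restricted or unpublished.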

I'm checking with our hosting team at Harvard about how much storage cost we're racking up a month to try and determine the implications of hosting two versions of each dataset.

A question from an architecture standpoint is whether we keep the full environment with each dataset or package up the appropriate environment at the time the capsule is requested. I do not know which is preferred from a preservation standpoint or from an efficiency standpoint. If we keep the full environment for each dataset (1000 copies of Stata 14 or whatever :)) there may be further storage cost issues. But if we keep each capsule together with the environment and everything else, we may be able to serve them from S3 more easily.
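To make the storage trade-off concrete, here is a back-of-the-envelope sketch comparing the two options. All sizes and counts are hypothetical placeholders, not measurements from any Dataverse installation, and both function names are made up for illustration.

```python
def bundled_storage_gb(n_datasets, data_gb, env_gb):
    """Every capsule carries its own copy of the compute environment."""
    return n_datasets * (data_gb + env_gb)


def shared_storage_gb(n_datasets, data_gb, env_gb, distinct_envs):
    """Environments are stored once each and attached when a capsule is requested."""
    return n_datasets * data_gb + distinct_envs * env_gb


# Hypothetical numbers: 1000 datasets of 0.5 GB each, all using one 2 GB environment.
bundled = bundled_storage_gb(1000, 0.5, 2.0)
shared = shared_storage_gb(1000, 0.5, 2.0, distinct_envs=1)
print(f"bundled: {bundled} GB, shared: {shared} GB")
```

The gap shrinks as the number of distinct environments approaches the number of datasets, and the on-demand option trades storage for packaging work at request time.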

atrisovic commented 5 years ago

Arbitrarily chosen examples of research capsules from CodeOcean:

atrisovic commented 5 years ago

User interface of capsules stored on Dockerhub:

https://hub.docker.com/_/r-base

pdurbin commented 5 years ago

I checked in with @xarthisius and here are three examples of capsules (which they call "tales") created with Whole Tale:

He also said,

"there's nothing special about Tales/Capsules published somewhere, apart from the fact that they have DOI.

you can go to https://dashboard.wholetale.org and export file as BagIt (zip) locally

the content is the same as the data we "publish" to external repository, i.e. that would be the thing that would land in Dataverse"

And to that I would add that from https://dev2.dataverse.org anyone is able to create a dataset and click the "Explore" button to play with it in Whole Tale and then click "Export as BagIt" like in the screenshot below.

Screen Shot 2019-11-04 at 3 46 54 PM

This is what I got from my dataset when I exported it as BagIt:

$ unzip 5dc089a87bf5ca3bf549e3dd.zip 
Archive:  5dc089a87bf5ca3bf549e3dd.zip
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/data/irclog.tsv  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/apt.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/index.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/install.R  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/README.md  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/runtime.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/superuser_graph.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/superuser_graph-monthly.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/index.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/run-local.sh  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/LICENSE  
 extracting: 5dc089a87bf5ca3bf549e3dd/README.md  
 extracting: 5dc089a87bf5ca3bf549e3dd/bagit.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/bag-info.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/fetch.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/manifest-md5.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/manifest-sha256.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/metadata/environment.json  
 extracting: 5dc089a87bf5ca3bf549e3dd/metadata/manifest.json  
 extracting: 5dc089a87bf5ca3bf549e3dd/tagmanifest-md5.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/tagmanifest-sha256.txt  

The files I uploaded to Dataverse are at https://github.com/pdurbin/dataverse-irc-metrics and they are shown in a folder called "dataverse-irc-metrics-master" above. To get them into Dataverse, I downloaded my GitHub repo as a zip and added it to my dataset.
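The `manifest-sha256.txt` file in the listing above is what lets a repository check that a deposited bag arrived intact. Below is a minimal sketch of such a check using only the Python standard library. It assumes the usual BagIt manifest layout of one `<checksum> <relative path>` pair per line; the function name is hypothetical, and this is not how Whole Tale or Dataverse actually validate bags.

```python
import hashlib
from pathlib import Path


def verify_bag_manifest(bag_dir, algorithm="sha256"):
    """Check payload files against manifest-<algorithm>.txt.

    Returns a dict mapping each listed path to "ok", "mismatch", or "missing".
    """
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    results = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            # Not necessarily an error: the file may be listed in fetch.txt
            # and meant to be retrieved from a remote location.
            results[rel_path] = "missing"
            continue
        digest = hashlib.new(algorithm)
        with target.open("rb") as fh:
            # Hash in 1 MiB chunks so large data files are not read into memory at once.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        matches = digest.hexdigest() == expected.lower()
        results[rel_path] = "ok" if matches else "mismatch"
    return results
```

Calling `verify_bag_manifest("5dc089a87bf5ca3bf549e3dd")` on the unzipped export would report a status for every path in the manifest. Because this bag ships a `fetch.txt`, some entries may show as "missing" until the referenced files are fetched.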

djbrooke commented 5 years ago

Meeting notes from 11/4 below. I see everyone has been completing action items as I've been out walking the dog :)

https://docs.google.com/document/d/1hF93XtIkacD6HE0koeoBtqk9FUfJhalhd6EtvD4nlnk/edit

My one item was to get some more details on Renku and I'm working on setting up a meeting this week. Generally, once we have examples of capsules and capsule-equivalents from around the community, we'll get back together.

pdurbin commented 5 years ago

https://github.com/whole-tale/whole-tale/issues/53 is the "publishing tales/capsules from Whole Tale to Dataverse" issue to track and there are lots of great screenshots in there.

pdurbin commented 5 years ago

My one item was to get some more details on Renku and I'm working on setting up a meeting this week.

Here is where Renku is tracking this: https://github.com/SwissDataScienceCenter/renku-python/issues/668

pdurbin commented 5 years ago

There was so much great information, screenshots and chatter yesterday from @craig-willis in https://github.com/whole-tale/whole-tale/issues/53 that I suggested to him that we should consider scheduling a call with Whole Tale to get their take on depositing capsules into Dataverse.

@craig-willis maybe we should schedule the 3rd Open Science Infrastructure working group call? https://github.com/whole-tale/whole-tale/issues/61

Or maybe we could ask @KirstieJane if we could dedicate a future "Turing Way online Collaboration Cafe" to the topic of depositing capsules into data repositories? Here are the upcoming dates and times: https://github.com/alan-turing-institute/the-turing-way/blob/master/project_management/online-collaboration-cafe.md#dates-and-start-times . I did my best to introduce Dataverse to the Turing Way community about a month ago in https://www.youtube.com/watch?v=HIIJvDZ8pzw . The advantage of the collaboration cafe is that the meetings are recorded so I can very easily add them to DataverseTV: https://github.com/IQSS/dataverse-tv 😄 If the call is recorded, we'll have much more reach.

craig-willis commented 5 years ago

@pdurbin I'm happy to try to coordinate a call or participate in a related community call. I do now have access to Zoom for recording, but may not have the reach of the "Collaboration Cafe".

TaniaSchlatter commented 4 years ago

I've started to add images of capsules and notes from discussions to a presentation doc. If you have a representative image, you can add: https://docs.google.com/presentation/d/16Blkgb1ozjIijx-jv_3QtvgQhAlvHXDiH5ZjtK5WJrw/edit?usp=sharing

djbrooke commented 4 years ago

Document with a proposed approach, comments welcome: https://docs.google.com/document/d/1xG8xAcPSOe1xCWUlhj46AKrK4MAZbY6ed96yBKHCXiA/edit

craig-willis commented 4 years ago

@stain might have some constructive input from the RO-crate perspective, if not already involved.

pdurbin commented 2 years ago

Related:

cmbz commented 1 month ago

2024/09/30: We are currently looking into frictionless packaging as part of GREI work, and we plan to support RO-Crate. We are therefore closing this as not planned, because other work is underway to address it.