djbrooke closed 1 month ago
I'm interested in learning what is different about this type of object compared with others, to help get at UI details and possibilities. What are the content items related to this type of file/object?
Also, what do users expect (if anything) about this type of object? Do they expect to see it as a unit like a package file, or like a container (folder) with contents?
Hi @TaniaSchlatter - thanks for talking about this briefly earlier today.
As an example, take this replication dataset:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/Y3XHB6
We'd want to provide the ability to deposit a capsule (data, code, prov, compute environment) of this dataset as one object so that replication tools such as Code Ocean can run it. At the same time, we'd also want it unzipped and displayed as it is now so that the individual files can take advantage of our external tool infrastructure and so that users that are perhaps just interested in some data files but not the analysis (or vice versa) can pick and choose. I liked your idea of adding the capsule view as a third view here and having some appropriate view once it's selected:
To answer your question about what I'd expect users to be able to do with it, I'd expect it to be downloaded by a user through the UI/API or some tool using our API.
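As a rough illustration of the "some tool using our API" case, here is a minimal Python sketch that builds a download URL for the example dataset above. The `/api/access/dataset/:persistentId/` path is an assumption based on the Dataverse Data Access API and may not exist on every installation or version. The helper name `dataset_download_url` is hypothetical. It would return the dataset's files as a zip, not a capsule, since no capsule endpoint exists yet.

```python
from urllib.parse import urlencode


def dataset_download_url(server_url: str, persistent_id: str) -> str:
    """Build a URL that asks a Dataverse server for all files of a dataset as a zip.

    Assumption: the server exposes /api/access/dataset/:persistentId/ .
    """
    base = server_url.rstrip("/")
    # urlencode percent-encodes the ":" and "/" in the DOI so it is safe in a query string.
    query = urlencode({"persistentId": persistent_id})
    return f"{base}/api/access/dataset/:persistentId/?{query}"


url = dataset_download_url("https://dataverse.harvard.edu", "doi:10.7910/DVN/Y3XHB6")
print(url)
```

A tool could pass this URL to any HTTP client. A capsule download would presumably look similar once we decide how capsules are stored.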
I'm checking with our hosting team at Harvard about how much storage cost we're racking up per month, to try to determine the implications of hosting two versions of each dataset.
A question from an architecture standpoint is whether we keep the full environment with each dataset or package up the appropriate environment at the time the capsule is requested. I do not know which is preferred from a preservation standpoint or from an efficiency standpoint. If we keep the full environment for each dataset (1000 copies of Stata 14 or whatever :)) there may be further storage cost issues. But if we keep each capsule together with the environment and everything else, we can possibly serve them more easily from S3.
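To make the storage trade-off concrete, here is a back-of-the-envelope sketch in Python. Every number in it is a hypothetical placeholder, not a measurement from our installation, and the function name `storage_gb` is made up for illustration. It compares bundling an environment into every capsule with keeping one copy of each distinct environment and assembling capsules on request.

```python
def storage_gb(n_datasets: int, data_gb: float, env_gb: float, distinct_envs: int):
    """Return (bundled, shared) storage in GB under two simplified models."""
    # Bundled: every capsule carries its own full copy of the environment.
    bundled = n_datasets * (data_gb + env_gb)
    # Shared: data is stored per dataset, each distinct environment is stored once.
    shared = n_datasets * data_gb + distinct_envs * env_gb
    return bundled, shared


# Hypothetical inputs: 1000 datasets, 0.5 GB of data each,
# 2 GB per environment image, 10 distinct environments overall.
bundled, shared = storage_gb(1000, 0.5, 2.0, 10)
print(f"bundled: {bundled} GB, shared: {shared} GB")
```

The sketch ignores the compute cost of packaging at request time and any preservation concerns, which still need to be weighed separately.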
Arbitrarily chosen examples of research capsules from CodeOcean:
User interface of capsules stored on Dockerhub:
I checked in with @xarthisius and here are three examples of capsules (which they call "tales") created with Whole Tale:
He also said,
"there's nothing special about Tales/Capsules published somewhere, apart from the fact that they have DOI.
you can go to https://dashboard.wholetale.org and export file as BagIt (zip) locally
the content is the same as the data we "publish" to external repository, i.e. that would be the thing that would land in Dataverse"
And to that I would add that from https://dev2.dataverse.org anyone is able to create a dataset and click the "Explore" button to play with it in Whole Tale and then click "Export as BagIt" like in the screenshot below.
This is what I got from my dataset when I exported it as BagIt:
$ unzip 5dc089a87bf5ca3bf549e3dd.zip
Archive: 5dc089a87bf5ca3bf549e3dd.zip
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/data/irclog.tsv
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/apt.txt
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/index.ipynb
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/install.R
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/README.md
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/runtime.txt
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/superuser_graph.ipynb
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/superuser_graph-monthly.ipynb
extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/index.ipynb
extracting: 5dc089a87bf5ca3bf549e3dd/run-local.sh
extracting: 5dc089a87bf5ca3bf549e3dd/data/LICENSE
extracting: 5dc089a87bf5ca3bf549e3dd/README.md
extracting: 5dc089a87bf5ca3bf549e3dd/bagit.txt
extracting: 5dc089a87bf5ca3bf549e3dd/bag-info.txt
extracting: 5dc089a87bf5ca3bf549e3dd/fetch.txt
extracting: 5dc089a87bf5ca3bf549e3dd/manifest-md5.txt
extracting: 5dc089a87bf5ca3bf549e3dd/manifest-sha256.txt
extracting: 5dc089a87bf5ca3bf549e3dd/metadata/environment.json
extracting: 5dc089a87bf5ca3bf549e3dd/metadata/manifest.json
extracting: 5dc089a87bf5ca3bf549e3dd/tagmanifest-md5.txt
extracting: 5dc089a87bf5ca3bf549e3dd/tagmanifest-sha256.txt
The files I uploaded to Dataverse are at https://github.com/pdurbin/dataverse-irc-metrics and they are shown in a folder called "dataverse-irc-metrics-master" above. To get them into Dataverse, I downloaded my GitHub repo as a zip and added it to my dataset.
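Since the export is a BagIt bag, the payload files can in principle be checked against `manifest-sha256.txt` (or `manifest-md5.txt`), where each line holds a checksum followed by a path relative to the bag root. Below is a minimal Python sketch of that check. It is not how Whole Tale or Dataverse validates bags, and the function name `verify_bag` is hypothetical. Files that are listed in the manifest but absent locally (for example, ones meant to be retrieved via `fetch.txt`) are reported as "missing" rather than treated as errors.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir, algorithm="sha256"):
    """Check payload files against manifest-<algorithm>.txt in an unzipped bag.

    Returns a list of (relative_path, status) tuples, where status is
    "ok", "mismatch", or "missing".
    """
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    results = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            results.append((rel_path, "missing"))
            continue
        digest = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        status = "ok" if digest == expected.lower() else "mismatch"
        results.append((rel_path, status))
    return results
```

For the export above, this would be called as `verify_bag("5dc089a87bf5ca3bf549e3dd")` from the directory where the zip was extracted. It reads each file fully into memory, which is fine for a small bag like this one but would need chunked reads for large data files.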
Meeting notes from 11/4 below. I see everyone has been completing action items as I've been out walking the dog :)
https://docs.google.com/document/d/1hF93XtIkacD6HE0koeoBtqk9FUfJhalhd6EtvD4nlnk/edit
My one item was to get some more details on Renku and I'm working on setting up a meeting this week. Generally, once we have examples of capsules and capsule-equivalents from around the community, we'll get back together.
https://github.com/whole-tale/whole-tale/issues/53 is the "publishing tales/capsules from Whole Tale to Dataverse" issue to track and there are lots of great screenshots in there.
"My one item was to get some more details on Renku and I'm working on setting up a meeting this week."
Here is where Renku is tracking this: https://github.com/SwissDataScienceCenter/renku-python/issues/668
There was so much great information, screenshots and chatter yesterday from @craig-willis in https://github.com/whole-tale/whole-tale/issues/53 that I suggested to him that we should consider scheduling a call with Whole Tale to get their take on depositing capsules into Dataverse.
@craig-willis maybe we should schedule the 3rd Open Science Infrastructure working group call? https://github.com/whole-tale/whole-tale/issues/61
Or maybe we could ask @KirstieJane if we could dedicate a future "Turing Way online Collaboration Cafe" to the topic of depositing capsules into data repositories? Here are the upcoming dates and times: https://github.com/alan-turing-institute/the-turing-way/blob/master/project_management/online-collaboration-cafe.md#dates-and-start-times . I did my best to introduce Dataverse to the Turing Way community about a month ago in https://www.youtube.com/watch?v=HIIJvDZ8pzw . The advantage of the collaboration cafe is that the meetings are recorded so I can very easily add them to DataverseTV: https://github.com/IQSS/dataverse-tv 😄 If the call is recorded, we'll have much more reach.
@pdurbin I'm happy to try to coordinate a call or participate in a related community call. I do now have access to Zoom for recording, but may not have the reach of the "Collaboration Cafe".
I've started to add images of capsules and notes from discussions to a presentation doc. If you have a representative image, you can add: https://docs.google.com/presentation/d/16Blkgb1ozjIijx-jv_3QtvgQhAlvHXDiH5ZjtK5WJrw/edit?usp=sharing
Document with a proposed approach, comments welcome: https://docs.google.com/document/d/1xG8xAcPSOe1xCWUlhj46AKrK4MAZbY6ed96yBKHCXiA/edit
@stain might have some constructive input from the RO-crate perspective, if not already involved.
Related:
2024/09/30: We are currently looking into frictionless packaging as part of GREI work, and we plan to support RO-Crate. We are therefore closing this as not planned, because other work is underway to address it.
We'll need to talk about the specific steps with @atrisovic when she gets here, but I'm putting in this placeholder for now. We'd like to evaluate how we can better support/display capsules in Dataverse, such as those used by: