IQSS / dataverse

Open source research data repository software
http://dataverse.org

Dataset Explore #5028

Closed djbrooke closed 5 years ago

djbrooke commented 6 years ago

Dataverse has the infrastructure to support community built, file-level external tools (TwoRavens, Data Explorer). Similar infrastructure should be added at the dataset level to support the Code Ocean integration funded by the recent Sloan grant. Additionally, supporting Binderverse (#4714), SBGrid's reprocessing tool (ping @pameyer), and other future tools should be considered.

pdurbin commented 6 years ago

This issue represents the initial step toward integration of Dataverse with Code Ocean (Sloan grant) and hopefully related tools such as Binder and Whole Tale (community efforts) so I'm going to leave a bit of a brain dump of recent happenings.

On Tuesday during our regular community call, the Code Ocean team shared their screen and we talked through the future integration at a pretty high level. Notes can be found at https://groups.google.com/d/msg/dataverse-community/HPLziKZbOAc/q_XEqyKEBwAJ

Yesterday @djbrooke and I called in to the first meeting of the Open Science Infrastructure Working Group, organized by @craig-willis from @whole-tale, to discuss a variety of computation and reproducibility topics with @aprilcs (Code Ocean), @donsizemore and @tlchristian (Odum), @choldgraf and @aculich (Binder #4714), @craig-willis, @Xarthisius and @amoeba (Whole Tale #5097), and others. Notes at https://docs.google.com/document/d/1bOVWBfhOiKGU2dYoHN_Pkpv5zPI6y819UISfkTqYpoQ/edit?usp=sharing

This morning I attended a fantastic Code Ocean workshop given by @aprilcs and I'd be happy to walk anyone through her tutorial which is captured very nicely in her slides ( http://bit.ly/harvard-oa-week ), example capsule ( https://bit.ly/2NanJLc ), and git repo ( https://github.com/aprilcs/candy_trade ). I'd also like to note that she features https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EZSJ1S as an example of a dataset that has an excellent README describing how to reproduce results using code and data in the dataset.

@mheppler and I discussed this issue this morning, and to me the next logical step is to decide where on the dataset page to put the button that brings the user from the dataset to Code Ocean, Whole Tale, Binder, etc., much like how the "Explore" button at the file level can bring you to multiple external tools such as Data Explorer and TwoRavens:

screen shot 2018-10-25 at 3 54 33 pm

pdurbin commented 5 years ago

@craig-willis gave us some great feedback: he assumed the external tool manifest would allow him to express to Dataverse how to compose the final URL. This is what he tried:

"queryParameters": [
  {
    "url": "{siteUrl}/api/access/datafile/{fileId}?key={apiToken}"
  }
]

Instead, for now, he'll have to compose the final URL on his side, much like how Data Explorer does. For example base_url=detailsURL.siteUrl+"/api/access/datafile/" at https://github.com/scholarsportal/Dataverse-Data-Explorer/blob/v1.0/assets/js/controllers/details.js#L481

For completeness, here's how the query parameters look in the Data Explorer external tool manifest:

"queryParameters": [
  {
    "fileId": "{fileId}"
  },
  {
    "siteUrl": "{siteUrl}"
  },
  {
    "key": "{apiToken}"
  }
]
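
To illustrate what "compose the final URL on his side" means, here is a minimal sketch in Python of a tool rebuilding the file access URL from the three query parameters above. The function name and the example launch URL are hypothetical, and treating `key` as optional is an assumption; only the parameter names and the `/api/access/datafile/` path come from this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_download_url(launch_url: str) -> str:
    """Compose the file access URL from the query parameters in a launch URL."""
    # parse_qs decodes percent-encoded values and returns a list per key.
    params = parse_qs(urlsplit(launch_url).query)
    site_url = params["siteUrl"][0].rstrip("/")
    file_id = params["fileId"][0]
    url = f"{site_url}/api/access/datafile/{file_id}"
    # Assumption: the API token may be absent, so only append it when present.
    if "key" in params:
        url += "?" + urlencode({"key": params["key"][0]})
    return url


if __name__ == "__main__":
    # Hypothetical launch URL that a tool might receive.
    print(build_download_url(
        "https://tool.example.org/?fileId=42"
        "&siteUrl=https%3A%2F%2Fdemo.dataverse.org&key=abc-123"
    ))
```

This is the same idea as the Data Explorer line linked above: the manifest only passes values, and the tool decides how they fit together.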

When we work on this issue #5028 we should at least consider this feedback.

pdurbin commented 5 years ago

This morning I attended a fantastic Code Ocean workshop

Just a heads up that the Code Ocean interface has completely changed (though the old interface I learned is still available) and is now based on @jupyterlab: https://medium.com/codeocean/new-jupyterlab-based-capsule-page-d618f34bc636

pdurbin commented 5 years ago

There's lots more discussion on Code Ocean on #4714 especially starting at https://github.com/IQSS/dataverse/issues/4714#issuecomment-442633005

pdurbin commented 5 years ago

There's lots of excitement on Twitter about Nature Scientific Data's "Call for submissions: Reproducible data processing" announcement: https://twitter.com/mercecrosas/status/1072899669074821122

pdurbin commented 5 years ago

I'm blocked on my dream of demo'ing the launching and execution of Jupyter Notebooks from Dataverse using Whole Tale until we implement external tools at the file level or until Whole Tale picks up this issue they just asked me to open as a workaround: https://github.com/whole-tale/whole-tale/issues/66

djbrooke commented 5 years ago

See https://github.com/IQSS/dataverse/issues/5028#issuecomment-512305360 for an updated list.

Before development begins we need to:

pdurbin commented 5 years ago

This is a good list.

I would add that we should check in with applications we have already integrated with that operate on all files in the dataset (Whole Tale, Mass Open Cloud) and those that we want to (Code Ocean, Binder) and make sure we're on the same page with regard to the URL the user will land on for the external tool.

Yesterday at https://github.com/jupyterhub/binderhub/issues/900#issuecomment-511558098 a Binder developer indicated that at the dataset level he is (preliminarily) hoping Dataverse users will be sent to URLs like this:

https://mybinder.org/v2/dataverse/10.7910/DVN/RLLL1V

(I picked that dataset because I see some Jupyter Notebooks in it and because it's from AJPS, so the content has been curated and carries the note "This dataset underwent an independent verification process that replicated the tables and figures in the primary article.")

If anyone is using the "Compute" button I don't know of it (the MOC installation seems to be down) but it would be good to revisit what the URLs look like that the user is sent to. If memory serves, there was a query parameter for the Swift container which looked something like a DOI. This is not implemented as an external tool but maybe it should be.

So, in summary, here's what I would add to the list:

djbrooke commented 5 years ago

Before development begins we need to:

Questions

pdurbin commented 5 years ago

I just created a spreadsheet called "External Tools" to help in classifying the state of the various tools we talked about this morning and that are otherwise on our radar: https://docs.google.com/spreadsheets/d/1OwIxpgpWVPDPSFwDsnPfk8ivNRXUIiaAlnbFccCnfsQ/edit?usp=sharing

Here's a screenshot:

Screen Shot 2019-07-17 at 1 04 57 PM

mercecrosas commented 5 years ago

@djbrooke I think of external tools and compute as having the same user workflow: you select one or more files and click "compute at ..." just as you would click "run/open in Code Ocean". It is the same concept for the user, who chooses the data to be opened and used in another tool or environment. However, in the backend, for compute the data files might be moved directly from storage to the compute resources (or not moved at all), possibly in a different way than when the files are stored locally.

scolapasta commented 5 years ago

Met at tech hours, we decided we would just add another column for "scope": dataset or file.

We will also clearly need to modify logic to not require file id for dataset tools.
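
As a rough sketch of that logic (in Python for illustration, not the actual Dataverse code), manifest validation could branch on the new "scope" value and require `{fileId}` only for file-level tools. The function name and error messages are hypothetical; the `scope`, `toolParameters`, and `queryParameters` keys are the ones mentioned in this thread.

```python
VALID_SCOPES = {"dataset", "file"}


def validate_manifest(manifest: dict) -> None:
    """Raise ValueError if the manifest is inconsistent with its scope."""
    scope = manifest.get("scope")
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {sorted(VALID_SCOPES)}")

    # Collect every placeholder value, e.g. "{fileId}" or "{siteUrl}".
    query_params = manifest.get("toolParameters", {}).get("queryParameters", [])
    values = {value for param in query_params for value in param.values()}

    # Only file-level tools are required to receive a file id.
    if scope == "file" and "{fileId}" not in values:
        raise ValueError("file-level tools must pass {fileId}")


if __name__ == "__main__":
    # A dataset-level tool without {fileId} passes validation.
    validate_manifest({
        "scope": "dataset",
        "toolParameters": {"queryParameters": [{"siteUrl": "{siteUrl}"}]},
    })
    print("ok")
```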

@pdurbin also brought up the idea of being able to send info not just as query parameters but in a more RESTful way (at the request of Whole Tale). We decided that we will work on this after we have the initial ability to support dataset tools in general.

pdurbin commented 5 years ago

Oh, I was advocating for it to be in scope for this issue to support putting DOIs and other values in the "path" like the https://mybinder.org/v2/dataverse/10.7910/DVN/RLLL1V example above. I thought we said we'd extend the "toolParameters" definition for this.
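
To make the "values in the path" idea concrete, here is a minimal Python sketch of expanding placeholders anywhere in a URL template rather than only in the query string. This is not how Dataverse works today; the function name, the `{doi}` placeholder, and the template string are all hypothetical, chosen only to reproduce the Binder URL shape quoted earlier.

```python
import re
from typing import Mapping


def expand_tool_url(template: str, values: Mapping[str, str]) -> str:
    """Replace each {name} placeholder in the template with its value."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {{{name}}}")
        return values[name]

    return re.sub(r"\{(\w+)\}", substitute, template)


if __name__ == "__main__":
    pid = "doi:10.7910/DVN/RLLL1V"
    # The Binder-style URL uses the bare identifier, without the "doi:" prefix.
    bare_doi = pid.partition(":")[2]
    print(expand_tool_url("https://mybinder.org/v2/dataverse/{doi}", {"doi": bare_doi}))
```

One design question this raises is whether the tool wants the full persistent ID or only the part after the protocol prefix, since the Binder example uses the latter.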

djbrooke commented 5 years ago

Thanks. The checklist above (https://github.com/IQSS/dataverse/issues/5028#issuecomment-512305360) is all finished. We'll bring this to Sprint Planning tomorrow.

djbrooke commented 5 years ago
mheppler commented 5 years ago

Wired up a placeholder Explore btn in the top action btn section of the dataset pg. Ready to wire up to the backend. Included some comments about the needed render logic and ui:repeat component.

Included the exploreTools.size()>1 render logic, currently used at the file level in file-download-button-fragment.xhtml, to show a dropdown vs. a single btn depending on how many tools are configured.

Screen Shot 2019-07-25 at 12 34 48 PM

pdurbin commented 5 years ago
"url": "{siteUrl}/api/access/datafile/{fileId}?key={apiToken}"

I just noticed above that @craig-willis is also interested in being able to modify the path. In the example above, he assumed he'd be able to add the file id to the path.

I just made pull request #6059 but I did not implement the ability to modify the path. I'm going on vacation next week so it was quicker to just get something working. If someone else wants to hack on the code further and add the ability to manipulate the path I think that would be great as it makes Dataverse's external tools much more flexible.

I did implement a new keyword so that you can pass the DOI or Handle to an external tool.
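
For tool makers wondering what to do with that value, here is a minimal Python sketch of turning a received persistent ID into a dataset lookup URL. It assumes the tool is also given the installation's site URL; the function name is hypothetical, while `/api/datasets/:persistentId/` is the native API endpoint for looking up a dataset by DOI or Handle.

```python
from urllib.parse import urlencode


def dataset_metadata_url(site_url: str, pid: str) -> str:
    """Build the native API URL for a dataset identified by DOI or Handle."""
    base = site_url.rstrip("/")
    # urlencode percent-encodes the ":" and "/" characters inside the PID.
    return f"{base}/api/datasets/:persistentId/?" + urlencode({"persistentId": pid})


if __name__ == "__main__":
    # Hypothetical site URL; the DOI is the test dataset mentioned in this thread.
    print(dataset_metadata_url("https://demo.dataverse.org", "doi:10.5072/FK2/92LNOI"))
```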

pdurbin commented 5 years ago

I deployed the code to https://ec2-52-91-77-202.compute-1.amazonaws.com/dataset.xhtml?persistentId=doi:10.5072/FK2/92LNOI and here's how it looks:

Screen Shot 2019-07-26 at 12 58 54 PM
pdurbin commented 5 years ago

Two things.

@mercecrosas @djbrooke and I had a nice meeting with Code Ocean yesterday and we invited them to think about external tools at the dataset level and comment here if they'd like. Afterward I knocked together this VERY PRELIMINARY diagram for one of the three main use cases we talked about:

codeocean-reproducibility

This morning I reached out to Renku to let them know that external tools at the dataset level are coming. @rokroskar just shared some thoughts about two potential use cases or user stories at https://github.com/SwissDataScienceCenter/renku-python/issues/536#issuecomment-519892328

pdurbin commented 5 years ago

Pull request #6059 was merged which means that explore tools at the dataset level will be available in the next release!

It looks like we forgot to close this issue (I forgot the "closes" syntax in the pull request) but I just assigned this to myself to remind me to reach out to all the external tool makers to let them know that they will need to update their manifest files to add "scope" to be compatible with future versions of Dataverse.

pdurbin commented 5 years ago

As I mentioned at standup today, I've been reaching out to all the external tool makers to let them know that the next release of Dataverse will require a "scope" in the external tool manifest. I'm closing this issue now that I've created the issues below: