benmarwick / rrtools

rrtools: Tools for Writing Reproducible Research in R

use_data_repository #15

Closed benmarwick closed 4 years ago

benmarwick commented 7 years ago

There might be a place for a use_zenodo() function: https://github.com/ropensci/zenodo/blob/master/README.md

MartinHinz commented 7 years ago

Great idea! The way would then be:

Because this way we would not need any additional dependencies on the local computer, but we would be dependent on dockerhub to process the files for us!?

dakni commented 7 years ago

indeed!

You can login to Zenodo using github; an access token is easily generated under "applications".

@MartinHinz why use the file from dockerhub? When the docker container is successfully built on travis, one can use it directly. Or did I misunderstand some part of the workflow? This way one does not need to store additional environment variables on travis.

MartinHinz commented 7 years ago

Right, damn, correct. So it is accessed from travis, not from dockerhub. My mistake.

dakni commented 7 years ago

Here is a nice blog post for Zenodo and github [though without API]: http://computationalproteomic.blogspot.de/2014/08/making-your-code-citable.html

--> basically there is more to archiving than just uploading. Since we want a DOI etc., perhaps one should write up detailed instructions on how to connect to Zenodo and create a function that builds a container ready for upload to Zenodo?

btw: the ropensci zenodo package throws an error when trying to create a repo [my zenodo token is picked up by default]:

zen_create("test")
Error in handle_url(handle, url, ...) : 
  Must specify at least one of url or handle

MartinHinz commented 7 years ago

Ignore my last post, still not fully familiar with the docker concept, I suppose

benmarwick commented 7 years ago

I was imagining this being an infrequent, deliberate, action, not part of the continuous integration cycle.

For example, when you submit your article for peer review, you use_zenodo() to create a json metadata file, create a repo on https://www.zenodo.org/, and push the whole project to that repo. Then you get a snapshot of the project at that moment, with a DOI to put in the text of the paper.

Then, after peer review and your paper is accepted :tada:, you use_zenodo() again to update the repo with the final file set. I think zenodo has versioned repos, so you can have the same DOI for the repo, but different hashes for each version.
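
For illustration, a minimal sketch of writing such a metadata file from R, assuming jsonlite and placeholder field values (none of this is an existing rrtools function):

# Minimal .zenodo.json sketch -- placeholder values, not an rrtools function
library(jsonlite)

metadata <- list(
  title       = "Research compendium for 'My paper title'",   # placeholder
  upload_type = "software",
  description = "Code and data to reproduce the analysis in the paper.",
  creators    = list(list(name = "Family, Given")),            # placeholder
  keywords    = c("reproducible research", "R"),
  license     = "MIT"
)

write_json(metadata, ".zenodo.json", auto_unbox = TRUE, pretty = TRUE)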

MartinHinz commented 7 years ago

From what I read, Zenodo archiving something from Github is tied to making a release. And once the connection exists, Zenodo then makes a snapshot of every released version. So is this our way to go? See e.g. https://github.com/ropensci/RNeXML/issues/96

benmarwick commented 7 years ago

I think we can do it directly from our console to zenodo. But it looks like the zenodo package actually does not have any functions we can use yet (https://github.com/ropensci/zenodo/issues/14). So let's put this on hold until that pkg gets a bit more love.

nevrome commented 7 years ago

Just a minor comment independent of the actual implementation: In my opinion functions like use_zenodo() should be guarded by a set of control questions, like devtools::release(). It's a pretty big thing to release a paper on zenodo. A set of questions can prevent accidental and immature releases.
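
For illustration, such a confirmation gate could be as simple as the following sketch (confirm_release() and the questions are made up for the example; nothing here exists in rrtools or the zenodo pkg):

# Sketch of a release guard in the spirit of devtools::release();
# the function name and questions are hypothetical
confirm_release <- function() {
  questions <- c(
    "Have you knitted the final version of the manuscript?",
    "Have all co-authors approved this release?",
    "Are the title, authors and license in the metadata up to date?"
  )
  for (q in questions) {
    if (utils::menu(c("Yes", "No"), title = q) != 1) {
      stop("Release cancelled.", call. = FALSE)
    }
  }
  invisible(TRUE)
}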

benmarwick commented 7 years ago

Yes, that is an excellent suggestion, I agree. I guess that @karthik has something like that already in mind for zenodo::zen_file_publish

MartinHinz commented 7 years ago

Wouldn't the most convenient way in our case be a two-step process, with

benmarwick commented 7 years ago

Yes, that could work. It seems more natural to me to connect from the local repo on my computer directly to zenodo, without counting on GitHub in the middle. That would be simpler and more flexible, to me at least. But let's see what direction they take with the zenodo pkg as it develops further.

MartinHinz commented 7 years ago

I give you that. The whole thing is centered around Github, so it seemed natural to me to use the existing link Zenodo <-> Github to make it happen. But thinking about it, you could at least use use_compendium, use_mit_license, use_readme_rmd and use_analysis, and also use_testthat, without any Github integration.

So surely you are right and we should not make this dependent on a Github repo being in existence.

karthik commented 7 years ago

The whole thing is centered around Github

And the plan is to take advantage of all of that. The most recent project I worked on with Kirill Muller is travis and tic (both of which are available as beta releases on ropenscilabs). In short, you can set up a recipe for Zenodo (leveraging the zenodo package) to create a release for software and data at whatever interval (with versioning support). Once you've set it up and authorized a token, it should just run for any project.

benmarwick commented 7 years ago

Thanks Karthik! Are there any of these recipes around for us to take a look at?

We should also consider here:

karthik commented 7 years ago

Hi @benmarwick! There are several recipes to look at now, but none are related to data in particular. But it would be the same logic, and other than S3 class support for a common package (to support data deposition), all the functionality will come from individual packages.

A few recipes to consider (a generic sketch of the shape of a tic.R recipe follows the list):

A tic file for automatic pkgdown docs: https://github.com/krlmlr/tic.package

Automatically deploying to drat: https://github.com/krlmlr/tic.drat

An rmarkdown site: https://github.com/krlmlr/tic.website

An automatic bookdown book: https://github.com/krlmlr/tic.bookdown
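
For orientation, the general shape of such a tic.R recipe looks roughly like this (the steps shown are ordinary tic steps for a pkgdown deploy; a Zenodo deposit step would have to be written as a custom step, so none is shown here):

# General shape of a tic.R recipe, here for a pkgdown site deploy
get_stage("install") %>%
  add_step(step_install_deps())

get_stage("deploy") %>%
  add_step(step_build_pkgdown()) %>%
  add_step(step_push_deploy(path = "docs", branch = "gh-pages"))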

Great idea to add figshare and dataverse. figshare might be a challenge because they have never prioritized their API and are solely focused on enterprise customers. But we can try.

benmarwick commented 7 years ago

Thanks again, those are useful to see.

Currently I think we want to be pushing our compendium to a data repo independently of our actions on GitHub and Travis.

As @nevrome notes above, making a deposit to a data repo should be a deliberate, infrequent action in the life of a project, so we want it to be separate from the push-to-github-trigger-travis process. I'm imagining this just happens 1-3 times in the life of a project.

My guess is that we could have something like a use_data_repository(what = ".", repo = c("figshare", "dataverse", "osf", "zenodo")) function, which triggers some follow-through steps in the console for the user to confirm some details and provide information to create metadata, before pushing to the data repository. Depending on the repo chosen by the user, we then get the other pkgs to do the heavy lifting with the repo API. A bare-bones skeleton is sketched below.
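
Purely illustrative, this is the kind of skeleton I have in mind (the function does not exist, and the client packages named in the comments are just the obvious candidates):

# Illustrative skeleton only -- not implemented in rrtools
use_data_repository <- function(what = ".",
                                repo = c("figshare", "dataverse", "osf", "zenodo")) {
  repo <- match.arg(repo)

  # 1. interactive confirmation, along the lines of devtools::release()
  # 2. collect and confirm metadata (title, creators, license) from the user
  # 3. hand off to the relevant client package to create the deposit
  switch(repo,
    zenodo    = message("would deposit ", what, " via the zenodo pkg"),
    osf       = message("would deposit ", what, " via the osfr pkg"),
    figshare  = message("would deposit ", what, " via the rfigshare pkg"),
    dataverse = message("would deposit ", what, " via the dataverse pkg")
  )
  invisible(repo)
}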

What does everyone think?

benmarwick commented 6 years ago

I was just reminded by @steko about https://frictionlessdata.io/ and pkgs https://github.com/ropenscilabs/datapkg and https://github.com/christophergandrud/dpmr

These look neat, though I've not seen them in use in the wild anywhere, and their download stats are modest.

Has anyone else come across these in the research literature? Worth mentioning in the readme?

nevrome commented 6 years ago

Never seen this. But it seems to be a thing. So why not?

benmarwick commented 6 years ago

Bookmarking this rOpenSci discussion that just appeared: https://github.com/ropenscilabs/doidata/issues/1 Seems like they might be about to develop a pkg that will answer many of our needs here. Some discussion on twitter at https://twitter.com/noamross/status/948340525492555776

Hopefully that pkg will contain a function to deposit data and obtain a DOI (to a variety of repositories), although I guess that task might be much more complex than getting data using a DOI.

noamross commented 6 years ago

@benmarwick Right now we're just thinking about downloading the data given a DOI

januz commented 5 years ago

Has there been progress on this front? What are your current recommendations for getting a DOI associated with the state of a compendium when the corresponding manuscript was submitted/revised/published? Thanks!

benmarwick commented 5 years ago

We haven't seen any recent developments that have made automating this step simpler or more obvious to implement. There is so much variation in current practice that it's hard to know what defaults make the most sense.

My current recommendations are to use a hook provided by the data repository service (e.g. Zenodo, OSF and Figshare have this) to connect to the GitHub repo with the compendium, and then snapshot a version of the GH repo on the data repo at key points (in OSF this is called 'registering', or freezing a version of the repo). I usually snapshot the repo at the point of submission to the journal, and get the DOI of the repo to include in the text of the manuscript. Then snapshot it again after peer review, and again after final acceptance. The DOI stays the same throughout this process, on OSF at least, and any user can see that the data repository has multiple versions and can browse them easily. The repo versions can be tagged with keywords to indicate what part of the process they relate to.

This all happens outside of R. And for me, at least, it's something I do infrequently, just a few times per year. So it's not urgent for me to automate or highly streamline these steps at the moment. But I'm keen to hear how others imagine these steps could be incorporated into a function!
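
For anyone who does want to script part of it, here is a rough sketch with the osfr pkg, assuming a personal access token in the OSF_PAT environment variable and placeholder file names (the registering/freezing step itself still happens in the OSF web interface):

# Rough sketch using osfr -- project title and files are placeholders
library(osfr)

osf_auth()  # picks up the token from the OSF_PAT environment variable

project <- osf_create_project(title = "Research compendium for 'My paper title'")
osf_upload(project, path = c("analysis", "DESCRIPTION", "README.md"))

# The frozen, DOI-bearing snapshot (a 'registration') is then created
# manually in the OSF web interface at the key points described above.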

januz commented 5 years ago

@benmarwick Thank you so much for your detailed explanations. I agree that this process is probably something that can/should be done deliberately and "manually".

benmarwick commented 5 years ago

Yes, for now manual handling seems like the best option for this step, at least as far as I can see. I'm curious to see what might pop up in the future to change my mind!

januz commented 5 years ago

I usually snapshot the repo at the point of submission to the journal, and get the DOI of the repo to include in the text of the manuscript. Then snapshot it again after peer review, and again after final acceptance. The DOI stays the same throughout this process, on OSF at least, and any user can see that the data repository has multiple versions and can browse them easily

@benmarwick Sorry to follow up so late, but I just tested out registration/freezing of an OSF project with an associated GitHub repository. From what I can see, the DOI of different registrations is not the same. Instead, the project has a fixed DOI and each registration has a different one.

I understood you to mean that you publish the DOI of the first registration you create. Or did you mean that you share the project's DOI (which stays the same) and people can then navigate to the "Registrations" tab to see the different registrations/snapshots that exist for the project?

Thanks!

benmarwick commented 4 years ago

Let's include the approaches discussed here as an informational final step in the readme, to suggest how the user can archive their compendium on a data repo of their choosing, cf. https://github.com/benmarwick/rrtools/issues/56