Closed LiNk-NY closed 3 years ago
Why would this be an error?
I actually do this in some of my packages as it can help me access data version controlled in a repo outside the R package.
GitHub data can disappear. We have no control over when an individual deletes it. Data should live in a stable, persistent location. This is the same reasoning behind only allowing Bioconductor package dependencies to come from CRAN or Bioconductor.
What if you are the author of both repos and control them both?
Bioconductor has a copy of the software repo. We wouldn't have a copy of the data. It's a matter of persistence. Maybe an error is too strong and this should be a warning. It is definitely related to #67, where maybe the hubs should be recommended.
As a more detailed example, see https://github.com/leekgroup/recount/blob/master/R/all_metadata.R#L44-L50 and https://github.com/leekgroup/recount/blob/master/R/add_metadata.R#L68-L77, which are also related to https://github.com/Bioconductor/BiocCheck/issues/67
This is coming up in https://github.com/waldronlab/bugsigdbr/issues/19 and IMO, GitHub is the best solution for our use case and ExperimentHub doesn't meet our requirements. The bugsigdb database changes continuously and we want to allow users to reproducibly access data snapshots in two ways:
1) as of a specific date: we have a GitHub Action that updates a data-only GH repo daily, so that it is available as of any date or commit
2) as of a specific Bioconductor release, posted on Zenodo.org as a release of github.com/waldronlab/BugSigDBExports
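To make the "as of any date or commit" point concrete, here is a minimal sketch of the difference between a moving and a pinned raw URL. It is written in Python only to stay language-neutral; it is not code from bugsigdbr. The `raw_url` helper, the `main` branch name, the `full_dump.csv` file name, and the commit SHA are all assumptions for illustration.

```python
from urllib.parse import quote

RAW_HOST = "https://raw.githubusercontent.com"


def raw_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL for `path` at `ref`.

    `ref` may be a branch, a tag, or a full commit SHA. Only a commit SHA
    identifies fixed content; a branch name follows whatever was pushed last.
    """
    # Percent-encode the file path but keep "/" as the directory separator.
    return f"{RAW_HOST}/{owner}/{repo}/{ref}/{quote(path)}"


# Hypothetical values, for illustration only.
OWNER, REPO = "waldronlab", "BugSigDBExports"
DATA_FILE = "full_dump.csv"                                # assumed file name
BRANCH = "main"                                            # assumed branch name
SNAPSHOT_SHA = "0123456789abcdef0123456789abcdef01234567"  # placeholder, not a real commit

moving_url = raw_url(OWNER, REPO, BRANCH, DATA_FILE)        # content changes daily
pinned_url = raw_url(OWNER, REPO, SNAPSHOT_SHA, DATA_FILE)  # content fixed at that commit

print(moving_url)
print(pinned_url)
```

Pinning to a commit fixes the content for as long as the repository exists. It does not address the persistence concern raised above, since a deleted repository takes every commit with it.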
Is there evidence behind the concern that GitHub data can just disappear at any time? I have yet to hear of an instance of a GitHub repo just disappearing unless it is deleted by the owner, but that is the same for most URLs. It sounds like a very hypothetical concern whose implementation introduces immediate practical downsides and limitations. I think our approach is also more FAIR than the current ExperimentHub approach, because we store the data in plain-text CSV files that are interoperable on any platform and findable, e.g. from a Google Dataset search. ExperimentHub could be a (in this case less functional) substitute for our Zenodo.org releases, but it can't currently provide a solution for our desired daily snapshot feature. The only non-GitHub approaches I know of would be to move to a less robust on-campus server or to commercial cloud services that would become unavailable as soon as a bill isn't paid. Both add an extra step to the workflow while seeming to me equivalent or inferior to GitHub.
Zenodo is considered an acceptable option in this case. We don't discourage the use of Zenodo or package data access through it. We would still recommend against storing data on GitHub, not only because users can delete data, which would make any package and previous analysis non-reproducible, but also as a level of control against malicious and harmful downloads. The thought is also that using trusted servers for data access would make it safer from that perspective.
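One partial mitigation for the download-integrity concern, wherever the file is hosted, is to record a checksum of the data when the package is released and verify it after every download. The sketch below is a generic illustration in Python, not something BiocCheck or the hubs do; `sha256_hex` and `verify_download` are hypothetical helper names.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, expected_sha256: str) -> bytes:
    """Return `data` unchanged if its SHA-256 matches, otherwise raise.

    `expected_sha256` is assumed to have been recorded by the package
    author when the snapshot was published.
    """
    actual = sha256_hex(data)
    if actual != expected_sha256.strip().lower():
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )
    return data


# Usage with in-memory bytes standing in for a downloaded file.
payload = b"id,value\n1,2\n"
recorded = sha256_hex(payload)  # would normally be stored with the package

verified = verify_download(payload, recorded)
print(len(verified), "bytes verified")
```

A checksum detects changed or tampered content, but it cannot bring back a file that has been deleted, so it complements persistent hosting and does not replace it.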
https://raw.githubusercontent.com/ or other flagged external sources