data-science-hub / data-science-hub.github.io

Data publishing for data from third-parties that don't allow data sharing #33

Open tkuhn opened 7 years ago

tkuhn commented 7 years ago

How should we deal with studies that use data from third parties like Twitter that don't allow data sharing? According to the PLOS guidelines (http://journals.plos.org/plosone/s/data-availability), which we are following for now, it seems that such studies couldn't be published (though there are recent PLOS ONE articles on Twitter studies...). The publication of aggregated, post-processed data (e.g. data points in a plot) should always be possible though. So it seems we have the following options:

  1. Allow exceptions to data availability requirements for third-party data whose terms of service don't allow data sharing
  2. Interpret data availability requirements loosely, so that releasing only aggregated, post-processed data (e.g. data points in a plot) counts as compliant
  3. Don't allow exceptions and apply data requirements strictly: we won't be able to publish Twitter studies in this case
  4. Are there other options?

Which one should we follow? I am undecided...

micheldumontier commented 7 years ago

I believe that we need to accommodate the fact that not all data can be shared publicly. For instance, we use anonymized patient data that we are prohibited from sharing, and there are strict restrictions on their availability beyond the approved users, who are in some cases solely members of the medical center. In such cases reproducibility can only be achieved through collaboration, but this cannot be guaranteed owing to the burden that it places on individual investigators. These are real problems that cannot be ignored when it comes to reproducibility. Should we exclude such studies? I don't think so. When we drafted the FAIR principles [1], we specifically recognized that the essential aspect here is that the mechanism by which data can be accessed must simply be made explicit. Therefore, the right solution to this complex real-world problem is to require sufficient documentation describing the proper access mechanism, if any. However, I would argue that if the reviewers raise serious doubts regarding the validity of results and the data cannot be made available to the reviewers, then these are grounds for rejection, where agreed by both the managing editor and the assigned editor in chief.

[1] http://www.nature.com/articles/sdata201618