FlukeAndFeather / openbiologging

https://flukeandfeather.github.io/openbiologging/

Hierarchical scoring #7

Open FlukeAndFeather opened 11 months ago

FlukeAndFeather commented 11 months ago

@DomRoche you suggested dividing our scoring rubric into multiple hierarchical levels, perhaps something like this:

Can you expand on that idea here? Thanks!

DomRoche commented 11 months ago

Sure!

Basically, some variables can be scored for all papers (e.g., presence/absence of a DAS, open data, and/or open code; taxa, discipline, device type, region, etc.), but others, such as data FAIRness and completeness (i.e., are all the data needed to reproduce the results in the paper present in the archived dataset?), are much more challenging to score and could be assessed for a representative sample of papers to keep things manageable.
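
For concreteness, here's a minimal sketch of how the two tiers could be laid out as a record per paper (field names are illustrative, not part of the rubric); tier 1 would be scored for every paper, tier 2 only for the subsample:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tier1Score:
    # Scored for every paper in the corpus (field names are placeholders).
    has_das: bool            # data availability statement present?
    open_data: bool          # data actually archived and accessible?
    open_code: bool          # analysis code shared?
    taxon: str               # e.g. "pinniped", "seabird"
    discipline: str          # e.g. "behavioural ecology", "biomechanics"
    device_type: str         # e.g. "GPS", "accelerometer", "TDR"
    region: str              # study location or author institution region

@dataclass
class Tier2Score:
    # Scored only for a representative subsample (more labour-intensive).
    fair_findable: bool
    fair_accessible: bool
    fair_interoperable: bool
    fair_reusable: bool
    data_complete: bool      # all data needed to reproduce the results present?

@dataclass
class PaperScore:
    doi: str
    tier1: Tier1Score
    tier2: Optional[Tier2Score] = None   # None unless the paper is in the subsample
```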

I think it would be interesting to see if data sharing practices vary among disciplines (e.g., comparative physiology, biomechanics, behavioural ecology, conservation, etc), study systems/taxa, and regions of the world (either study location [probably easier] or location of the author(s)' institution(s)). Could make for some nice data viz to go in the paper!
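
As one possible viz along those lines, a quick sketch (assuming a hypothetical scores.csv with one row per scored paper and `discipline` and 0/1 `open_data` columns; names are assumptions, not the rubric's):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per scored paper.
scores = pd.read_csv("scores.csv")

# Proportion of papers with open data, by discipline.
open_rate = scores.groupby("discipline")["open_data"].mean().sort_values()

ax = open_rate.plot.barh()
ax.set_xlabel("Proportion of papers with open data")
plt.tight_layout()
plt.show()
```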

RE "Were data shared according to FAIR principles?": this can be challenging to assess (although now I see that you refer to specific subsets of the FAIR principles in the rubric below). Findability and Accessibility are pretty straightforward. Interoperability less so, because it relies on file formats and the use of standards and shared ontologies, which will often be lacking in these disciplines - so, as we discussed, a lack of interoperability is often not intentional. Reusability is the most time-consuming principle to assess IMO… see Roche et al 2015 PLoS Biol and Roche et al 2022 PRSB. You could check for a license and limit yourself to that, but comprehensively assessing reusability requires examining whether metadata are present and complete, whether the file format is non-proprietary, etc., and can take hours. Checking metadata completeness is the most challenging part.
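
One way the per-principle checks could be organised is a simple checklist per dataset, roughly like the sketch below (the individual checks are illustrative, not a definitive operationalisation of FAIR):

```python
from dataclasses import dataclass

@dataclass
class FairChecklist:
    # Findability / Accessibility: relatively quick to assess.
    has_persistent_id: bool         # DOI or other persistent identifier
    in_searchable_repository: bool  # e.g. Dryad, Zenodo, Movebank
    retrievable_without_request: bool

    # Interoperability: often lacking, and usually not intentionally so.
    uses_standard_format: bool      # community standard / shared ontology
    non_proprietary_format: bool

    # Reusability: the most time-consuming principle to assess.
    has_license: bool
    metadata_present: bool
    metadata_complete: bool         # hardest check; may take hours per paper

    def summary(self) -> dict:
        """Collapse the individual checks into one boolean per FAIR principle."""
        return {
            "F": self.has_persistent_id and self.in_searchable_repository,
            "A": self.retrievable_without_request,
            "I": self.uses_standard_format and self.non_proprietary_format,
            "R": self.has_license and self.metadata_present and self.metadata_complete,
        }
```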

At the most basic level, a DAS is simply a statement about data accessibility, so "data will be archived on Dryad" counts as a DAS. The combination of DAS = 0/1 and open data = 0/1 is what is informative. For e.g., 30% of papers might have a DAS but only 20% have open data. I would also record if the data are embargoed (and the duration of the embargo) or under controlled access management (i.e. restricted access data). So 4 columns...
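
Those 4 columns could look something like this (column names and values are illustrative only); keeping DAS and open data as separate 0/1 columns is what lets the "30% have a DAS but only 20% have open data" kind of result fall straight out of the table:

```python
import pandas as pd

# Illustrative rows, not real papers.
papers = pd.DataFrame(
    {
        "doi": ["10.xxxx/a", "10.xxxx/b", "10.xxxx/c"],
        "has_das": [1, 1, 0],             # statement present, even "data will be archived on Dryad"
        "open_data": [1, 0, 0],           # data actually accessible
        "embargo_months": [0, 12, None],  # duration if embargoed
        "restricted_access": [0, 0, 0],   # controlled-access / managed data
    }
)

# DAS rate vs open-data rate across the corpus.
print(papers[["has_das", "open_data"]].mean())
```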

It's important to distinguish restricted access data from closed data. There are good reasons why researchers sometimes should not make data open - sensitive data should be restricted. For e.g., the locations of species that are endangered or at risk of poaching/disturbance, etc. should not be shared. The EU (and now others) use the model "as open as possible, as closed/safeguarded as necessary" when explaining their mandates for open data. For a useful paper on sensitive data, see Lennox et al 2020, "A Novel Framework to Protect Animal Data in a World of Ecosurveillance" (https://doi.org/10.1093/biosci/biaa035). It would make sense to record whether the data are open, under restricted/controlled access, or embargoed (and for how long).
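
An alternative to flat 0/1 columns would be a single categorical access-status field, which makes the open/embargoed/restricted/closed distinction explicit; a rough sketch (category names are assumptions):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class AccessStatus(Enum):
    OPEN = "open"             # archived and publicly retrievable
    EMBARGOED = "embargoed"   # will open after a stated period
    RESTRICTED = "restricted" # controlled access, e.g. sensitive species locations
    CLOSED = "closed"         # not shared and no managed-access route

@dataclass
class DataAccess:
    status: AccessStatus
    embargo_months: Optional[int] = None      # only meaningful when EMBARGOED
    restriction_reason: Optional[str] = None  # e.g. "endangered species locations"
```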

See Table 2 in Roche et al 2015 for a handy scoring system for data completeness and reusability: https://doi.org/10.1371/journal.pbio.1002295. It's been reused in psychology (https://link.springer.com/article/10.3758/s13428-020-01486-1). I think this could be done for a subset of papers (definitely not all) and would require an assessment of inter-rater reliability (i.e., repeatability).
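
The inter-rater reliability check could be as simple as having two people score the same subsample and computing Cohen's kappa; a minimal sketch (the category labels below are placeholders, not the exact Table 2 levels):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical double-scored subsample: one completeness/reusability
# category per paper, from each of two independent raters.
rater_1 = ["complete", "partial", "partial", "absent", "complete"]
rater_2 = ["complete", "partial", "absent",  "absent", "complete"]

# Cohen's kappa corrects raw agreement for chance agreement; values near 1
# indicate good inter-rater reliability, values near 0 chance-level agreement.
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")
```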

I think it's worth thinking about stats (descriptive and inferential) and data vis before collecting the data. This really helps to make sure the rubric doesn't miss anything. For e.g., what tables/figures do you envision for presenting the descriptive stats? What comparisons do we want to make - e.g., differences in DAS/presence of open data/archiving quality or FAIRness of open data among taxa, types of biologging, world regions, journals, etc? And what would the statistical models look like?
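
To make that concrete, one of several reasonable analysis sketches (assuming the same hypothetical scores.csv as above, with a 0/1 open_data outcome and discipline/device_type/region/year columns):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical scored dataset: one row per paper.
scores = pd.read_csv("scores.csv")

# Descriptive side: open-data rates cross-tabulated by discipline and region.
print(pd.crosstab(scores["discipline"], scores["region"],
                  values=scores["open_data"], aggfunc="mean"))

# Inferential side: a logistic regression of open data on discipline,
# device type, region, and publication year.
model = smf.logit("open_data ~ C(discipline) + C(device_type) + C(region) + year",
                  data=scores).fit()
print(model.summary())
```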