Innovation-Sprint-2021 / navigating-scientific-code

"Understanding Repository Structures Through Charts: A GitHub Navigation Tool"
Apache License 2.0

Documentation checklist for scientific code repositories #2

Open krassowski opened 3 years ago

krassowski commented 3 years ago

I wonder if anyone has come across a checklist describing how to prepare a code repository before sharing it in a paper? I know of the "Ten simple rules for documenting scientific software" list, which touches on some good practices that could reduce the problem of being unable to work out what is happening in others' code. However, it is oriented towards re-usable software, while many of the worst examples of repositories accompany papers where the author does not expect their code to be re-used (i.e. it is there only to document that they did an analysis, performed a simulation, etc.).

Certainly https://the-turing-way.netlify.app/ has put a lot of effort into making research reproducible and encouraging minimal reasonable practices, such as file naming, linting and, importantly, repository organization.

Do you know of other resources targeted at researchers sharing their small software/analysis code which would encourage best practices such as:

mstimberg commented 3 years ago

The codecheck project has quite a formalized process with a so-called manifest file: https://codecheck.org.uk/guide/community-workflow#requirements

Here's a checklist for machine learning papers: https://medium.com/paperswithcode/ml-code-completeness-checklist-e9127b168501

And here's a guide specific to Python: https://docs.python-guide.org/writing/structure/

krassowski commented 3 years ago

Thank you! The Papers with Code checklist and template are amazing; they really make it easy to find the relevant files and re-run the code. The codecheck project reminded me of three software journals/collections:

krassowski commented 3 years ago

The important thing is that the repositories shared via ROpenSci/pyOpenSci/JOSS are likely a bit better built (aimed at re-use by others, as this is how those get citations) and are usually used by researchers with sufficient background to get by and find their way around a weirdly/inconveniently structured repo (e.g. where GitHub search fails they can easily clone and grep).

The bigger problem is repositories accompanying analysis papers, where the target audience is a PhD student or postdoc who often knows only one (statistical) programming language, and maybe only well enough to do their own analysis, but not necessarily well enough to understand someone else's code or to wrap their head around current software development practices and tooling (I know a lot of excellent scientists who are like that).

There is also a group of repositories which lies in between re-usable code and analysis code. I call it MATLAB, but really it covers other languages too; these are specialised languages whose code is unlikely to be re-used by researchers outside of departments that have the relevant licence and expertise, yet the authors seem to think that many researchers will re-use it (but oftentimes do not document it sufficiently). I believe there is a small group of methodologists who will in fact use that (MATLAB or other) code, and a larger group who just want to use it as a basis for re-implementing or understanding the algorithm by analysing the code (for which MATLAB is often a good choice!).