General checks
[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Documentation
[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Code quality
[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
Reproducibility
[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?
Analysis report
[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance of this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?
Estimated hours spent reviewing: 1.5
Review Comments:
The project is very well done, with an excellent background introduction, a well-thought-out research approach, and a beautifully written analysis report.
I particularly like that the classification threshold was selected to maximize recall and keep false negatives low, since false negatives are especially harmful when predicting cervical cancer!
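For context, here is a minimal sketch of how such a recall-oriented threshold could be chosen with scikit-learn; the function and variable names are hypothetical, not taken from the authors' code:

```python
# Hypothetical sketch: pick the strictest threshold that still keeps recall
# above a target, so that false negatives stay low.
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, y_scores, min_recall=0.95):
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # thresholds has one fewer entry than precision/recall, so align on [:-1]
    meets_target = recall[:-1] >= min_recall
    if not meets_target.any():
        return thresholds[0]  # most permissive threshold as a fallback
    # recall is non-increasing in the threshold, so the largest qualifying
    # threshold gives the best precision subject to the recall floor
    return thresholds[meets_target].max()
```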
The main() function in the model_training.py script is a little long. It might be better to split it into several smaller functions, one per model, and call those from main(); another approach would be a separate script for training each model. There is nothing wrong with the current layout, and my suggestion is only to improve readability; a rough sketch of what I mean follows this comment.
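```python
# Hypothetical sketch of the refactor suggested above; the model choices
# and helper names are illustrative, not the actual contents of
# model_training.py.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def train_logistic_regression(X_train, y_train):
    """Fit and return a logistic regression model."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)


def train_random_forest(X_train, y_train):
    """Fit and return a random forest model."""
    return RandomForestClassifier(random_state=522).fit(X_train, y_train)


def main(X_train, y_train):
    # main() only orchestrates the per-model helpers, which keeps it short
    return {
        "logistic_regression": train_logistic_regression(X_train, y_train),
        "random_forest": train_random_forest(X_train, y_train),
    }
```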
There are some error messages in the cervical_cancer_data_eda.ipynb file saying that background_gradient requires matplotlib; this most likely means matplotlib is missing from the computational environment, since pandas imports it internally when styling tables.
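For reference, the kind of call that triggers this message (the DataFrame here is a made-up example): pandas' Styler.background_gradient imports matplotlib itself and raises this error when matplotlib is absent, so the fix is likely to add matplotlib to the project's dependency file rather than to the notebook's imports.

```python
# Illustrative reproduction: background_gradient needs matplotlib installed,
# even though the notebook never imports matplotlib directly.
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 33], "num_pregnancies": [1, 3, 2]})
styled = df.style.background_gradient(cmap="viridis")  # ImportError if matplotlib is absent
```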
It might be good to have some subdirectories in the results folder to keep the files more organized, e.g., one for PR curve files, one for threshold files, etc., as sketched below.
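For example (the exact directory names are only suggestions):

```
results/
├── pr_curves/
└── thresholds/
```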