cultivarium / GenomeSPOT

Predict oxygen, temperature, salinity, and pH preferences of bacteria and archaea from a genome
https://cultivarium.org/
MIT License

Train and Evaluate Predictive Models #2

Closed tylerbarnum closed 3 months ago

tylerbarnum commented 3 months ago

The previous PR enabled use of models. This PR enables reproduction of methods and analyses in the paper.

The directory genomic_spot/model_training contains the vast majority of new code. You should read the descriptions in the README before reviewing. However, testing with the code provided in the README will be difficult, as the scripts take a very long time to run. A final PR for unit tests is forthcoming; the work could also be requested as part of this PR.

The directory notebooks now contains all notebooks used in the analyses. I imagine Alex may want to scan them.

Other changes:

tylerbarnum commented 3 months ago

I updated the setup.py to require Python 3.8.16 because I hit an issue importing Biopython (as in from Bio import) that could only be resolved by changing the Python version.

I've added tests for the main genome_spot module, the bioinformatics submodule, the taxonomy submodule (which uses taxonomy to balance and split datasets), and the key functions in the model_training submodule. I felt it was OK to have low test coverage in model_training because the key outputs of model training are tested elsewhere (e.g. measuring genomes and using the output models for predictions) and because the submodule will only be used when models are updated, which only the most advanced users will do.
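
For readers unfamiliar with taxonomy-aware splitting, here is a minimal sketch of the general idea, not the code in the taxonomy submodule. It assumes each genome is labeled with one taxon (e.g. a family) and assigns whole taxa to either the training or the test partition, so closely related genomes never end up on both sides. The function name split_by_taxon, its parameters, and the genome and taxon labels are hypothetical, and the sketch covers only splitting, not balancing.

```python
import random
from collections import defaultdict


def split_by_taxon(genome_to_taxon, test_fraction=0.2, seed=0):
    """Hypothetical helper: put every genome of a taxon in the same partition."""
    # Group genomes by their taxon label.
    groups = defaultdict(list)
    for genome, taxon in genome_to_taxon.items():
        groups[taxon].append(genome)

    # Shuffle taxa (not genomes) reproducibly.
    taxa = sorted(groups)
    random.Random(seed).shuffle(taxa)

    # Fill the test partition with whole taxa until it reaches the target size;
    # the last taxon added may overshoot the target slightly.
    target = test_fraction * len(genome_to_taxon)
    train, test = [], []
    for taxon in taxa:
        if len(test) < target:
            test.extend(groups[taxon])
        else:
            train.extend(groups[taxon])
    return sorted(train), sorted(test)


# Made-up example labels, for illustration only.
example = {
    "genome_a": "Bacillaceae",
    "genome_b": "Bacillaceae",
    "genome_c": "Halobacteriaceae",
    "genome_d": "Halobacteriaceae",
    "genome_e": "Enterobacteriaceae",
    "genome_f": "Enterobacteriaceae",
    "genome_g": "Thermaceae",
    "genome_h": "Thermaceae",
}
train_genomes, test_genomes = split_by_taxon(example, test_fraction=0.25, seed=1)
```

Splitting at the taxon level rather than the genome level keeps near-identical relatives from leaking between partitions, which would otherwise inflate held-out performance.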

knightjdr commented 3 months ago

The year in the license should be updated to 2024.

tylerbarnum commented 3 months ago

> The year in the license should be updated to 2024.

Why do you want to hurt my feelings?

tylerbarnum commented 3 months ago

I addressed all comments.

I scanned through the linting errors. The most concerning are sklearn issues that I don't know how to address well, but the code works with this sklearn version and we've pinned the version, so it should be OK. In many cases I decided not to make a change; e.g. where docstrings were missing, I think the function name is sufficient (e.g. parse_args).