GenoML / genoml2

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data.
Apache License 2.0

Adding functionality to test additional datasets on tuned models #12

Closed by m-makarious 3 years ago

m-makarious commented 4 years ago

Describe Current Behavior/State and Recommended Feature Request: Currently, when testing additional datasets, the harmonization and testing steps accept only a trained (not tuned) model as input. We should add a way to test a tuned model on additional incoming datasets.

This is a pretty no-brainer feature to include; I just didn't have the time to figure it out before the first release.

Will this change the current API? How? Yup, this will add flags to the tuning step so that the data are restricted to the harmonized columns before tuning.

Who Will Benefit from this Feature? I think everyone?

m-makarious commented 3 years ago

The README has been updated to reflect the steps on how to test a trained and subsequently tuned model.

Briefly: after munging, training, harmonizing your test data, and retraining on the features of your test dataset, you can tune the retrained model on those same features.

The commands would look like the following:

# Tuning the retrained model using the final harmonized columns:
genoml discrete supervised tune \
--prefix outputs/test_discrete_geno \
--matching_columns outputs/validation_test_discrete_geno.finalHarmonizedCols_toKeep.txt

# Testing the tuned model on an unseen dataset (changed suffix from .trainedModel to .tunedModel):
genoml discrete supervised test \
--prefix outputs/validation_test_discrete_geno \
--test_prefix outputs/validation_test_discrete_geno \
--ref_model_prefix outputs/test_discrete_geno.tunedModel
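
As a sanity check before tuning, it can help to confirm that the harmonized column list exists and is not empty. The following is only a sketch: `check_matching_columns` is a hypothetical helper, not a GenoML command, and it assumes the file holds one entry per line. Only the file path is taken from the commands above.

```shell
# Hypothetical helper (not part of GenoML): fail early if the harmonized
# column list is missing or empty, before starting a long tuning run.
check_matching_columns() {
  cols_file="$1"
  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$cols_file" ]; then
    echo "Missing or empty column list: $cols_file" >&2
    return 1
  fi
  # Count non-blank lines (assumption: one entry per line).
  n_cols=$(grep -c . "$cols_file" || true)
  echo "Found $n_cols entries in $cols_file"
}

# Possible usage with the paths from the commands above:
# check_matching_columns outputs/validation_test_discrete_geno.finalHarmonizedCols_toKeep.txt \
#   && genoml discrete supervised tune \
#        --prefix outputs/test_discrete_geno \
#        --matching_columns outputs/validation_test_discrete_geno.finalHarmonizedCols_toKeep.txt
```

If the check fails, the harmonize and retrain steps most likely have not been run yet for that prefix.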

Proceed with caution - this might still be a bit buggy, so looking for feedback!