Open rlorigro opened 1 month ago
Changed the title because, with just 2 parameters, a regular calibration curve might be simple enough instead of using Optuna. The third parameter is somewhat optional if we don't consider run time. Opinions?
Hmm, this might be small and interpretable enough that a rough grid search is warranted instead. There will be some development overhead and a bit of a learning curve to get the Optuna notebook up and running. But if you want to take a look, I think the Malaria_optuna_test.ipynb notebook in the workspace I just shared with you (https://app.terra.bio/#workspaces/broad-firecloud-dsde/malaria-filtering-optimization-staging_monica%20copy) might be the place to start.
And just to be clear, the approach here is to run Optuna within a notebook that launches your WDL with appropriate parameters. So you could even use a similar approach to do the grid search.
Yeah, on second thought, I can just make a Terra table with all the parameter combinations as rows and run them. That will probably be easier than using the notebook.
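A minimal sketch of how such a table could be generated, assuming Terra's tab-separated load-file convention where the first column header is `entity:<name>_id`. The entity name `tuning_run`, the parameter names `d_weight` and `n_weight`, and the grid values are all hypothetical placeholders, not the actual Hapestry inputs.

```python
import csv
import io
import itertools


def build_terra_table(grid, entity="tuning_run"):
    """Return a TSV string with one row per combination of parameter values.

    `grid` maps a parameter name to the list of values to try.
    The entity name and the row IDs are illustrative assumptions.
    """
    names = list(grid)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header: the ID column, then one column per parameter.
    writer.writerow([f"entity:{entity}_id", *names])
    # Cartesian product of all value lists gives the full grid.
    for i, values in enumerate(itertools.product(*(grid[n] for n in names))):
        writer.writerow([f"run_{i:03d}", *values])
    return buffer.getvalue()


# Hypothetical parameter names and values, for illustration only.
grid = {"d_weight": [0.5, 1.0, 2.0], "n_weight": [0.5, 1.0, 2.0]}
tsv = build_terra_table(grid)
print(tsv)
```

The resulting text can be saved to a `.tsv` file and uploaded to the workspace, after which the WDL can be launched once per row.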
Made some progress on this, using Fabio's latest addition to Hapestry, which includes small variants. More tests will be needed at the 1074-sample scale, but this is where we are at now:
All results are computed with vcfdist --realign-etc over the chr1:100Mbp-110Mbp region using 47 HPRC samples.
I've decided to accept the polynomial difference-from-best weighting scheme because it appears to be more stable and to reach a higher maximum. Unfortunately, precision and recall in this context are not very meaningful, so we are just using F1 to guide our decisions.
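For reference, F1 is the harmonic mean of precision and recall, so ranking configurations by F1 can be sketched as below. The result-row layout and the function names are assumptions made for illustration; they do not reflect vcfdist's actual output format.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_by_f1(results):
    """Return the result row with the highest F1.

    Each row is assumed to be a dict with 'precision' and 'recall' keys
    (a hypothetical layout, not a vcfdist format).
    """
    return max(results, key=lambda r: f1_score(r["precision"], r["recall"]))
```

Given one row per parameter combination, `best_by_f1(rows)` returns the configuration to carry forward.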
We are starting to accumulate some parameters in hapestry, which could benefit from parameter tuning (as with Optuna).
To start with, these seem like good candidates (in order of priority):
The d term (alignment distance function) as opposed to the n term (number of haplotypes used in the solution)

As an objective function for Optuna, the F1 of vcfdist could be used, either directly on the VCF or on the final output of the phasing pipeline.
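A rough sketch of what an Optuna study over the d and n terms might look like. The parameter names `d_weight` and `n_weight`, their ranges, and the `evaluate_f1` function are all hypothetical: `evaluate_f1` is a toy stand-in for launching the pipeline and reading back the vcfdist F1, and its values are not real results.

```python
def evaluate_f1(d_weight, n_weight):
    """Toy stand-in for 'run the pipeline, then read the vcfdist F1'.

    In practice this would launch the WDL with the given parameters and
    parse the resulting F1. The formula below is NOT real data; it is a
    smooth function with a single peak so the sketch is runnable.
    """
    return 1.0 / (1.0 + (d_weight - 1.0) ** 2 + (n_weight - 2.0) ** 2)


def objective(trial):
    """Optuna objective: sample the two weights and return the score."""
    # Ranges are assumptions chosen for illustration only.
    d_weight = trial.suggest_float("d_weight", 0.0, 4.0)
    n_weight = trial.suggest_float("n_weight", 0.0, 4.0)
    return evaluate_f1(d_weight, n_weight)


def run_study(n_trials=50):
    """Create a maximizing study and return (best_params, best_value)."""
    # Imported here so the objective can be exercised without Optuna installed.
    import optuna

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```

Calling `run_study(n_trials=50)` from a notebook would run the search. Each trial in a real setup is a full pipeline run, so the number of trials is bounded by compute cost rather than by Optuna itself.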