Open · elximo opened this issue 1 year ago
Hey @elximo, thanks for your patience and useful bug report.
I'm taking a look at your error now. `evaluate_fit` isn't the most carefully designed sub-command, and I may have assumed that samples were indexed by integers, as they are in the simulation. If that's the cause, the fix should be pretty straightforward, and I'll push a patch.
Stay tuned.
Okay, I think I may have fixed your bug. Please try running your example again with the most recent commit, and let me know if it works. If not, I'll be happy to do some more digging with your minimal example in hand.
You also asked:
I need some help getting the evaluation metrics generated. Since I have no ground truth, what should be the correct way to obtain them?
`evaluate_fit` has only one accuracy index that can be run without ground truth. It's what I've been calling the "metagenotype error", which is effectively the absolute difference between the observed number of counts for each allele and the expected number. You can find the code for that function here. Some types of fitting problems will show up as large metagenotype errors (both overall and within individual samples).
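To make the idea concrete, here is a minimal sketch of a metagenotype-error calculation as described above: the absolute difference between observed allele counts and the counts expected under the fitted allele fractions. This is not StrainFacts' actual implementation (its function lives in the repository linked above); the function name, array shapes, and toy data here are all illustrative assumptions.

```python
import numpy as np

def metagenotype_error(observed, expected_frac):
    """Hypothetical sketch of a metagenotype error.

    observed:      counts, shape (sample, position, allele)
    expected_frac: model-predicted allele fractions, same shape
                   (summing to 1 over the allele axis)
    Returns the overall error and a per-sample breakdown.
    """
    depth = observed.sum(axis=-1, keepdims=True)   # read depth per site
    expected = expected_frac * depth               # expected allele counts
    abs_err = np.abs(observed - expected)
    return abs_err.sum(), abs_err.sum(axis=(1, 2))

# Toy example: one sample, two positions, two alleles.
obs = np.array([[[8, 2], [5, 5]]])
frac = np.array([[[0.9, 0.1], [0.5, 0.5]]])
overall, per_sample = metagenotype_error(obs, frac)
# Expected counts are [[9, 1], [5, 5]], so the total error is 2.0.
```

Large per-sample values flag samples the model fits poorly, which is the "within individual samples" diagnostic mentioned above.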
The second approach is less formal, and involves visual inspection of the metagenotypes and inferences. Take a look at the evaluation example notebook for some ideas about how you might do that. Unfortunately I haven't yet found any one-size-fits-all way to assess model performance or tuning, and manual inspection is still my standard approach.
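For the visual-inspection route, one common starting point (my sketch, not the notebook's exact code) is a heatmap of the major-allele fraction per sample and position: well-resolved strain mixtures tend to show blocky structure, while noise looks unstructured. The toy counts and file name below are illustrative.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

# Hypothetical counts: (sample, position, allele)
counts = np.array([
    [[9, 1], [2, 8], [5, 5]],
    [[0, 10], [10, 0], [7, 3]],
])
depth = counts.sum(axis=-1)
# Fraction of reads supporting the majority allele at each site,
# guarding against zero-depth positions.
major_frac = counts.max(axis=-1) / np.where(depth > 0, depth, 1)

fig, ax = plt.subplots()
im = ax.imshow(major_frac, vmin=0.5, vmax=1.0, aspect="auto", cmap="viridis")
ax.set_xlabel("position")
ax.set_ylabel("sample")
fig.colorbar(im, label="major allele fraction")
fig.savefig("mgen_inspection.png")
```

In practice you would pull `counts` out of the metagenotype NetCDF file rather than hard-coding it.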
While I have yet to update the default settings, I have recently been getting the most consistent performance with the following fitting parameters:
```
python3 -m sfacts fit \
    --model-structure model4 \
    --num-strains <S> \
    --hyperparameters gamma_hyper=1e-15 pi_hyper=0.01 pi_hyper2=0.01 rho_hyper=1.0 rho_hyper2=1.0 \
    --anneal-hyperparameters gamma_hyper=0.999 \
    --anneal-steps 120000 \
    --optimizer-learning-rate 0.05 \
    --min-optimizer-learning-rate 1e-2 \
    -- <INPATH> <OUTPATH>
```
Set `--num-strains` to about double the number of strains you expect.
If you are fitting on a CPU, you might need to reduce the number of annealing steps. (Try 10,000 to start, and increase the number of steps if it is fast enough.)
Hope this helps!
Hello Byron,
Thanks for your help with loading and fitting the data. I need some help generating the evaluation metrics. Since I have no ground truth, what is the correct way to obtain them? I am using the same dataset I shared with you in #7.
When I run the following command:

```
sfacts evaluate_fit --outpath SampleM.eval_all_fits.tsv SampleM.filt.mgen.nc SampleM.filt.ss-0.fit.world.nc SampleM.filt.ss-0.fit2.world.nc SampleM.filt.fit3.world.nc
```

I get the following error
When I ran the same commands as in the evaluation notebook, I got exactly the same exception with a more elaborate trace.
Since the exception is not triggered from your code, it is likely that I am doing something wrong with my choice of files, or that I should have used a different function than `sim.sel()`. Could you please point me to the right function to use (or the right `sfacts evaluate_fit` command to run) for my case? This is greatly appreciated.