bsmith89 / StrainFacts

Factorize metagenotypes to infer strains and their abundances
MIT License
11 stars 1 forks source link

How to perform model selection? #3

Closed gavinmdouglas closed 2 years ago

gavinmdouglas commented 2 years ago

Hi there,

Apologies if I missed it, but what metric should be used to compare how well models that fit differing numbers of strains compare? For instance, I was hoping to be able to parse something like the AIC or BIC from the model output somehow.

Perhaps this will depend on what model type I specify? See below for an example of how I'm running the command.

sfacts fit \
    --verbose \
    --num-strains 2 \
    --random-seed 0 \
    ./Snodgrassella_alvi_yqjC_metagenotype.loaded \
    ./Snodgrassella_alvi_yqeY_metagenotype.fit

Thanks,

Gavin

bsmith89 commented 2 years ago

Unfortunately, StrainFacts does not provide a default way to do model selection. We have not explored the possibility of using something like BIC, but I believe the major challenge is calculating the effective number of parameters, given that we use regularizing priors (i.e. rho_hyper, pi_hyper, and gamma_hyper). Fortunately, this regularization also means that you don't generally need to specify the "correct" number of strains to get accurate estimates, since extra strains are penalized that might lead to overfitting. So, for some datasets, it's effective to simply run the model with extra strains and see how many are active in the final estimate (analogous to LASSO regression).

On the other hand, this does leave the problem of hyper-parameter selection. I have several ideas about how to evaluate the quality of a fit, but all of them are heuristic and they should probably be considered holistically.

Sorry I don't have a more principled method for you. If your data is not too large, Strain Finder's approach may also be worth checking out (they use AIC). I can also point you at some of the utility scripts that I used for passing metagenotypes to/from the Strain Finder CPickle format when I was benchmarking.

gavinmdouglas commented 2 years ago

Thanks for the information @bsmith89!

I am aiming to run StrainFacts on many different individual genes, so there aren't many SNP positions.

I didn't appreciate that extra strains would be penalized like that though - so would it be reasonable to set the number of strains to the upper limit of what I think could be possible and then filter out strains afterwards based on a reasonable relative abundance cut-off (e.g., required to be at 10% abundance in at least one sample)? Would it be silly to run StrainFacts with (for example) the number of strains set to 50 if in reality there were only 2? Should the penalization procedure be able to correctly handle that kind of situation?

Thanks!

Gavin

bsmith89 commented 2 years ago

I am aiming to run StrainFacts on many different individual genes, so there aren't many SNP positions.

Cool! Definitely not the main use case that SF was designed for, but I'll be very curious how it turns out.

Given the low number of sites, you might also consider comparing the results to another tool like Strain Finder, since my "fuzzy genotype" approximation might lower your power to detect different strains to some extent.

Or maybe, if you're confident your genes of interest are ubiquitous and single-copy in the species, consider concatenating metagenotypes for all of the genes together and running SF on that as well. More SNPs is often helpful.

I didn't appreciate that extra strains would be penalized like that though - so would it be reasonable to set the number of strains to the upper limit of what I think could be possible and then filter out strains afterwards based on a reasonable relative abundance cut-off (e.g., required to be at 10% abundance in at least one sample)?

Yep, this is essentially how I use it. The regularization of strain numbers (instead of model comparison) is especially useful in very large problems where refitting the model would be too time consuming, and this is what SF was designed for. However, it sounds like your problem might be much smaller (fewer SNPs, just a couple of strains, not sure how many samples...). If you're not getting useful results from SF, I'd definitely suggest you take a look at DESMAN.

Should the penalization procedure be able to correctly handle that kind of situation?

In practice, I've found SF is not particularly sensitive to specifying too many strains. Performance seems to be largely determined by sequencing depth and the amount of strain sharing across samples (rare strains are harder to deconvolve from mixed samples).

gavinmdouglas commented 2 years ago

Thanks for the information!