Gardner-BinfLab / deltaBS

Quantifying the significance of genetic variation using probabilistic profile-based methods.
MIT License

Some questions regarding data prep? #6

Open bananabenana opened 1 year ago

bananabenana commented 1 year ago

Hi,

Thanks for developing deltaBS. I am having a bit of trouble prepping my data and was hoping you could clarify a few things.

The background: I have 8k Klebsiella genomes I want to generate DBS values for.

Thanks

nwheeler443 commented 1 year ago

Hi,

Thanks for using deltaBS!

Re: generating the HMM bitscore data quickly, the best approach we've got for running many genomes is to run the search on each proteome file individually and submit many jobs in parallel. It can still be pretty time-consuming, but that is a general issue when analysing every gene in a large collection from a species with a big pangenome. There's no meaningful difference in results between hmmsearch and hmmscan, so if your genes:models ratio favours the reverse, go for it.
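
A minimal sketch of the one-search-per-proteome approach, assuming HMMER's `hmmsearch` is on your PATH and each genome has its own protein FASTA file. The function names, the `*.faa` pattern, and the output naming are hypothetical choices for illustration and are not part of deltaBS. The sketch only produces per-genome `--tblout` tables and does not cover how deltaBS itself consumes them. On a cluster you would more likely submit each command as its own scheduler job rather than use a local thread pool.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_hmmsearch_cmd(hmm_db, proteome, out_dir, cpus=1):
    """Build the hmmsearch argument list for a single proteome file.

    Output naming (<proteome stem>.tblout) is an illustrative assumption.
    """
    proteome = Path(proteome)
    tblout = Path(out_dir) / f"{proteome.stem}.tblout"
    return [
        "hmmsearch",
        "--cpu", str(cpus),        # worker threads per search
        "--noali",                 # skip alignment output to save time and space
        "--tblout", str(tblout),   # per-sequence hit table
        "-o", os.devnull,          # discard the human-readable report
        str(hmm_db),
        str(proteome),
    ]


def run_all(hmm_db, proteome_dir, out_dir, jobs=4, pattern="*.faa"):
    """Run one hmmsearch per proteome, `jobs` at a time.

    Returns a dict mapping proteome file name to the process exit code.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    proteomes = sorted(Path(proteome_dir).glob(pattern))
    cmds = [build_hmmsearch_cmd(hmm_db, p, out_dir) for p in proteomes]

    def run(cmd):
        # Threads are sufficient here: the work happens in the child process.
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(run, cmds))
    return {p.name: code for p, code in zip(proteomes, codes)}
```

Calling `run_all("models.hmm", "proteomes/", "hmmsearch_out/", jobs=8)` (all three paths hypothetical) would write one table per genome. Any non-zero exit code in the returned dict marks a search to re-run.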

Re: choice of HMMs, there are pros and cons to using EggNOG vs custom models. Custom models will cover more of your pangenome, but are time-consuming to build. EggNOG models have less coverage but are built using a more sophisticated, curated process, so they can potentially be higher quality. If you do want to build your own reference HMM database, using the reference sequence for each gene from Panaroo would be ideal, and yes, you've identified the right script.