MrOlm / inStrain

Bioinformatics program inStrain
MIT License

to include or remove plasmids in reference genome databases? #123

Closed by brittanysuttner 1 year ago

brittanysuttner commented 1 year ago

Hello, I am creating my own reference database for strain typing of pathogens with inStrain. I have a dereplicated reference collection of complete RefSeq genomes of interest; however, many of these genomes have plasmids, and I am wondering how I should handle them. I think some of the variant-specific content (e.g. virulence genes in E. coli) can be plasmid-encoded, so selecting a random reference genome for a species (e.g. just selecting E. coli K12 as the species representative) and excluding the plasmids may cause us to type the strains inaccurately. Just curious if you have any thoughts about this.

MrOlm commented 1 year ago

Hello,

Yeah, this is a very interesting question. This really comes down to the main problem of using reference genomes like this, which is that you lose the accessory gene content of other genomes in the species that aren't in your reference genome. Unfortunately this is currently just a downside of using this method, and hopefully in the coming years better methods will be developed to include accessory genes in the reference database (somehow).

There are things you can do to address this now, however, depending on how important the accessory genes are to you and how much you want to mess around with this.

If you know there are particular plasmids / genes of interest that are not in your reference database and you would like profiled, you can always just add them to your database and use a tool like ScaffoldLevel_dRep.py (https://github.com/MrOlm/drep/tree/master/helper_scripts) to make sure that they're distinct enough from the rest of the sequences in your database that they won't steal reads.
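As a rough illustration of the "just add them to your database" step, here is a minimal Python sketch that appends extra plasmid / gene records to an existing reference FASTA. The function names and file layout are hypothetical and not part of inStrain or dRep. It only skips records whose ID or exact sequence is already present; that is a much weaker check than a similarity-based comparison such as the one ScaffoldLevel_dRep.py is suggested for above, so it does not replace that step.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def append_sequences(reference_fasta, extra_fasta, combined_fasta):
    """Write the reference records plus any new extra records.

    Hypothetical helper: an extra record is skipped when its ID or its
    exact (case-insensitive) sequence is already in the output. Near
    duplicates are NOT detected here.
    Returns the IDs of the extra records that were added.
    """
    seen_ids, seen_seqs = set(), set()
    added = []
    with open(combined_fasta, "w") as out:
        for source, is_extra in ((reference_fasta, False), (extra_fasta, True)):
            for header, seq in read_fasta(source):
                seq_id = header.split()[0]
                key = seq.upper()
                if is_extra and (seq_id in seen_ids or key in seen_seqs):
                    continue
                seen_ids.add(seq_id)
                seen_seqs.add(key)
                out.write(f">{header}\n{seq}\n")
                if is_extra:
                    added.append(seq_id)
    return added
```

Calling something like `append_sequences("reference.fasta", "plasmids.fasta", "combined.fasta")` (placeholder file names) would return the IDs that were actually added, which is a convenient list to carry into the distinctness check.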

Additionally, you could imagine using your current reference collection as a "first pass" to identify which pathogens are present, and when a pathogen is detected, following up with an additional species-specific database that includes all of the genes of interest for that particular pathogen.
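A two-pass setup like this could be sketched roughly as below. Everything here is an assumption for illustration: the tab-separated per-genome table with `genome` and `breadth` columns, the 0.5 breadth cutoff for calling a genome "detected", and the mapping from genome name to a species-specific database path are all hypothetical and should be adapted to your own outputs.

```python
import csv


def detected_genomes(genome_table_path, min_breadth=0.5):
    """Return genomes whose breadth of coverage meets the cutoff.

    Assumes a tab-separated table with 'genome' and 'breadth' columns;
    both the column names and the default cutoff are illustrative.
    """
    detected = []
    with open(genome_table_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if float(row["breadth"]) >= min_breadth:
                detected.append(row["genome"])
    return detected


def follow_up_databases(detected, species_databases):
    """Map each detected genome to its species-specific database, if one exists.

    'species_databases' is a hypothetical dict of genome name -> FASTA path.
    """
    return {g: species_databases[g] for g in detected if g in species_databases}
```

The second pass would then map the same reads against each database returned by `follow_up_databases`, so accessory genes and plasmids are only profiled for pathogens that were actually detected in the first pass.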

Hope this is helpful! -Matt