to include or remove plasmids in reference genome databases?

Hello,

Yeah, this is a very interesting question. It really comes down to the main problem with using reference genomes like this: you lose any accessory gene content from other genomes in the species that is missing from your reference genome. Unfortunately, this is currently just a downside of the method, and hopefully in the coming years better methods will be developed to include accessory genes in the reference database (somehow).

There are things you can do to address this now, however, depending on how important the accessory genes are to you and how much you want to mess around with this.

If you know there are particular plasmids / genes of interest that are not in your reference database and you would like profiled, you can always just add them to your database and use a tool like ScaffoldLevel_dRep.py (https://github.com/MrOlm/drep/tree/master/helper_scripts) to make sure that they're distinct enough from the rest of the sequences in your database that they won't steal reads.
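As a rough sketch of the "just add them to your database" step, something like the following could append extra plasmid / gene sequences to a reference FASTA while refusing duplicate scaffold names. The function names (`parse_fasta`, `scaffold_id`, `add_sequences`) and the example sequences are made up for illustration and are not part of inStrain or dRep. Note that this only checks names; it does not test whether the new sequences are distinct enough to avoid stealing reads, which is what ScaffoldLevel_dRep.py is for.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def scaffold_id(header):
    """Return the first whitespace-delimited token of a FASTA header."""
    parts = header.split()
    return parts[0] if parts else ""


def add_sequences(reference_text, extra_text):
    """Append records from extra_text to reference_text.

    Hypothetical helper: raises ValueError if a new scaffold name
    collides with one that is already in the database.
    """
    records = list(parse_fasta(reference_text))
    seen = {scaffold_id(h) for h, _ in records}
    for header, seq in parse_fasta(extra_text):
        name = scaffold_id(header)
        if name in seen:
            raise ValueError(f"duplicate scaffold name: {name}")
        seen.add(name)
        records.append((header, seq))
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Toy example with made-up sequences
reference = ">genomeA_scaffold1\nACGTACGT\n"
plasmid = ">plasmidX\nGGCCGGCC\n"
combined = add_sequences(reference, plasmid)
```

The combined text can then be written to a new FASTA file and checked for overly similar scaffolds before it is used for mapping.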

Additionally, you could imagine using your current reference collection as a "first pass" to identify which pathogens are present, and when a pathogen is detected, following up with an additional species-specific database that includes all of the genes of interest for that particular pathogen.
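A minimal sketch of that two-pass idea, assuming you have already summarized the first pass as a per-species breadth of coverage and that you maintain your own lookup of species-specific databases. The function name, the 0.5 breadth cutoff, the species names, and the file paths below are all placeholders I made up, not inStrain defaults or outputs.

```python
def choose_followup_databases(first_pass_breadth, species_databases, min_breadth=0.5):
    """Pick species-specific databases for species detected in the first pass.

    first_pass_breadth: dict of species name -> breadth of coverage (0 to 1)
    species_databases:  dict of species name -> path to a follow-up database
    min_breadth:        hypothetical detection cutoff; tune for your data
    """
    followups = {}
    for species, breadth in first_pass_breadth.items():
        if breadth >= min_breadth and species in species_databases:
            followups[species] = species_databases[species]
    return followups


# Toy example with made-up values
first_pass = {"Species A": 0.92, "Species B": 0.10, "Species C": 0.75}
databases = {
    "Species A": "db/species_A_with_plasmids.fasta",
    "Species B": "db/species_B_with_plasmids.fasta",
}
to_run = choose_followup_databases(first_pass, databases)
```

Here only Species A ends up in `to_run`: Species B falls below the cutoff and Species C has no follow-up database defined. Each selected database would then get its own mapping and profiling run.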

Hope this is helpful! -Matt

MrOlm / inStrain
