MrOlm / inStrain

Bioinformatics program inStrain
MIT License

Optimal dRep %; run dRep on both isolates and MAGs? #143

Closed liamfitzstevens closed 1 year ago

liamfitzstevens commented 1 year ago

Hi Matt,

I have a collection of genomes of human gut isolates from the same genus. I'd like to determine whether these strains are present in the metagenomes of related hosts. inStrain is, of course, perfect for this analysis.

1) The inStrain docs recommend a maximum dereplication threshold of 98% ANI. Is this still the case (i.e., should I definitely not go higher)? I only ask because, after isolating a bunch of strains (with a lot of subspecies overlap among them), it seems a shame to dereplicate so many out of the bioinformatic analysis. But it seems this is the way. Do you agree?

2) I also have MAGs. Many come from species for which we already have isolate genomes (the latter being higher quality), but some come from species of the same genus of interest for which we don't have isolates. Would it make sense to dereplicate all of my MAGs and isolate genomes together? At 98%, dRep would likely select the isolate genomes over the MAGs when both exist for a given subspecies, and pick the best MAGs for the MAG-only subspecies. In other words, this would construct a representative genome database composed of both isolate genomes and MAGs. Do you think this would be the best workflow?

Best, Liam

MrOlm commented 1 year ago

Hi Liam,

Nice hearing from you.

1) There are a couple of options here. You should not set the dRep ANI threshold higher than 98%: once near-identical genomes sit in the same database, reads can map equally well to more than one of them, and the per-genome results become unreliable. If you want to determine which strains are present in a sample by mapping, you could map to each genome individually and then run inStrain on each mapping. This is obviously much more compute, which is why it's not recommended. If you do this, though, you could look at the popANI and conANI reported for each genome being mapped to. The other strategy, mapping to a 98% representative genome set, allows you to then run inStrain compare to determine which samples share strains. I understand this can be a confusing difference; let me know if you have follow-up questions.

2) Yes, what you describe is what I would do. dRep also has an option to give extra weight to a specific set of genomes (in your case, the isolate genomes) so that they are picked over MAGs. I would use it to make sure you always pick your isolate genomes over MAGs. Let me know if you have questions about this.

Best, Matt
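
The extra-weight suggestion in point 2 can be sketched roughly as follows. This is a minimal illustration, not Matt's own command: the file names and the weight of 100 are made up, and the `--extra_weight_table` flag and its two-column, header-less, tab-separated format are as I recall them from the dRep documentation, so check `dRep dereplicate -h` for your installed version. The script only builds the table rows and the command line; it does not run dRep.

```python
from pathlib import Path


def extra_weight_lines(isolate_paths, weight=100):
    """Return 'genome<TAB>weight' rows, one per isolate genome.

    Assumption: dRep matches genomes by file name, and a larger weight
    makes a genome more likely to be chosen as the cluster representative.
    The default of 100 is an arbitrary illustrative value.
    """
    return [f"{Path(p).name}\t{weight}" for p in isolate_paths]


def drep_command(out_dir, genome_paths, weight_table, ani=0.98):
    """Build (but do not execute) a dRep dereplicate command line."""
    return [
        "dRep", "dereplicate", str(out_dir),
        "-g", *[str(p) for p in genome_paths],
        "-sa", str(ani),                             # secondary ANI threshold (98%)
        "--extra_weight_table", str(weight_table),   # assumed flag name, see lead-in
    ]


# Hypothetical file names for illustration only.
isolates = ["genomes/isolate_A.fasta", "genomes/isolate_B.fasta"]
mags = ["genomes/mag_001.fasta"]

lines = extra_weight_lines(isolates)
cmd = drep_command("drep_out", isolates + mags, "extra_weights.tsv")

print("\n".join(lines))   # rows to save as extra_weights.tsv
print(" ".join(cmd))      # command to inspect before running
```

Only the isolates appear in the table, so MAGs keep their default score and win a cluster only when no isolate is present in it.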

liamfitzstevens commented 1 year ago

Thanks Matt! I will make inStrain profiles with sub-species RGs (i.e., the 98% approach) and use compare.
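
For readers following the same route, here is a hedged sketch of how the output of inStrain compare might be screened for shared strains. It assumes a genome-level table with columns named `genome`, `name1`, `name2`, `percent_compared`, and `popANI`, and it uses the cutoffs I understand the inStrain documentation to suggest (popANI of at least 99.999% over at least 50% of the genome). Both the column names and the cutoffs should be checked against your own output and the current docs. The rows below are toy values, not real results.

```python
import csv
import io

POPANI_SAME_STRAIN = 0.99999   # assumed cutoff for "same strain"
MIN_PERCENT_COMPARED = 0.5     # assumed minimum fraction of genome compared


def shared_strains(tsv_text,
                   popani_min=POPANI_SAME_STRAIN,
                   compared_min=MIN_PERCENT_COMPARED):
    """Return (genome, sample1, sample2) tuples that pass both cutoffs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        enough_overlap = float(row["percent_compared"]) >= compared_min
        same_strain = float(row["popANI"]) >= popani_min
        if enough_overlap and same_strain:
            hits.append((row["genome"], row["name1"], row["name2"]))
    return hits


# Toy table with hypothetical sample and genome names.
toy_table = (
    "genome\tname1\tname2\tpercent_compared\tpopANI\n"
    "rg1.fasta\tmother.bam\tinfant.bam\t0.92\t0.999999\n"
    "rg1.fasta\tmother.bam\tfather.bam\t0.88\t0.9991\n"
    "rg2.fasta\tmother.bam\tinfant.bam\t0.20\t1.0\n"
)

hits = shared_strains(toy_table)
print(hits)
```

The third toy row is excluded despite a popANI of 1.0 because too little of the genome was compared for the value to be trusted.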