Open AstrobioMike opened 2 weeks ago
- or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation...
Yep! This is precisely what the f_unique_weighted
and n_unique_weighted_found
columns provide from sourmash gather
output. f_unique_weighted
is an estimate of the proportion of bases in the total read data set that will map; n_unique_weighted_found
is an estimate of the number of bases that will map. And, in case you want it, f_match
is the fraction of the MAG that will be covered by mapped reads (aka "detection"). See Irber et al., 2022 for science and algorithms, and sourmash docs for column details. And if you want friendlier text, read from here on down in the FAQ ;).
As you intuit, they are likely a lower bound when using k=21 or k=31; mapping will be a bit more flexible in its matching.
One convenience, should you wish it - if the MAGs have not been dereplicated, you can still use the results just fine. sourmash will not double count reads/bases; see https://github.com/sourmash-bio/sourmash/issues/3188 for more discussion.
Another convenience is that you can take the output of sourmash gather
and convert it into a variety of taxonomic reports with sourmash tax metagenome
; lmk if you'd like to hear more.
Last but by no means least, we now have multithreaded versions of gather that are wicked fast - sourmash scripts fastgather
and sourmash scripts fastmultigather
, from the branchwater plugin.
mamba create -y -n smash 'sourmash_plugin_branchwater>=0.9.5'
mamba activate smash
sourmash sketch mags*.fasta -p k=31 -o MAGs.sig.zip
sourmash sketch metagenome-R1.fq.gz metagenome-R2.fq.gz -p k=31,abund -o metagenome.sig.zip
sourmash scripts fastgather -c 32 metagenome.sig.zip MAGs.sig.zip -o metag.x.MAGs.csv
and then inspect metag.x.MAGs.csv
for the columns above.
HTH, ask questions as you have them!
You rock, @ctb! Thanks for the overview, details, and quick code example! I should randomly ping you more often :)