AstrobioMike / bit

Bioinformatics Tools
GNU General Public License v3.0
81 stars 10 forks source link

[Metagenomics wf] capture and report proportion of reads represented by MAGs for each sample #25

Open AstrobioMike opened 2 weeks ago

AstrobioMike commented 2 weeks ago
ctb commented 2 weeks ago
  • or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation...

Yep! This is precisely what the f_unique_weighted and n_unique_weighted_found columns provide from sourmash gather output. f_unique_weighted is an estimate of the proportion of bases in the total read data set that will map; n_unique_weighted_found is an estimate of the number of bases that will map. And, in case you want it, f_match is the fraction of the MAG that will be covered by mapped reads (aka "detection"). See Irber et al., 2022 for science and algorithms, and sourmash docs for column details. And if you want friendlier text, read from here on down in the FAQ ;).

As you intuit, they are likely a lower bound when using k=21 or k=31; mapping will be a bit more flexible in its matching.

One convenience, should you wish it - if the MAGs have not been dereplicated, you can still use the results just fine. sourmash will not double count reads/bases; see https://github.com/sourmash-bio/sourmash/issues/3188 for more discussion.

Another convenience is that you can take the output of sourmash gather and convert it into a variety of taxonomic reports with sourmash tax metagenome; lmk if you'd like to hear more.

Last but by no means least, we now have multithreaded versions of gather that are wicked fast - sourmash scripts fastgather and sourmash scripts fastmultigather, from the branchwater plugin.

some example commands

mamba create -y -n smash 'sourmash_plugin_branchwater>=0.9.5'
mamba activate smash

sourmash sketch mags*.fasta -p k=31 -o MAGs.sig.zip
sourmash sketch metagenome-R1.fq.gz metagenome-R2.fq.gz -p k=31,abund -o metagenome.sig.zip

sourmash scripts fastgather -c 32 metagenome.sig.zip MAGs.sig.zip -o metag.x.MAGs.csv

and then inspect metag.x.MAGs.csv for the columns above.

HTH, ask questions as you have them!

AstrobioMike commented 2 weeks ago

You rock, @ctb! Thanks for the overview, details, and quick code example! I should randomly ping you more often :)