bluenote-1577 / sylph

ultrafast taxonomic profiling and genome querying for metagenomic samples by abundance-corrected minhash.
MIT License

Integrating results generated from pre-built databases across domains #11

Closed by bcpd 4 months ago

bcpd commented 5 months ago

Hi -

I'm interested in obtaining estimates of relative abundance across all domains of life. Is this possible with sylph? If so, it's unclear to me if that would entail concatenation of pre-built databases or whether a simple merge (after sylph_to_taxprof.py) would be sufficient.

Thanks very much.

bluenote-1577 commented 5 months ago

Hi @bcpd,

You could conceivably do this with sylph. For example, if you wanted to profile both eukaryotes and prokaryotes, the correct way is to pass both databases to a single profile command:

sylph profile eukaryotes.syldb prokaryotes.syldb (samples)

This concatenates the databases. If you instead profiled against each database separately and merged the results, the relative abundances wouldn't be comparable (e.g. eukaryote species A may have 50% abundance compared to other eukaryotes, but only 1% abundance across all bacterial+euk species).

Let me know if you have any other questions,

Jim
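
To make the 50% versus 1% example above concrete, here is a minimal arithmetic sketch. The species names and abundance values are made up for illustration, and the normalization is plain division by the total; it is not how sylph itself estimates abundance, only a toy model of why per-database percentages cannot simply be merged.

```python
# Hypothetical per-species abundances in arbitrary units (not real sylph output).
euk = {"euk_species_A": 1.0, "euk_species_B": 1.0}
prok = {"bac_species_X": 60.0, "bac_species_Y": 38.0}


def normalize(abundances):
    """Scale the values so that they sum to 100 (percent)."""
    total = sum(abundances.values())
    return {name: 100.0 * value / total for name, value in abundances.items()}


# Profiling each database separately normalizes within that database only,
# so a naive merge of the two tables sums to 200%, not 100%.
merged = {**normalize(euk), **normalize(prok)}

# Profiling against both databases at once normalizes over all genomes together.
joint = normalize({**euk, **prok})

print(merged["euk_species_A"])  # 50.0
print(joint["euk_species_A"])   # 1.0
```

In this toy case the merged table reports species A at 50% because it was only compared against other eukaryotes, while the joint table reports 1% because the denominator includes the bacteria as well.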

bcpd commented 4 months ago

Excellent - thank you! A related question: for taxonomic profiling, would sylph_to_taxprof.py work with multiple metadata files?

bluenote-1577 commented 4 months ago

@bcpd sylph_to_taxprof.py would not work with multiple metadata files, but you can do

zcat metadata_file1.tsv.gz metadata_file2.tsv.gz ... > all_metadata_file.tsv

and the all_metadata_file.tsv should work. Basically, the metadata file is just a 2-column file indicating the mapping of genome name to taxonomy string like "d__bacteria;p__....". See https://github.com/bluenote-1577/sylph/wiki/Integrating-taxonomic-information-with-sylph#custom-taxonomies-and-how-it-works
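
For readers without zcat, the same concatenation can be sketched in Python. This is only an illustration: the function name concat_metadata, the demo file names, and the placeholder rows are hypothetical, and the 2-column check assumes the files contain nothing but genome-to-taxonomy rows (no header or blank lines), as described above.

```python
import gzip
import os
import tempfile


def concat_metadata(gz_paths, out_path):
    """Decompress each gzipped TSV and append its lines to out_path.

    Roughly equivalent to: zcat file1.tsv.gz file2.tsv.gz > out_path
    Returns the number of lines written.
    """
    n_lines = 0
    with open(out_path, "wt") as out:
        for path in gz_paths:
            with gzip.open(path, "rt") as handle:
                for line in handle:
                    # Assumed layout: genome name <TAB> taxonomy string.
                    if len(line.rstrip("\n").split("\t")) != 2:
                        raise ValueError(f"{path}: expected 2 columns, got {line!r}")
                    out.write(line)
                    n_lines += 1
    return n_lines


# Demo with two tiny made-up metadata files (placeholder names and taxonomy).
tmp = tempfile.mkdtemp()
rows = {
    "meta1.tsv.gz": "genomeA.fa\ttaxonomy_string_A\n",
    "meta2.tsv.gz": "genomeB.fa\ttaxonomy_string_B\n",
}
for name, row in rows.items():
    with gzip.open(os.path.join(tmp, name), "wt") as handle:
        handle.write(row)

out_file = os.path.join(tmp, "all_metadata_file.tsv")
count = concat_metadata([os.path.join(tmp, name) for name in rows], out_file)
print(count)
```

The column check is stricter than zcat, which copies bytes without looking at them; drop it if your metadata files legitimately contain other kinds of lines.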

bcpd commented 4 months ago

Great -- thanks very much.

bluenote-1577 commented 1 week ago

@bcpd

sylph_to_taxprof.py works with multiple metadata files now. You can do sylph_to_taxprof.py -m file1.tsv.gz file2.tsv.gz

I forgot when I added this change, but I'm adding this comment in now so future readers will not be confused.