biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

The download of reference genomes #84

Open lipumpkin opened 2 years ago

lipumpkin commented 2 years ago

Hi Professor fasnicar, I have a question about the -g option of phylophlan_get_reference. I downloaded reference genomes for the genus Acinetobacter with this command:

phylophlan_get_reference -g g__Acinetobacter -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log

and got 227 genomes. However, assembly_summary_genbank.txt shows over 10,000 species belonging to the genus Acinetobacter. I then tried -n 300 and got 806 genomes. On what basis were these 227 or 806 genomes selected? And do they include all child taxa (species) of the genus with validly published names?
Thanks

fasnicar commented 2 years ago

Hi, the -n parameter is an upper limit ("up to") applied to each species individually. As an example, let's assume you specify (similar to the command you reported above):

phylophlan_get_reference -g g__Acinetobacter -o input_genomes/ -n 5

then up to 5 genomes for each species listed under g__Acinetobacter will be downloaded. Now, again for the sake of the example, assume there are only 3 species, listed here with their number of available genomes:

g__Acinetobacter|s__species_1    3
g__Acinetobacter|s__species_2    15
g__Acinetobacter|s__species_3    6

In total there are 24 genomes, but with -n 5 you end up downloading 13 (3 + 5 + 5): s__species_1 has only 3 genomes, while s__species_2 and s__species_3 are each capped at 5.
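
If it helps, the same "up to n per species" rule can be written as a small Python sketch. This is only an illustration of the arithmetic in the example above, not PhyloPhlAn's actual code; the function name capped_total and the dictionary are made up for this example.

```python
def capped_total(genomes_per_species, n):
    """Return how many genomes would be selected if at most `n`
    genomes are taken from each species (illustrative helper only)."""
    return sum(min(count, n) for count in genomes_per_species.values())


# Hypothetical species and genome counts from the example above.
example = {
    "s__species_1": 3,
    "s__species_2": 15,
    "s__species_3": 6,
}

print(sum(example.values()))     # all available genomes: 24
print(capped_total(example, 5))  # with -n 5: 3 + 5 + 5 = 13
print(capped_total(example, 1))  # with -n 1: one genome per species = 3
```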

Now, if you check phylophlan_get_reference -l | grep "g__Acinetobacter" | less -S you'll find:

k__Bacteria|p__Proteobacteria|[..]|f__Moraxellaceae|g__Acinetobacter       227     2984

The above means that there are 227 species listed under g__Acinetobacter and that, in total, 2984 genomes can be retrieved. So it makes sense that you downloaded 227 genomes with -n 1 (one per species) and 806 with -n 300, since s__Acinetobacter_baumannii alone has 2478 genomes and is capped at 300.
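
As a rough sanity check, the numbers can be reproduced with a short Python sketch. It assumes the listing line has the layout shown above (taxonomy, number of species, number of genomes, separated by whitespace) and that no species other than s__Acinetobacter_baumannii has more than 300 genomes; the helper name parse_listing_line is made up for this example and is not part of PhyloPhlAn.

```python
def parse_listing_line(line):
    """Split one listing line into (taxonomy, n_species, n_genomes).
    Assumes the last two whitespace-separated fields are the counts."""
    taxonomy, n_species, n_genomes = line.rsplit(None, 2)
    return taxonomy.strip(), int(n_species), int(n_genomes)


line = "k__Bacteria|p__Proteobacteria|[..]|f__Moraxellaceae|g__Acinetobacter       227     2984"
taxonomy, n_species, n_genomes = parse_listing_line(line)

baumannii = 2478  # genomes of s__Acinetobacter_baumannii, as stated above
cap = 300         # value passed to -n

# Genomes belonging to the remaining 226 species.
others = n_genomes - baumannii

# Assumption: every other species has at most `cap` genomes, so only
# s__Acinetobacter_baumannii is affected by the limit.
expected_n300 = min(baumannii, cap) + others

print(n_species)      # 227, matches the -n 1 download (one genome per species)
print(others)         # 506
print(expected_n300)  # 806, matches the -n 300 download
```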

I hope this helps.

Thanks, Francesco

lipumpkin commented 2 years ago

Hi, thank you very much.

I now fully understand the meaning of the -n parameter. Your answer has definitely helped me understand this tool better.

Thanks, Zikun