biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

database of phylophlan 3.0 #19

Closed wangpeng407 closed 4 years ago

wangpeng407 commented 4 years ago

Dear Francesco, thanks for the brilliant tool. When I run the test command phylophlan_setup_database -g s__Staphylococcus_aureus -o 01_saureus --verbose, it sometimes fails, apparently because of a slow or unstable network connection, with messages like these:

Downloading "http://www.uniprot.org/uniref/UniRef90_Q5HNU0.fasta" to "./s__Staphylococcus_aureus/Q5HNU0.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_Q5HNU0.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_Q2FHD9.fasta" to "./s__Staphylococcus_aureus/Q2FHD9.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_Q2FHD9.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A7X5K8.fasta" to "./s__Staphylococcus_aureus/A7X5K8.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A7X5K8.fasta"
Downloading "http://www.uniprot.org/uniref/UniRef90_A0A181DVE6.fasta" to "./s__Staphylococcus_aureus/A0A181DVE6.faa"
[e] unable to download "http://www.uniprot.org/uniref/UniRef90_A0A181DVE6.fasta"
......

So, would it be possible to modify the script so that it extracts the sequences from the downloaded "uniref90.fasta.gz" by UniRef ID? If so, I think that would be more convenient and user-friendly.
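
To illustrate the idea, here is a minimal sketch of such a local extraction. It is not part of PhyloPhlAn: the function name extract_uniref90_records is hypothetical, and it assumes that each FASTA header in the file starts with ">UniRef90_<ID>" followed by whitespace.

```python
import gzip


def extract_uniref90_records(fasta_gz, wanted_ids):
    """Return {UniRef90 ID: FASTA record text} for the requested IDs.

    Hypothetical helper, not PhyloPhlAn code. Assumes each header line
    starts with '>UniRef90_<ID>' followed by whitespace.
    """
    # Accept both bare accessions ("Q5HNU0") and full IDs ("UniRef90_Q5HNU0").
    wanted = {
        uid if uid.startswith("UniRef90_") else "UniRef90_" + uid
        for uid in wanted_ids
    }
    found = {}
    current = None  # ID of the record being collected, or None when skipping

    # Stream the compressed file line by line so it never has to fit in memory.
    with gzip.open(fasta_gz, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split(None, 1)
                header_id = fields[0] if fields else ""
                current = header_id if header_id in wanted else None
                if current is not None:
                    found[current] = [line]
                elif len(found) == len(wanted):
                    break  # every requested record has been read in full
            elif current is not None:
                found[current].append(line)

    return {uid: "".join(lines) for uid, lines in found.items()}
```

Each returned record could then be written to a file such as ./s__Staphylococcus_aureus/Q5HNU0.faa. IDs missing from the result were not present in the local file under that name.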

Thank you very much.

fasnicar commented 4 years ago

Hi,

Thanks for reporting this. At least some of the errors you're seeing might not be caused by a bad or unstable connection; they may instead be due to IDs having changed in UniRef, so those IDs can no longer be downloaded directly. phylophlan_setup_database stores these errors and retries them right afterwards, using UniRef's APIs to resolve the old UniRef90 IDs into the newer ones. If the second attempt also fails for some of the IDs, you'll find in the output folder a file named <taxonomic_label>_core_proteins_not_mapped.txt, which lists the UniRef IDs that could not be downloaded.
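
The two-pass logic described above can be sketched roughly as follows. This is only an illustration, not the actual code of phylophlan_setup_database: the function fetch_core_proteins and the callables download and resolve_new_id are hypothetical placeholders for the real download and ID-resolution steps.

```python
import os


def fetch_core_proteins(ids, download, resolve_new_id, out_dir, label):
    """Two-pass download sketch (hypothetical, not the PhyloPhlAn source).

    download(uid) -> bool        : True if the sequence was fetched
    resolve_new_id(uid) -> str   : newer ID for uid, or None if unknown
    Returns the list of IDs that could not be downloaded at all.
    """
    # First pass: try every ID directly and remember the failures.
    failed = [uid for uid in ids if not download(uid)]

    # Second pass: map each failed ID to its newer ID and try again.
    not_mapped = []
    for uid in failed:
        new_id = resolve_new_id(uid)
        if new_id is None or not download(new_id):
            not_mapped.append(uid)

    # Record whatever is still missing, one ID per line.
    if not_mapped:
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, label + "_core_proteins_not_mapped.txt")
        with open(path, "w") as handle:
            handle.write("\n".join(not_mapped) + "\n")

    return not_mapped
```

With this structure, an "[e] unable to download" message during the first pass does not by itself mean that the sequence is missing from the final database; only the IDs listed in the not-mapped file are.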

Many thanks, Francesco