biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

Using manually downloaded database #18

Closed alegione closed 4 years ago

alegione commented 4 years ago

Thanks for the great looking tool. My question/issue relates to phylophlan metagenomic: Is there a simple way to download, extract, and point to the database manually?

Have spent the last day trying to get Phylophlan_metagenomic working but keep getting stuck with the database.

My cloud instance can't seem to complete the download from within the program without having the occasional connectivity drop and the download breaking, so I just keep having to restart and hope (so far to no avail)

I can easily manually download the .tar file with wget -c to avoid issues of connectivity loss, but then can't seem to find a way for the tool to see that the database exists
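For readers without wget, the resume behaviour of `wget -c` can be approximated with the Python standard library. This is only a minimal sketch: the function names `build_resume_request` and `resume_download` are hypothetical, and it assumes the server honours HTTP `Range` requests, which is not confirmed here for the database host.

```python
import os
import urllib.request


def build_resume_request(url, dest_path):
    """Return a Request that asks only for the bytes not yet in dest_path."""
    request = urllib.request.Request(url)
    if os.path.exists(dest_path):
        already = os.path.getsize(dest_path)
        if already > 0:
            # Ask the server to continue from the end of the partial file.
            request.add_header("Range", "bytes={}-".format(already))
    return request


def resume_download(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending to a partial file when possible."""
    request = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # Status 206 means the Range header was honoured; otherwise start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling `resume_download` again after a dropped connection continues from the bytes already on disk. If the file is already complete, a server will typically reject the range request with an error, which this sketch does not handle.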

I've tried the following:

phylophlan_metagenomic -i myfolder -o output-folder --nproc 8 -d SGB.DEC19 --database_folder place/with/the/database/

and get the following error:

[e] invalid number of URLs for "SGB.DEC19" in the downloaded file

Looking at the code, I can see a check for whether the database exists, or if the md5 exists

    # True when any of the three expected files is missing from the database folder
    if (not os.path.exists(os.path.join(args.database_folder, args.database)) or
            not os.path.exists(os.path.join(args.database_folder, args.mapping)) or
            not os.path.exists(os.path.join(args.database_folder, args.database + '.md5'))):

Both files should exist (though technically the file is a .tar, so I'm not sure that check would pass), but the program still runs the URL check and fails. Is there a way to download the database manually, set it up, and run the tool without it trying to download everything again?

I'm sure I'm missing something obvious, but just can't work it out

fasnicar commented 4 years ago

Dear Alistair Legione,

Many thanks for reporting this.

One thing I noticed is that the database ID in your example is not correct: it should be SGB.Dec19 (you can obtain the list of available databases with the --database_list param).

So, I assume that the file phylophlan_metagenomic.txt has been successfully downloaded, right? If not you can download it using this URL:

https://www.dropbox.com/s/xdqm836d2w22npb/phylophlan_metagenomic.txt

You can get the 3 URLs for the SGB.Dec19 database from this file, which are:

$ grep SGB.Dec19 phylophlan_metagenomic.txt 
https://www.dropbox.com/s/l73jvga66ql4ows/SGB.Dec19.md5?dl=1    SGB.Dec19.md5
https://www.dropbox.com/s/djm9thsykn9h63s/SGB.Dec19.tar?dl=1    SGB.Dec19.tar
https://www.dropbox.com/s/dw947euykyjeee7/SGB.Dec19.txt.bz2?dl=1        SGB.Dec19.txt.bz2

You can manually download all 3 files with wget (please remember to remove the ?dl=1 from the URL) and save them into a folder of your choice.
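The lookup and URL clean-up described above can be sketched in Python. This is an illustration only: the helper name `urls_for_database` is hypothetical, and it assumes each line of phylophlan_metagenomic.txt is a URL followed by whitespace and a file name, as in the grep output shown. It is not the tool's own parsing code.

```python
def urls_for_database(listing_text, database_id):
    """Return (url_without_query, filename) pairs for one database ID."""
    entries = []
    for line in listing_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        url, filename = fields
        # The match is case-sensitive: "SGB.DEC19" does not match "SGB.Dec19".
        if filename.startswith(database_id + "."):
            # Drop the "?dl=1" query string, as suggested in the reply.
            entries.append((url.split("?", 1)[0], filename))
    return entries


LISTING = """\
https://www.dropbox.com/s/l73jvga66ql4ows/SGB.Dec19.md5?dl=1    SGB.Dec19.md5
https://www.dropbox.com/s/djm9thsykn9h63s/SGB.Dec19.tar?dl=1    SGB.Dec19.tar
https://www.dropbox.com/s/dw947euykyjeee7/SGB.Dec19.txt.bz2?dl=1        SGB.Dec19.txt.bz2
"""

for url, filename in urls_for_database(LISTING, "SGB.Dec19"):
    print(filename, url)
```

With the all-caps ID `SGB.DEC19`, this sketch finds no entries at all, which would be consistent with the "invalid number of URLs" error reported earlier, although the tool's own lookup may work differently.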

Assuming now you downloaded the 3 files for the SGB.Dec19 release into the db folder, you can run:

phylophlan_metagenomic -i input_folder -d SGB.Dec19 --database_folder db/

and phylophlan_metagenomic should correctly detect that the database files are already available inside the db folder and run without downloading anything else.
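Before launching a run, it can help to confirm that all three files are actually in place. The following is a minimal sketch, not part of PhyloPhlAn: the helper name `missing_database_files` is hypothetical, and the three suffixes are simply taken from the file names listed above.

```python
import os

# Suffixes of the three files listed for the SGB.Dec19 release (assumed to
# follow the same pattern for other database IDs).
EXPECTED_SUFFIXES = (".md5", ".tar", ".txt.bz2")


def missing_database_files(database_folder, database_id):
    """Return the expected file names that are absent from database_folder."""
    expected = [database_id + suffix for suffix in EXPECTED_SUFFIXES]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(database_folder, name))
    ]


if __name__ == "__main__":
    missing = missing_database_files("db", "SGB.Dec19")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All database files found.")
```

An empty result means the folder holds all three files for that ID. Because the ID is part of each file name, a wrongly cased ID will normally be reported as missing files on a case-sensitive filesystem.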

Please let me know if you should find any issues with these steps.

Many thanks, Francesco

alegione commented 4 years ago

Thanks Francesco, such a rookie error!

Going back through my history, it seems I originally had the database name correct but had only downloaded the tarball and not the txt file (and, from recollection, was getting an error about not having 3 URLs). Somewhere between downloading the txt file and retyping the command, I switched to all caps for the database name (slap forehead).

Have fixed the spelling of the database and at the moment haven't encountered an error when using the same command structure as my earlier command. Looking forward to seeing the results

Thanks for picking up my problem!

fasnicar commented 4 years ago

Super, glad it helped. I'll close the issue then.

ganiatgithub commented 4 years ago

Hi Francesco,

I'm experiencing a similar issue, with the program trying to download the database files at every run.

I have tried using --databases_folder in my command, but it still seems to start by downloading, and about 8 times out of 10 the download fails, probably because of the connection. Here is my command. I have copied the phylophlan database to the location specified by --databases_folder:

phylophlan \
    --input_folder ./faa \
    -o ./out \
    --nproc 48 \
    --diversity low \
    -d phylophlan \
    --databases_folder /home/Staff/uqgni1/tools/phylophlan/database/phylophlan \
    -f /home/Staff/uqgni1/tools/phylophlan/phylophlan2_configs/protein-tree-updated.cfg \
    --configs_folder /home/Staff/uqgni1/tools/phylophlan/phylophlan2_configs \
    --submat_folder /home/Staff/uqgni1/tools/phylophlan/phylophlan2_substitution_matrices \
    --maas /home/Staff/uqgni1/tools/phylophlan/phylophlan2_substitution_models/phylophlan.tsv \
    -i wgt_v6

Forgive me for using some pp2 config files; they are working and I dare not change them. I'm happy to hear your suggestions, though.

The last lines of the error message read:

Downloading file of size: 0.00 MB
0.01 MB 2685.90 % 7.05 MB/sec 0 min -0 sec
Downloading file of size: 64.05 MB
[e] unable to download "https://www.dropbox.com/s/0h8ugr8hse4zmei/phylophlan.tar?dl=1"

What I want in the end is to tell phylophlan to use the database files I already downloaded.

Kind regards, Gaofeng