biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

setup phylophlan for computer without network access #66

Open EricDeveaud opened 3 years ago

EricDeveaud commented 3 years ago

Hello,

How can I set up the required phylophlan databases for use on a cluster where the compute nodes do not have network access?

I would like to install phylophlan and provide the DBs in a shared folder.

How can I achieve this?

regards

Eric

fasnicar commented 3 years ago

Hello Eric, and thanks for using PhyloPhlAn!

You can use the --databases_folder parameter to specify the path where the database(s) are located.

Many thanks, Francesco

EricDeveaud commented 3 years ago

My concern is what to download in order to provide the two databases, phylophlan and amphora2. Let's say I want to host the databases in /opt/data/phylophlan/3.02. If I understood correctly, I have to download the following files to this directory:

https://www.dropbox.com/s/xdqm836d2w22npb/phylophlan_metagenomic.txt
https://www.dropbox.com/s/l73jvga66ql4ows/SGB.Dec19.md5
https://www.dropbox.com/s/djm9thsykn9h63s/SGB.Dec19.tar
https://www.dropbox.com/s/dw947euykyjeee7/SGB.Dec19.txt.bz2

Is that correct?

regards

Eric

fasnicar commented 3 years ago

Hi Eric,

Got it! The links you provided are for phylophlan_metagenomic; they are not the phylophlan and amphora2 databases.

I think the easiest thing to do is to create a fake input folder with 4 genomes in it and run phylophlan twice from a machine with an internet connection, the first time specifying the phylophlan database and the second time the amphora2 database. At the beginning of the run, PhyloPhlAn will check for the database and automatically download it if it is not present in the --databases_folder.

phylophlan [mandatory_params] -d phylophlan --databases_folder /opt/data/phylophlan/3.02 --verbose
phylophlan [mandatory_params] -d amphora2 --databases_folder /opt/data/phylophlan/3.02 --verbose

Note: You can kill the runs above as soon as the databases are downloaded.

Alternatively, you can download this file: http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan_databases.txt

and then download the two files for each database and store them in the folder you want to use for the databases.

Please, let me know if something is not clear.

Many thanks, Francesco

EricDeveaud commented 3 years ago

Done that and untarred both archives; now phylophlan --database_list shows me the DBs:

[gensoft@db6b0d05cdf9 inst]$ phylophlan --databases_folder /opt/gensoft/data/phylophlan/3.0.2/ --database_list --diversity high
Available databases in "/opt/gensoft/data/phylophlan/3.0.2/":
    amphora2
    phylophlan

NB: having this procedure in the installation instructions would be a plus.

Also, it would be nice to have DATABASES_FOLDER defined via an environment variable.

Something like this in phylophlan.py:

DATABASES_FOLDER = os.environ.get('PHYLOPHLAN_DATABASE_DIR', 'phylophlan_databases')

One could then export PHYLOPHLAN_DATABASE_DIR pointing to the databases directory and have phylophlan find the DBs without having to use the --databases_folder option.
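
Until something like this exists in phylophlan.py, a similar fallback can be approximated from the user side with a small shell wrapper. This is only a sketch: PHYLOPHLAN_DATABASE_DIR is the variable name proposed here, not something PhyloPhlAn is known to read, and resolve_db_dir and phylophlan_db are hypothetical helper names.

```shell
# Sketch: resolve the databases folder from an environment variable.
# PHYLOPHLAN_DATABASE_DIR is the name proposed in this thread (assumption:
# PhyloPhlAn itself does not read it). Helper names are hypothetical.

resolve_db_dir() {
    # Use the variable if it is set and non-empty, otherwise fall back
    # to the default folder name from the suggestion above.
    printf '%s\n' "${PHYLOPHLAN_DATABASE_DIR:-phylophlan_databases}"
}

phylophlan_db() {
    # Forward all arguments, adding the resolved --databases_folder.
    phylophlan --databases_folder "$(resolve_db_dir)" "$@"
}
```

With `export PHYLOPHLAN_DATABASE_DIR=/opt/data/phylophlan/3.02` set in a module file, users would call `phylophlan_db` with their usual arguments.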

What do you think about that?

Also, may I have some information about the reference folder? (Keep in mind I'm not a biologist at all, just in charge of installing and maintaining software on our cluster, so excuse some silly questions ;-))

Can I run phylophlan_get_references -g all -o some_dir and provide those data to our users? Again, having an env var would be nice.

regards

Eric

fasnicar commented 3 years ago

Great!

Yes, I'll add this to the wiki.

About the env variable, I'll implement it in the next release. I believe that for non-computational people the parameter is easier to use, but it should not be too complicated to have both options working.

About phylophlan_get_references -g all -o some_dir: yes, the genomes retrieved that way are all publicly available, so there is no problem in downloading them and then allowing users to access the resource.

Many thanks, Francesco

EricDeveaud commented 3 years ago

hello,

The wiki may require two more pieces of information.

The bz2 files must be decompressed (bunzip2) and concatenated into phylophlan.faa and amphora2.faa respectively, then indexed with diamond.

Note: we provide software and data to our users on a read-only file system, so downloading the db files is not enough. We also need to process them, to avoid 'write permission' errors when users run phylophlan for the first time.

So the full instructions would be:

DB_DIR=/whatever/you/want/to/host/databases
mkdir -p "$DB_DIR"
cd "$DB_DIR"
wget http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan_databases.txt
wget http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/amphora2.tar
wget http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/amphora2.md5
wget http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan.tar
wget http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan.md5
tar xf amphora2.tar
tar xf phylophlan.tar
# decompress and concatenate the bz2 files into one FASTA file per database
bzcat amphora2/*.bz2 > amphora2/amphora2.faa
bzcat phylophlan/*.bz2 > phylophlan/phylophlan.faa
# index with diamond (replace <N> with the number of threads to use)
diamond makedb --threads <N> --in "$DB_DIR/amphora2/amphora2.faa" --db "$DB_DIR/amphora2/amphora2"
diamond makedb --threads <N> --in "$DB_DIR/phylophlan/phylophlan.faa" --db "$DB_DIR/phylophlan/phylophlan"
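
Before the folder is exported read-only, a quick check that each database contains both the concatenated FASTA and the diamond index can catch a missed step. The sketch below assumes the <name>/<name>.faa layout used in the commands above and that diamond makedb writes its index as <name>.dmnd; db_is_prepared is a hypothetical helper name.

```shell
# Sketch: check that a database folder is fully prepared, so that nothing
# has to be written at run time. Assumes the <name>/<name>.faa layout from
# the steps above and an index file named <name>.dmnd.
# db_is_prepared is a hypothetical helper name.

db_is_prepared() {   # db_is_prepared <db_dir> <name>
    local db_dir=$1 name=$2 f
    for f in "$db_dir/$name/$name.faa" "$db_dir/$name/$name.dmnd"; do
        # -s: the file exists and is not empty
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
    done
}

# Usage:
#   for name in amphora2 phylophlan; do db_is_prepared "$DB_DIR" "$name"; done
```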

NB I skipped the md5 check which is pretty obvious ;-)
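
For completeness, the skipped check could look roughly like this. It is a sketch under one assumption that was not verified against the real files: that the first whitespace-separated field of each .md5 file is the hex digest of the matching .tar. verify_md5 is a hypothetical helper name.

```shell
# Sketch: compare an archive's MD5 digest with the one stored in a .md5 file.
# Assumption: the first whitespace-separated field on the first line of the
# .md5 file is the hex digest. verify_md5 is a hypothetical helper name.

verify_md5() {   # verify_md5 <archive> <md5_file>
    local archive=$1 md5_file=$2 expected actual
    expected=$(awk 'NR == 1 { print tolower($1) }' "$md5_file")
    actual=$(md5sum "$archive" | awk '{ print $1 }')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (inside $DB_DIR, after the downloads):
#   verify_md5 amphora2.tar amphora2.md5 || echo "amphora2.tar: checksum mismatch" >&2
#   verify_md5 phylophlan.tar phylophlan.md5 || echo "phylophlan.tar: checksum mismatch" >&2
```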

Should the *.bz2 files be kept?

regards

Eric

EricDeveaud commented 3 years ago

One more question... maybe not... ;-) When using phylophlan_setup_db, e.g. phylophlan_setup_database.py -g s__Staphylococcus_aureus, shouldn't the s__Staphylococcus_aureus db directory be generated in phylophlan_databases by default?

And why is the diamond indexing not carried out by phylophlan_setup_db?

Eric

fasnicar commented 3 years ago

Hi Eric,

I see, and you're right: if the file system is read-only, then one also has to perform the decompression and indexing. I'll add this, thank you. An important thing to remember here, though, is that different diamond versions produce indexed databases that are not compatible with each other. So one has to ensure that the very same version used for indexing is also the one set in the config file when running PhyloPhlAn.
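
One way an administrator might guard against such a mismatch is to record the diamond version next to the index and compare it before each run. This is a sketch, not part of PhyloPhlAn: it assumes that `diamond version` prints a line ending in the version number, and the stamp file name DIAMOND_VERSION and all function names are hypothetical.

```shell
# Sketch: remember which diamond version built an index and compare later.
# Assumptions: `diamond version` prints a line ending in the version number;
# the stamp file name DIAMOND_VERSION and the function names are hypothetical.

diamond_version() {
    # Last field of the first line printed by `diamond version`.
    diamond version 2>/dev/null | awk 'NR == 1 { print $NF }'
}

stamp_index() {             # stamp_index <db_folder> <version>
    printf '%s\n' "$2" > "$1/DIAMOND_VERSION"
}

same_version_as_index() {   # same_version_as_index <db_folder> <version>
    [ -f "$1/DIAMOND_VERSION" ] && [ "$(cat "$1/DIAMOND_VERSION")" = "$2" ]
}

# Usage:
#   after indexing:  stamp_index "$DB_DIR/phylophlan" "$(diamond_version)"
#   before running:  same_version_as_index "$DB_DIR/phylophlan" "$(diamond_version)" \
#                        || echo "diamond differs from the one used for indexing" >&2
```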

For the reason above, phylophlan_setup_database only downloads and prepares the database but does not carry out the indexing: that would also require the configuration of the tool to use for indexing, which is something that depends more on phylophlan.

Many thanks, Francesco

EricDeveaud commented 3 years ago

I will provide phylophlan as a module via Environment Modules. The diamond version will be fixed in our case (2.6), but I understand the point.

But I would say that diamond, from v0.9.25 up to the current release, produces database format version 3 and accepts format versions 2-3; see: https://github.com/bbuchfink/diamond/wiki/5.-Advanced-topics. Requiring diamond >=0.9.25 as a dependency would therefore solve the problem. Nowadays the latest diamond version is 2.10.
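
If the dependency is expressed as a minimum version like this, a check along the following lines could go in a module file or wrapper script. It is a sketch: the 0.9.25 threshold is simply the one quoted in this comment, version_ge is a hypothetical helper name, and the comparison relies on the -V (version sort) option of GNU sort.

```shell
# Sketch: test whether a version string meets a minimum version.
# Relies on GNU sort's -V (version sort); version_ge is a hypothetical name.

version_ge() {   # version_ge <version> <minimum>
    # True when <minimum> sorts first, i.e. <version> is equal or newer.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (threshold taken from the comment above):
#   version_ge "$(diamond version | awk '{ print $NF }')" 0.9.25 \
#       || echo "diamond is older than 0.9.25" >&2
```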

regards

Eric