carolzhou / multiPhATE2

multiPhATE with comparative genomics

multiPhATE2 program in a multi-user environment #41

Closed ricmedveterinario closed 1 year ago

ricmedveterinario commented 2 years ago

Hi @carolzhou,

We are using the multiPhATE2 program in a multi-user environment (a university cluster).

We encountered some issues that make installation difficult.

We traced the dbPrep_getDBs.py script, simulating what should happen. The script fails to download 2 files: "all.fna.tar.gz" (from NCBI virus genome) and "all.faa.tar.gz" (from NCBI virus protein). We're not sure, but apparently these .tar.gz files contained the files that are now being made available separately:

• NCBI virus protein
  o viral.1.1.protein.fna.gz | viral.2.1.protein.fna.gz | viral.3.1.protein.fna.gz | viral.4.1.protein.fna.gz
• NCBI virus genome
  o viral.1.1.genomic.fna | viral.2.1.genomic.fna | viral.3.1.genomic.fna | viral.4.1.genomic.fna

While tracing, we saw that the script just extracted these files, concatenated the parts of each database into a single file, and formatted that single file with the makeblastdb command:

Formatting Virus Genome database for blast.

makeblastdb -dbtype nucl -in ncbiVirusGenomes.fasta

Formatting Virus Protein database for blast.

makeblastdb -dbtype prot -in ncbiVirusProteins.faa

This is what we did with the downloaded files: we generated the single files Virus_Genome/ncbiVirusGenomes.fasta and Virus_Protein/ncbiVirusProteins.faa and applied the makeblastdb command to each database as above.
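
For anyone repeating this by hand, the concatenate-then-format steps can be sketched in Python roughly as follows. This is only an illustration of the workaround described here, not code from dbPrep_getDBs.py: the helper names `concatenate_parts` and `makeblastdb_command` are hypothetical, and the sketch assumes the split files have already been downloaded into a local directory.

```python
import gzip
from pathlib import Path


def concatenate_parts(part_paths, out_path):
    """Concatenate FASTA parts (plain or gzip-compressed) into one plain file.

    Hypothetical helper, not part of multiPhATE2.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as out:
        for part in part_paths:
            part = Path(part)
            # Decompress on the fly when the part is a .gz file.
            opener = gzip.open if part.suffix == ".gz" else open
            last = b"\n"
            with opener(part, "rb") as src:
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next part's header starts on its own line.
            if last != b"\n":
                out.write(b"\n")
    return out_path


def makeblastdb_command(dbtype, fasta_path):
    """Build the makeblastdb call quoted in this comment as an argument list."""
    return ["makeblastdb", "-dbtype", dbtype, "-in", str(fasta_path)]


# Example use (paths are assumptions; requires BLAST+ on the PATH):
#   import subprocess
#   parts = sorted(Path("downloads").glob("viral.*.genomic.fna*"))
#   fasta = concatenate_parts(parts, "Virus_Genome/ncbiVirusGenomes.fasta")
#   subprocess.run(makeblastdb_command("nucl", fasta), check=True)
```

The same two calls with `"prot"` and Virus_Protein/ncbiVirusProteins.faa would cover the protein database.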

The database check we ran passed without errors. If you can, we also suggest not generating the files VOGs/vog.gene.headers.lst and VOGs/vog.protein.headers.lst in the Databases directory, as this makes it difficult to share the data in this directory in a multi-user environment.

Thanks, and best regards.

carolzhou commented 2 years ago

Are these split files available on NCBI's ftp site? I am not finding them:

viral.1.1.protein.fna.gz | viral.2.1.protein.fna.gz | viral.3.1.protein.fna.gz | viral.4.1.protein.fna.gz
• NCBI virus genome
  o viral.1.1.genomic.fna | viral.2.1.genomic.fna | viral.3.1.genomic.fna | viral.4.1.genomic.fna

NCBI is enabling download (by hand) via their website, here: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&SourceDB_s=RefSeq

Lastly, why is it difficult to share data in the Database directory when the headers list file is present?