HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore the phylogenetic composition of an Illumina (meta)genomic dataset.
GNU General Public License v3.0

phyloFlash_makedb.pl gets stuck with PR2 data base #152

Closed: mmcardozo closed this issue 3 years ago

mmcardozo commented 3 years ago

Hi all, I wanted to run phyloFlash with the PR2 database instead of SILVA; I believe this is possible. I modified the file accordingly and tried to create the database as described in the instructions:

phyloFlash_makedb.pl --univec_file /home/ollie/mcardozo/databases/UniVec -overwrite -log makedb.log --silva_file /home/ollie/mcardozo/databases/SILVA_414_pr2_version.fasta

It seems to work, but it has been running for more than 3 days and the job gets killed when it hits the time limit. Perhaps I did something wrong? Is there a way to make this run faster?

Here is the log file: db.txt

Many thanks in advance, Magda

kbseah commented 3 years ago

Hello Magda, how big is the PR2 database? One possibility is that the database is simply too big and should be deduplicated or clustered, although for eukaryotic sequences this is unlikely.

For comparison, the SILVA SSU Ref NR 99 database has about 500 k entries: https://www.arb-silva.de/documentation/release-1381/
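To compare a custom database against that figure, it can help to count the entries and check for repeated identifiers before running phyloFlash_makedb.pl. The following is a minimal Python sketch, not part of phyloFlash. It assumes a plain FASTA file in which the sequence ID is the first whitespace-delimited token of each header line; the function names are made up for illustration.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield the ID of each FASTA record.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def summarize_ids(lines):
    """Return (total number of records, {id: count} for IDs seen more than once)."""
    counts = Counter(fasta_ids(lines))
    total = sum(counts.values())
    duplicated = {seq_id: n for seq_id, n in counts.items() if n > 1}
    return total, duplicated
```

Usage would look something like `with open(path) as handle: total, duplicated = summarize_ids(handle)`, where `path` points to the FASTA file passed to `--silva_file`. A non-empty `duplicated` dictionary suggests the file needs cleaning before the database is built.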

Could you please attach the log file tmp.bbmask_mask_repeats.log, if it is not empty?

mmcardozo commented 3 years ago

Hi, the PR2 database is 298M.

Attached log file: tmp.bbmask_mask_repeats.log.txt

kbseah commented 3 years ago

Looks like there was a duplicate entry in the database, if you look in the log file that you attached. Could you remove such duplicates and try again?
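One way to remove such duplicates is sketched below in Python. This is only an illustration under stated assumptions, not a phyloFlash utility: it treats two entries as duplicates when they share the same ID (the first whitespace-delimited token of the header) and keeps the first occurrence. The function names and the 80-column line width are arbitrary choices.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_by_id(records):
    """Keep only the first record for each ID.

    Assumption: the ID is the first whitespace-delimited token of the header.
    """
    seen = set()
    for header, seq in records:
        fields = header.split()
        seq_id = fields[0] if fields else ""
        if seq_id in seen:
            continue
        seen.add(seq_id)
        yield header, seq


def format_fasta(records, width=80):
    """Yield FASTA-formatted lines, wrapping sequences at `width` characters."""
    for header, seq in records:
        yield f">{header}\n"
        for start in range(0, len(seq), width):
            yield seq[start:start + width] + "\n"
```

To apply it, open the input and output files and call `out.writelines(format_fasta(dedupe_by_id(read_fasta(inp))))`, then pass the deduplicated file to `--silva_file`. If entries with different IDs but identical sequences also need to be collapsed, the key stored in `seen` would have to be the sequence rather than the ID.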

mmcardozo commented 3 years ago

Hi, yes, I had several repeated entries in the FASTA file. After removing them it worked and took a lot less time. Many thanks! Magda

kbseah commented 3 years ago

Thanks for letting us know! Could you please close this issue? If a related problem comes up you can always open it again.