HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset.
GNU General Public License v3.0

Does not recognize my own database #115

Closed SumTot closed 4 years ago

SumTot commented 4 years ago

Hello! I am trying to run an analysis with phyloFlash using my own database (in SILVA format), but it is not recognized as a valid database.

When I try to run the analysis, I get this error: "Failed to find a suitable DBHOME. Please provide a path using -dbhome. You can build a reference database using phyloFlash_makedb.pl"

I tried to indicate the path using -dbhome, and I also tried building the database with phyloFlash_makedb.pl, but since it's not a real SILVA database it is not recognized. What can I do to use my own database? The name of my database is SILVA_OWN.fasta.gz. Is there a specific script I should have used to build it? Thank you very much!

kbseah commented 4 years ago

Hello @SumTot , You'll have to make sure that each entry in the database contains a taxonomy string, formatted just like the SILVA database, then use phyloFlash_makedb.pl to build the database with the --silva_file option. You'll still need a copy of the univec file, too. More details are here under section 4.3: https://hrgv.github.io/phyloFlash/install.html Hope this helps! -- Brandon
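Roughly, a database entry and the build command could look something like the sketch below. The accession and taxonomy string are made-up placeholders; the exact header format phyloFlash expects is described in the linked docs.

```bash
# Each FASTA header: accession plus a SILVA-style, semicolon-separated taxonomy string
# (placeholder values only):
#   >AB123456.1.1500 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli

# Build the reference database from the custom SILVA-formatted file plus a local UniVec copy
phyloFlash_makedb.pl \
    --silva_file SILVA_OWN.fasta.gz \
    --univec_file /path/to/UniVec
```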

SumTot commented 4 years ago

Thanks Brandon. I needed to include a version number in the name of my database (a simple rename, sketched in the PS below); now it's recognized :) However, I get a different error now during the build: FATAL: Tool execution failed! Error was 'No such file or directory' and return code 256. Aborting.

I have checked the log files, and tmp.vsearch_make_udb.log shows this warning: "WARNING: The makeudb_usearch command does not support multithreading. Only 1 thread used. Fatal error: Unable to read from file (./1//SILVA_SSU.noLSU.masked.trimmed.fasta)". Could that be the problem? How can I solve it? Thanks!
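PS, in case anyone else hits the first error: the fix was simply renaming the database file so that its name includes a version number, along these lines (the target name below is only an illustration; the file I actually used appears in a later comment):

```bash
# Rename the custom database so its file name carries a SILVA-style version number
mv SILVA_OWN.fasta.gz SILVA_1_OWN.fasta.gz
```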

kbseah commented 4 years ago

Hello @SumTot , could you please post the results of ls -hl ./1, as well as the full contents of the log file? This will help to diagnose which step failed.

SumTot commented 4 years ago

Hi @kbseah , the results of ls -hl ./1 are:

total 4,0K
-rw-rw-r-- 1 bea** bea  0 Jun 3 11:50 SILVA_SSU.noLSU.masked.trimmed.fasta
-rw-rw-r-- 1 bea bea 95 Jun 3 11:50 SILVA_SSU.noLSU.masked.trimmed.fasta.UniVec_contamination_stats.txt
-rw------- 1 bea bea  0 Jun 3 11:50 SILVA_SSU.noLSU.masked.trimmed.udb

Log file content:

WARNING: The makeudb_usearch command does not support multithreading. Only 1 thread used.
vsearch v2.14.2_linux_x86_64, 62.8GB RAM, 56 cores
https://github.com/torognes/vsearch

Fatal error: Unable to read from file (./1//SILVA_SSU.noLSU.masked.trimmed.fasta)

kbseah commented 4 years ago

Hm, it looks like the files are all empty, which is strange. Sorry, by log file I meant the makedb log, not the tmp.vsearch_make_udb.log file. What was the command line that you used to run phyloFlash_makedb.pl?

Could you try re-running with the options -keep (don't delete temporary files), -overwrite (rerun from scratch instead of trying to restart from potentially corrupted intermediate output), and -log makedb.log (write the makedb log to a file instead of only to the screen)?
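For example, something along these lines (the database and UniVec file names are placeholders for your own copies):

```bash
# Rebuild from scratch, keep intermediate files, and write the log to makedb.log
# (SILVA_1_MYDB.fasta.gz and the UniVec path are placeholders)
phyloFlash_makedb.pl \
    --univec_file /path/to/UniVec \
    --silva_file SILVA_1_MYDB.fasta.gz \
    -keep -overwrite -log makedb.log
```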

Just to check: the sequences that you are interested in are not LSU rRNA sequences by any chance, are they?

SumTot commented 4 years ago

Hi @kbseah, the command line that I used to run phyloFlash_makedb.pl is: phyloFlash_makedb.pl --univec_file /home/Databases/Univec.txt --silva_file /home/anaconda2/lib/phyloFlash/SILVA_1_ITS2SPB.fasta.gz

[I also tried the univec file with the extension fasta (Univec.fa)].

Now I have added the options that you suggested, and this is the error (it appears in the phyloFlash_log_on_error file):

[06:20:35] Saving log to file phyloFlash_log_on_error
[06:21:59] Checking for required tools.
[06:21:59] Using barrnapHGV found at "/home/anaconda2/lib/phyloFlash/barrnap-HGV/bin/barrnap_HGV".
[06:21:59] Using bbmap found at "/home/anaconda2/bin/bbmap.sh".
[06:21:59] Using bowtiebuild found at "/home/anaconda2/bin/bowtie-build".
[06:21:59] Using vsearch found at "/home/anaconda2/bin/vsearch".
[06:21:59] Using bbmask found at "/home/anaconda2/bin/bbmask.sh".
[06:21:59] Using grep found at "/bin/grep".
[06:21:59] Using bbduk found at "/home/anaconda2/bin/bbduk.sh".
[06:21:59] All required tools found.
[06:21:59] using local copy of univec: /home/Databases/Univec.fa
[06:21:59] using local copy of Silva SSU RefNR: /home/anaconda2/lib/phyloFlash/SILVA_1_ITS2SPB.fasta.gz
[06:21:59] unpacking SILVA database
[06:21:59] searching for LSU contamination in SSU RefNR
[06:21:59] running subcommand: /home/anaconda2/lib/phyloFlash/barrnap-HGV/bin/barrnap_HGV --kingdom bac --threads 56 --evalue 1e-10 --gene lsu --reject 0.01 ./1/SILVA_SSU.fasta >tmp.barrnap_hits.bac.gff 2>tmp.barrnap_hits.bac.barrnap.out
[06:22:01] running subcommand: /home/anaconda2/lib/phyloFlash/barrnap-HGV/bin/barrnap_HGV --kingdom arch --threads 56 --evalue 1e-10 --gene lsu --reject 0.01 ./1/SILVA_SSU.fasta >tmp.barrnap_hits.arch.gff 2>tmp.barrnap_hits.arch.barrnap.out
[06:22:02] running subcommand: /home/anaconda2/lib/phyloFlash/barrnap-HGV/bin/barrnap_HGV --kingdom euk --threads 56 --evalue 1e-10 --gene lsu --reject 0.01 ./1/SILVA_SSU.fasta >tmp.barrnap_hits.euk.gff 2>tmp.barrnap_hits.euk.barrnap.out
[06:22:04] Removing sequences with potential LSU contamination
[06:22:04] Number of sequences to skip: 47
[06:22:04] masking low entropy regions in SSU RefNR
[06:22:04] running subcommand: /home/anaconda2/bin/bbmask.sh overwrite=t -Xmx10g threads=56 in=./1//SILVA_SSU.noLSU.fasta out=./1//SILVA_SSU.noLSU.masked.fasta minkr=4 maxkr=8 mr=t minlen=20 minke=4 maxke=8 fastawrap=0 2>tmp.bbmask_mask_repeats.log
[06:22:06] removing UniVec contamination in SSU RefNR
[06:22:06] running subcommand: /home/anaconda2/bin/bbduk.sh ref=/home/Databases/Univec.fa overwrite=t -Xmx10g threads=56 fastawrap=0 ktrim=r ow=t minlength=800 mink=11 hdist=1 in=./1//SILVA_SSU.noLSU.masked.fasta out=./1//SILVA_SSU.noLSU.masked.trimmed.fasta stats=./1//SILVA_SSU.noLSU.masked.trimmed.fasta.UniVec_contamination_stats.txt 2>tmp.bbduk_remove_univec.log
[06:22:13] Vsearch v2.5.0+ found, will index database to UDB file
[06:22:13] Indexing ./1//SILVA_SSU.noLSU.masked.trimmed.fasta to make UDB file ./1//SILVA_SSU.noLSU.masked.trimmed.udb with Vsearch
[06:22:13] running subcommand: /home/anaconda2/bin/vsearch --threads 56 --notrunclabels --makeudb_usearch ./1//SILVA_SSU.noLSU.masked.trimmed.fasta --output ./1//SILVA_SSU.noLSU.masked.trimmed.udb 2>tmp.vsearch_make_udb.log
[06:22:13] FATAL: Tool execution failed!. Error was '' and return code '256' Aborting.
[06:22:13] Saving log to file phyloFlash_log_on_error

The database is composed of ITS sequences, not LSU.

kbseah commented 4 years ago

Sorry that this keeps causing problems. Could you post the contents of log file tmp.vsearch_make_udb.log?

How many sequences are in the file SILVA_1_ITS2SPB.fasta.gz? The scan for LSU contamination removed 47 sequences, and I want to check that this step didn't remove most of your sequences.
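If it helps, the sequences can be counted with grep/zgrep as sketched below. The intermediate file paths are taken from the makedb log you posted and will only still be on disk if the build was run with -keep.

```bash
# Sequences in the input database (gzipped FASTA)
zgrep -c '^>' SILVA_1_ITS2SPB.fasta.gz

# Sequences surviving each step of the build (paths from the makedb log)
grep -c '^>' ./1/SILVA_SSU.fasta
grep -c '^>' ./1/SILVA_SSU.noLSU.fasta
grep -c '^>' ./1/SILVA_SSU.noLSU.masked.fasta
grep -c '^>' ./1/SILVA_SSU.noLSU.masked.trimmed.fasta
```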

SumTot commented 4 years ago

Well... I think we will figure out what is happening! :) Thanks for trying to help!

Content of tmp.vsearch_make_udb.log:

WARNING: The makeudb_usearch command does not support multithreading. Only 1 thread used.
vsearch v2.14.2_linux_x86_64, 62.8GB RAM, 56 cores
https://github.com/torognes/vsearch

Fatal error: Unable to read from file (./1//SILVA_SSU.noLSU.masked.trimmed.fasta)

The file contains more than 18,000 sequences, so I guess I still have enough.

kbseah commented 4 years ago

Hm okay so that hunch wasn't right...

If you are okay with it, could you send me the first 1000 sequences or so in the file, via Dropbox or similar? I could try to reproduce the problem on my side. My email address is kb.seah@tuebingen.mpg.de
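In case it's handy, one way to pull out roughly the first 1000 records from the gzipped file with plain zcat/awk (the output file name is just an example):

```bash
# Write the first 1000 FASTA records of the gzipped database to a plain FASTA file
zcat SILVA_1_ITS2SPB.fasta.gz \
    | awk '/^>/ { n++ } n > 1000 { exit } { print }' \
    > first_1000_records.fasta
```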

kbseah commented 4 years ago

@HRGV could you please close this issue? It has been resolved.

SumTot commented 4 years ago

I want to thank @kbseah for all the help. It helped me solve the problem very well :)