ComparativeSystemsBiologyGroup / SeqDex


Run failure #1

Open sihellem opened 4 years ago

sihellem commented 4 years ago

Hello,

I tried to run SeqDex on a cluster, using the contigs.fasta produced by SPAdes and the SAM alignment from Bowtie2 as input, and I got the following error:

[bam_sort_core] merging from 0 files and 10 in-memory blocks...
[bam_sort_core] merging from 10 files and 10 in-memory blocks...
Error in data.frame(Contig = rRNA16sTaxonomy2[, 2], TaxonDensity = 1,  : 
  arguments imply differing number of rows: 0, 1
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../output/superkingdomOutput.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'
Error in data.frame(Contig = rRNA16sTaxonomy2[, 2], TaxonDensity = 1,  : 
  arguments imply differing number of rows: 0, 1
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../output/superkingdomOutput.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'

Here is the beginning of the SeqDex.sh file, in case it helps:

#######################################################################################################
#MANDATORY variables: these have to be assigned to run SeqDex
THREADS=10
#path to folder containing blast database files as downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/
#or any other custom database with sequence titles fulfilling NCBI sequence titles formatting rules
NT=/work/BourguignonU/db/nt_2019/ 
#file name of the blast database in $NT
NTI=nt 
#path to blast database built using $RDPF downloaded from https://rdp.cme.msu.edu/misc/resources.jsp
#or any other custom database with sequence titles fulfilling RDP sequence titles formatting rules
RDP=/work/BourguignonU/SimonH/RDP_Bact_db/
#fasta file used to build $RDPI
RDPF=current_Bacteria_unaligned.fa
#file name of the blast database in $RDP
RDPI=current_Bacteria_unaligned.fa
#home folder of SeqDex
SCRIPT=/home/s/simon-hellemans/Programs/
#taxonomy information to be used. Can be NT of NTNR
TAX=NT
#machine learning algorithm to be used. Can be RF, SVM, BOTH
MLALG=BOTH
#target taxonomy name to be used to identify the target cluster
TRG=Alphaproteobacteria
#if it is equal to 'yes', then SeqDex perform the final clustering step
CLUSTER=yes
#Taxonomy level/s to be used in SeqDex. One level for only one iteration; a comma separated list (without spaces) for more than a iteration
ITER=superkingdom
#Taxonomy category target for each level defined in $ITER
ITERTRG=bacteria

Do you have any idea what went wrong?

Thanks in advance for your response, Cheers, Simon

ComparativeSystemsBiologyGroup commented 4 years ago

Hello Simon, sorry for the late response.

Concerning the second part of the error:

Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../output/superkingdomOutput.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'
Error in data.frame(Contig = rRNA16sTaxonomy2[, 2], TaxonDensity = 1, :
  arguments imply differing number of rows: 0, 1
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../output/superkingdomOutput.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'

There was an error in the SeqDex.sh file. I have fixed it, so you can just download this file, put it in your SeqDex folder in place of the old one, and use it.

Concerning the error

Error in data.frame(Contig = rRNA16sTaxonomy2[, 2], TaxonDensity = 1, :
  arguments imply differing number of rows: 0, 1

do you have the rRNA16sTaxonomy2.txt file in the Taxonomy folder? Also, in your SeqDex.sh file you have

#file name of the blast database in $RDP
RDPI=current_Bacteria_unaligned.fa

but in this field you have to put the base name of the BLAST database built on the unaligned RDP 16S database. Is the name of the file correct?
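
One way to see which base names a database folder actually offers is to list the BLAST index files and strip their extension. The helper below is only a sketch, not part of SeqDex: `list_blast_basenames` is a hypothetical name, and it assumes a single-volume nucleotide database, for which makeblastdb writes one `.nin` index file per database.

```shell
# Hypothetical helper (not part of SeqDex): print every candidate value
# for RDPI found in a database folder.
# Assumption: single-volume nucleotide database, one .nin file per database.
list_blast_basenames() {
    local dir="$1"
    local f
    for f in "$dir"/*.nin; do
        # When nothing matches, the glob stays literal, so skip it.
        [ -e "$f" ] || continue
        # Strip the folder and the .nin suffix to get the base name.
        basename "$f" .nin
    done
}
```

Calling `list_blast_basenames "$RDP"` prints one base name per line; whatever it prints is what RDPI would have to be set to under the assumption above.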

sihellem commented 4 years ago

Hello,

Thank you for your response. There was indeed an error for the base name of the database for 16S.

I ran it again and experienced another error (see below). It seems SeqDex does not find the taxonomy sql file from taxonomizr, which was built as indicated in the documentation.

Is there a way around the problem?

Best, Simon

Error in if (!file.exists(sqlFile)) stop(sqlFile, " does not exist.") :
  argument is of length zero
Execution halted
Error in read.table(opt$taxaRDP, header = FALSE, stringsAsFactors = FALSE,  :
  no lines available in input
Execution halted
mkdir: cannot create directory ‘SVMoutput’: File exists
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../Taxonomy/superkingdomTaxonomyIteration.txt': No such file or directory
Execution halted
mkdir: cannot create directory ‘ClusteringOutputSVM’: File exists
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../SVMoutput/superkingdomOutputSVM.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'
mkdir: cannot create directory ‘RFoutput’: File exists
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../Taxonomy/superkingdomTaxonomyIteration.txt': No such file or directory
Execution halted
mkdir: cannot create directory ‘ClusteringOutputRF’: File exists
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../RFoutput/superkingdomOutputRF.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'

ComparativeSystemsBiologyGroup commented 4 years ago

Hi Simon, I see a few small errors.

First, SeqDex searches for the taxonomizr file (named exactly "accessionTaxa.sql") locally. If you have moved the file to an external drive, SeqDex will not find it. I will improve this in the near future, but for now you can either move the sql file back to the main drive or manually change the path where SeqDex searches for it. In the Func.R file, edit lines 71-72:

path <- list.files(path= "~", full.names=TRUE, recursive=TRUE,pattern="(accessionTaxa.sql)")

and replace "~" with the path to the external drive. It should look something like this:

path <- list.files(path= "/Volumes/external_HD_name", full.names=TRUE, recursive=TRUE,pattern="(accessionTaxa.sql)")
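
Before editing Func.R, it can help to confirm from the shell where the file really is. The function below is a rough shell counterpart of the `list.files` call above, not SeqDex code: `find_accession_db` is a hypothetical name. Note that `find -name` matches the whole file name exactly, whereas the R `pattern` argument is a regular expression that also matches longer names containing that string.

```shell
# Hypothetical helper: search a directory tree for the taxonomizr database.
# $1 is the root folder to search, e.g. the partition the file was moved to.
find_accession_db() {
    # Permission errors on unreadable folders are silenced.
    find "$1" -type f -name 'accessionTaxa.sql' 2>/dev/null
}
```

If `find_accession_db /some/root` prints nothing, pointing `list.files` at the same root will not find the file either.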

Second, SeqDex does not find a file needed for the 16S taxonomy, called "RDP16s_taxa_mod.txt". I cannot see why, because SeqDex should rerun the 16S part whenever this file is not in the right location, unless you have commented out part of the SeqDex.sh file and then removed "RDP16s_taxa_mod.txt". Do you have "RDP16s_taxa_mod.txt" in the Taxonomy folder?

The other messages appear either because the folders that SeqDex tries to create already exist, or because tasks cannot be completed without the taxonomy files.
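
The "File exists" messages from mkdir are harmless on a rerun. As a sketch of how they could be avoided (this is not how SeqDex.sh is written; `ensure_output_dirs` is a hypothetical name), `mkdir -p` succeeds silently when the folder is already there. The folder names are the ones that appear in the log above.

```shell
# Hypothetical helper: create the output folders seen in the log,
# without complaining when they are left over from an earlier run.
# $1 is the working directory of the run.
ensure_output_dirs() {
    local base="$1"
    local d
    for d in SVMoutput ClusteringOutputSVM RFoutput ClusteringOutputRF; do
        # -p: no error if the folder exists, parents created as needed.
        mkdir -p "$base/$d" || return 1
    done
}
```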

Best, Alice

sihellem commented 4 years ago

Dear Alice,

Thank you for your reply. I did indeed have to move "accessionTaxa.sql" to the cluster /work/ partition, so I will specify the path in Func.R.

I confirm that I did not touch anything in the Taxonomy folder. However, as you can see below, the file you mention is there, along with others, but it is empty.

 8.7K Feb 24 01:06 16sContigs.fasta
  134 Feb 24 01:06 16scontigsName.txt
    0 Feb 24 01:06 16sContigvsRDP.txt
 2.3K Feb 24 01:06 barrnap16s_contigs.gff
  30G Feb 24 01:04 ContigsvsNt.txt
    0 Feb 24 01:06 RDP16s_taxa_mod.txt
    0 Feb 24 01:06 RDP16s_taxa.txt
    0 Feb 24 01:06 RDP16s.txt

Any idea why?

Best, Simon

ComparativeSystemsBiologyGroup commented 4 years ago

Dear Simon,

I suppose that there is again a problem with the variable used for the 16S part. In the SeqDex.sh file you have

#path to blast database built using $RDPF downloaded from https://rdp.cme.msu.edu/misc/resources.jsp
#or any other custom database with sequence titles fulfilling RDP sequence titles formatting rules
RDP=~/database/rdp16s
#fasta file used to build $RDPI
RDPF=current_Bacteria_unaligned.fa
#file name of the blast database in $RDP
RDPI=rdp16S

Have you completed these fields correctly?

You mention that other files are empty too; which ones?
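
A quick way to answer the last question is to let `find` list the zero-byte files. This is only a sketch: `list_empty_files` is a hypothetical name, and `-maxdepth` and `-empty` are GNU/BSD extensions to `find` rather than POSIX, which should be fine on a typical Linux cluster.

```shell
# Hypothetical helper: list the zero-byte regular files directly inside
# a folder (no recursion), sorted by name.
list_empty_files() {
    find "$1" -maxdepth 1 -type f -empty | sort
}
```

For example, `list_empty_files Taxonomy` run from the SeqDex working directory prints one path per empty file.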

Alice

sihellem commented 4 years ago

Dear Alice,

Here is the log from the run after modifying the path to sql database:

[bam_sort_core] merging from 10 files and 10 in-memory blocks...
[bam_sort_core] merging from 10 files and 10 in-memory blocks...
BLAST Database error: No alias or index file found for nucleotide database [/work/TEAM/SimonH/databases/RDP_Bact_db/current_Bacteria_unaligned] in search path [/work/TEAM/SimonH/Neotropical/seqdex/seqdex_bwa/Taxonomy::]
Error in read.table(opt$taxaRDP, header = FALSE, stringsAsFactors = FALSE,  : 
  no lines available in input
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../Taxonomy/rRNA16sTaxonomy2.txt': No such file or directory
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../SVMoutput/superkingdomOutputSVM.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../Taxonomy/rRNA16sTaxonomy2.txt': No such file or directory
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../RFoutput/superkingdomOutputRF.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'

I confirm there is no error in the SeqDex file:

#path to blast database built using $RDPF downloaded from https://rdp.cme.msu.edu/misc/resources.jsp
#or any other custom database with sequence titles fulfilling RDP sequence titles formatting rules
RDP=/work/TEAM/SimonH/databases/RDP_Bact_db
#fasta file used to build $RDPI
RDPF=current_Bacteria_unaligned.fa
#file name of the blast database in $RDP
RDPI=current_Bacteria_unaligned

The database was built in the specified location:

[simon-hellemans@sango-login2 RDP_Bact_db]$ pwd
/work/TEAM/SimonH/databases/RDP_Bact_db
[simon-hellemans@sango-login2 RDP_Bact_db]$ ls
current_Bacteria_unaligned.fa      current_Bacteria_unaligned.fa.nin  current_Bacteria_unaligned.fa.nsd  current_Bacteria_unaligned.fa.nsq
current_Bacteria_unaligned.fa.nhr  current_Bacteria_unaligned.fa.nog  current_Bacteria_unaligned.fa.nsi  current_Bacteria_unaligned.gb

In Taxonomy/, empty files are the following:

    0 Feb 24 01:06 16sContigvsRDP.txt
    0 Feb 24 01:06 RDP16s_taxa_mod.txt
    0 Feb 24 01:06 RDP16s_taxa.txt
    0 Feb 24 01:06 RDP16s.txt

Sorry if I am missing something here.

Best, Simon

ComparativeSystemsBiologyGroup commented 4 years ago

Dear Simon, the error still says that BLAST cannot find the database.

I suppose this is because you wrote RDPI=current_Bacteria_unaligned in the SeqDex.sh file, whereas according to the ls output it should be RDPI=current_Bacteria_unaligned.fa.

BLAST is searching for files named current_Bacteria_unaligned.nhr, current_Bacteria_unaligned.nin, current_Bacteria_unaligned.nog, and so on, but it cannot find them because your database files are named current_Bacteria_unaligned.fa.nhr, current_Bacteria_unaligned.fa.nin, current_Bacteria_unaligned.fa.nog, etc.
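
The mismatch described above can be checked before launching a full run. The function below is a minimal sketch, not part of SeqDex: `check_blast_db` is a hypothetical name, and it only tests for the three core files of a single-volume nucleotide database (`.nhr`, `.nin`, `.nsq`). Multi-volume databases use an alias file instead and would need a different check.

```shell
# Hypothetical helper: verify that "$RDP/$RDPI" resolves to BLAST files.
# $1 = database folder (RDP), $2 = database base name (RDPI).
# Assumption: single-volume nucleotide database.
check_blast_db() {
    local dir="$1"
    local base="$2"
    local ext
    local missing=0
    for ext in nhr nin nsq; do
        if [ ! -f "$dir/$base.$ext" ]; then
            echo "missing: $dir/$base.$ext" >&2
            missing=1
        fi
    done
    # Exit status 0 when all three files exist, 1 otherwise.
    return "$missing"
}
```

With the folder listing shown earlier, `check_blast_db "$RDP" current_Bacteria_unaligned.fa` would succeed, while `check_blast_db "$RDP" current_Bacteria_unaligned` would report the three missing files.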

Please check this point and let me know.

Best, Alice

sihellem commented 4 years ago

Dear Alice,

Sorry for my late answer. Things are getting better, but it is still not working. I set RDPI back to the value I had initially given it, as you also suggested in your last message.

Now the files in Taxonomy/ are not empty, good!

        8894  2 avr 09:24 16sContigs.fasta
         134  2 avr 09:24 16scontigsName.txt
      107214  2 avr 09:24 16sContigvsRDP.txt
        2355  2 avr 09:24 barrnap16s_contigs.gff
 32149827807  2 avr 09:23 ContigsvsNt.txt
      193149  2 avr 09:25 RDP16s_taxa_mod.txt
      194164  2 avr 09:25 RDP16s_taxa.txt
       11407  2 avr 09:24 RDP16s.txt
         158  2 avr 10:55 rRNA16sTaxonomy2.txt
      146122  2 avr 10:54 superkingdomTaxonomyIteration.txt

I just checked, and both the SVMoutput/ and RFoutput/ folders are completely empty; as a result, the fasta files written to ClusteringOutputSVM/ and ClusteringOutputRF/ are empty as well.

Here are the remaining errors:

[bam_sort_core] merging from 10 files and 10 in-memory blocks...
[bam_sort_core] merging from 10 files and 10 in-memory blocks...
Error in data.frame(Contig = rRNA16sTaxonomy2[, 2], TaxonDensity = 1,  :
  arguments imply differing number of rows: 0, 1
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../SVMoutput/superkingdomOutputSVM.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'
Error in data.frame(Contig = rRNA16sTaxonomy2[, 2], TaxonDensity = 1,  :
  arguments imply differing number of rows: 0, 1
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../RFoutput/superkingdomOutputRF.txt': No such file or directory
Execution halted
[E::stk_subseq] failed to read the list of regions in file 'OutputClustering.txt'

This seems to point to a problem with the file rRNA16sTaxonomy2.txt in the Taxonomy/ folder. I just checked, and this file is effectively empty (it only contains the header line). The same is true of the file 16scontigsName.txt in that folder.

However, all the other files in this folder contain analysis results. Could it be that the data submitted to SeqDex somehow cannot be passed on to the next step?
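
A file that holds only a header has a non-zero size, so it does not show up as empty in a directory listing. The sketch below counts the rows after the first line; `data_rows` is a hypothetical name, and it assumes that the first line of the file is a header, as reported above for rRNA16sTaxonomy2.txt.

```shell
# Hypothetical helper: number of lines after the header line.
# Assumption: the first line of the file is a header.
data_rows() {
    # NR is the total line count at END; subtract the header if present.
    awk 'END { print (NR > 1 ? NR - 1 : 0) }' "$1"
}
```

A result of 0 from `data_rows Taxonomy/rRNA16sTaxonomy2.txt` would be consistent with the "0, 1" row-count mismatch in the data.frame error.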

Best, Simon

ComparativeSystemsBiologyGroup commented 4 years ago

Hi Simon,

If 16scontigsName.txt is empty, there may be an issue with the prediction of the 16S genes done by barrnap. I bet that the file barrnap16s_contigs.gff is not empty, but could you please post a few lines of it? Just so I can check the structure of the output.

Indeed, without the 16scontigsName.txt file, SeqDex is unable to build the table of 16S contigs and their taxonomy, and thus cannot use it to find the 16S gene with the highest coverage among those with the taxonomy indicated in $TRG. In this case SeqDex performs the prediction, but cannot return the tables with the contigs of interest.

Best, Alice

sihellem commented 4 years ago

Dear Alice,

Thank you for your answer. Indeed, the Taxonomy/barrnap16s_contigs.gff file is not empty. Here is its full content:

##gff-version 3
NODE_104885_length_332_cov_0.530806 barrnap:0.9 rRNA    164 267 2.3e-07 -   .   Name=5S_rRNA;product=5S ribosomal RNA
NODE_133557_length_316_cov_0.117949 barrnap:0.9 rRNA    151 237 4.1e-08 +   .   Name=5S_rRNA;product=5S ribosomal RNA (partial);note=aligned only 73 percent of the 5S ribosomal RNA
NODE_13431_length_458_cov_0.569733  barrnap:0.9 rRNA    124 235 1.2e-20 -   .   Name=5S_rRNA;product=5S ribosomal RNA
NODE_163559_length_302_cov_0.331492 barrnap:0.9 rRNA    102 189 2.9e-10 +   .   Name=5S_rRNA;product=5S ribosomal RNA (partial);note=aligned only 73 percent of the 5S ribosomal RNA
NODE_168098_length_300_cov_0.335196 barrnap:0.9 rRNA    86  197 6.2e-21 -   .   Name=5S_rRNA;product=5S ribosomal RNA
NODE_18671_length_437_cov_0.629747  barrnap:0.9 rRNA    279 351 6.8e-07 -   .   Name=5S_rRNA;product=5S ribosomal RNA (partial);note=aligned only 61 percent of the 5S ribosomal RNA
NODE_1_length_3834_cov_14.843523    barrnap:0.9 rRNA    1642    2577    1.5e-76 -   .   Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 59 percent of the 16S ribosomal RNA
NODE_1_length_3834_cov_14.843523    barrnap:0.9 rRNA    2903    3550    5.3e-09 -   .   Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 40 percent of the 16S ribosomal RNA
NODE_215818_length_283_cov_0.481481 barrnap:0.9 rRNA    78  172 3e-13   +   .   Name=5S_rRNA;product=5S ribosomal RNA
NODE_220014_length_282_cov_0.372671 barrnap:0.9 rRNA    120 217 2.2e-09 +   .   Name=5S_rRNA;product=5S ribosomal RNA
NODE_230787_length_279_cov_0.506329 barrnap:0.9 rRNA    45  139 2.9e-08 -   .   Name=5S_rRNA;product=5S ribosomal RNA
NODE_2961_length_577_cov_0.587719   barrnap:0.9 rRNA    94  522 2.8e-108    -   .   Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 27 percent of the 16S ribosomal RNA
NODE_2_length_3745_cov_15.226269    barrnap:0.9 rRNA    1517    3568    2.9e-99 -   .   Name=23S_rRNA;product=23S ribosomal RNA (partial);note=aligned only 63 percent of the 23S ribosomal RNA
NODE_32715_length_404_cov_0.413428  barrnap:0.9 rRNA    130 236 5.9e-08 -   .   Name=5S_rRNA;product=5S ribosomal RNA
NODE_53419_length_376_cov_0.501961  barrnap:0.9 rRNA    220 310 8e-09   +   .   Name=5S_rRNA;product=5S ribosomal RNA (partial);note=aligned only 76 percent of the 5S ribosomal RNA
NODE_6624_length_507_cov_1.225389   barrnap:0.9 rRNA    1   457 1.3e-101    -   .   Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 28 percent of the 16S ribosomal RNA
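
To see which contigs in a barrnap GFF carry a 16S prediction, the first column can be filtered on the `Name=16S_rRNA` attribute. This is an independent sketch for inspection, not the way SeqDex or its rRNA16S.R script extracts the names; `list_16s_contigs` is a hypothetical name.

```shell
# Hypothetical helper: print each contig that has at least one 16S hit
# in a barrnap GFF file, once, in order of first appearance.
list_16s_contigs() {
    # Skip comment lines, keep 16S features, de-duplicate on column 1.
    awk '!/^#/ && /Name=16S_rRNA/ && !seen[$1]++ { print $1 }' "$1"
}
```

Applied to the listing above, it keeps the lines annotated as 16S (NODE_1, NODE_2961 and NODE_6624) and ignores the 5S and 23S hits.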

Thank you for your time and help, Simon

ComparativeSystemsBiologyGroup commented 4 years ago

Hi Simon, sorry for the very late response.

The barrnap gff file seems fine to me. I suppose there may be an issue in the rRNA16S.R script, but without your 16S-related files I am not able to check and correct it. If you want, you can send me the files at alice.chiodi@unimi.it

Best, Alice