dvitsios / mirnovo

Genome free discovery and classification of miRNAs from small RNA-Seq with random forests
MIT License

error post clustering #3

Open lrippel opened 6 years ago

lrippel commented 6 years ago

Hi All,

I'm running the standalone version of mirnovo:

perl mirnovo.pl -i /raid/projects/scratch/BioTrans/Pdensiflora/miRNA/reformat_to_phred64/QA/fa/Bud_Tree_C64.clipped.fa.gz -g NA --disable-genome -t universal_plants -o BudCfa

And I'm getting this error:

    Reading file ../tmp/BudCfa-EKwNInLH/Bud_Tree_C64.clipped.fa 100%
    15577335 nt in 729882 seqs, min 16, max 28, avg 21
    Masking 100%
    Sorting by length 100%
    Counting unique k-mers 100%
    Clustering 100%
    Sorting clusters 100%
    Writing clusters 100%
    Clusters: 90723 Size min 1, max 13361, avg 8.0
    Singletons: 38396, 5.3% of seqs, 42.3% of clusters
    Traceback (most recent call last):
      File "run_vsearch_clust_fast.py", line 217, in <module>
        find_Nread_clusters(dir, min_read_N, min_num_of_variants)
      File "run_vsearch_clust_fast.py", line 98, in find_Nread_clusters
        depth = vals[1]
    IndexError: list index out of range
    Traceback (most recent call last):
      File "mirnovo_analysis.py", line 211, in <module>
        cluster_reads(input_reads_fasta, usearch_perc_id, min_numb_reads, usearch_dir, min_num_of_variants, job_id)
      File "mirnovo_analysis.py", line 61, in cluster_reads
        subprocess.check_call(cmd, shell=True)
      File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'python -u run_vsearch_clust_fast.py ../tmp/BudCfa-EKwNInLH/Bud_Tree_C64.clipped.fa.gz 0.9 5 ../tmp/BudCfa-EKwNInLH/1/usearch_out 1 1 16 28' returned non-zero exit status 1


All job progress has been saved to: ../tmp/BudCfa-EKwNInLH/bsub.log file.

Results can be found at: ../tmp/BudCfa-EKwNInLH/All-Results/

Can someone explain why and how to cope?

Thanks!

dvitsios commented 6 years ago

Can you share a link to a minimal example file that I can use to reproduce the error?

rajeshgazara commented 6 years ago

Did you solve the issue? I got the same error.

forrestzhang commented 5 years ago

I have the same problem. I tried the online version and did not get any results there either.

Job ID: 2d4ec0a8-a06c-eafd-6eef-20996247739a Successfully completed.

The results are available at the following link:

http://wwwdev.ebi.ac.uk/enright-dev/mirnovo/cgi-bin/core/display_results.php?uuid=2d4ec0a8-a06c-eafd-6eef-20996247739a&jobs_num=1

    Output dir: ../tmp/wt2

    vsearch v2.4.3_osx_x86_64, 64.0GB RAM, 8 cores
    https://github.com/torognes/vsearch

    Reading file ../tmp/wt2/wt2.fasta 100%
    142256330 nt in 6227113 seqs, min 16, max 28, avg 23
    Masking 100%
    Sorting by length 100%
    Counting unique k-mers 100%
    Clustering 100%
    Sorting clusters 100%
    Writing clusters 100%
    Clusters: 537350 Size min 1, max 352382, avg 11.6
    Singletons: 375756, 6.0% of seqs, 69.9% of clusters
    Traceback (most recent call last):
      File "run_vsearch_clust_fast.py", line 217, in <module>
        find_Nread_clusters(dir, min_read_N, min_num_of_variants)
      File "run_vsearch_clust_fast.py", line 98, in find_Nread_clusters
        depth = vals[1]
    IndexError: list index out of range
    Traceback (most recent call last):
      File "mirnovo_analysis.py", line 211, in <module>
        cluster_reads(input_reads_fasta, usearch_perc_id, min_numb_reads, usearch_dir, min_num_of_variants, job_id)
      File "mirnovo_analysis.py", line 61, in cluster_reads
        subprocess.check_call(cmd, shell=True)
      File "/Users/forrest/anaconda3/envs/mirnovopython2/lib/python2.7/subprocess.py", line 190, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'python -u run_vsearch_clust_fast.py ../tmp/wt2/wt2.fasta.gz 0.9 5 ../tmp/wt2/1/usearch_out 1 1 16 28' returned non-zero exit status 1

dvitsios commented 5 years ago

@forrestzhang There is a known bug in how the pipeline processes FASTA files with tally when their 3' adapters have already been removed. I will try to fix this in the pipeline as soon as I can.

Meanwhile, you can just tally your input file before uploading it to mirnovo:

  1. Download tally:

  2. Run on a bash terminal:

    file=wt2.fa.gz;
    tally -i $file --fasta-in -o $file.tallied.gz -tri 20 -l 16 -u 28 --fasta-out -format '>trn_t%T_i%I_x%C%n%R%n'
  3. Upload the wt2.fa.gz.tallied.gz to mirnovo.
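
Before uploading, it can help to confirm that the tallied file really carries the header layout requested by the `-format` string above. The sketch below is only an illustration: it assumes that `%C` expands to the read count, so that headers look like `>trn_t<T>_i<I>_x<count>`, and the helper name `read_depth` and the example header `>trn_t30_i1_x13361` are hypothetical, not taken from mirnovo or tally.

```python
import re

# Header layout implied by -format '>trn_t%T_i%I_x%C%n%R%n'.
# Assumption: %C is the number of identical reads collapsed into one record.
TALLY_HEADER = re.compile(r"^>trn_t(\S+)_i(\S+)_x(\d+)$")


def read_depth(header):
    """Return the read count encoded in a tallied header, or None if absent."""
    match = TALLY_HEADER.match(header.strip())
    return int(match.group(3)) if match else None


def untallied_headers(lines):
    """Collect FASTA header lines that lack the '_x<count>' suffix."""
    return [
        line.strip()
        for line in lines
        if line.startswith(">") and read_depth(line) is None
    ]


if __name__ == "__main__":
    example = [
        ">trn_t30_i1_x13361",  # hypothetical tallied header
        "ACGAACGAGACCTCAGC",
        ">HISEQ:916:CCP03ANXX:3:1101:1384:19911:N:0:AACAACCA",  # raw header from this thread
        "ATCACGAGAGGAACCG",
    ]
    print(untallied_headers(example))
```

If `untallied_headers` returns anything, the file has probably not been through tally with this format string and is likely to hit the same failure.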

Additionally, since you want to analyse samples from a plant species, I would advise you to try three model options:

  • species-specific (osa)
  • UNIVERSAL plants and
  • UNIVERSAL animals

UNIVERSAL animals may capture more canonical miRNAs than the plant-specific models. You can then take the union or the consensus of the predictions from these models.
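
One possible way to combine the predictions, sketched under the assumption that each model's output has already been reduced to a set of predicted miRNA sequences. The function name `combine_predictions`, the `min_models` threshold, and the toy sequences are all hypothetical; mirnovo's actual result files would need to be parsed first.

```python
from collections import Counter


def combine_predictions(predictions_by_model, min_models=2):
    """Return (union, consensus) of predicted sequences across models.

    consensus keeps sequences predicted by at least `min_models` models.
    """
    counts = Counter()
    for sequences in predictions_by_model.values():
        counts.update(set(sequences))  # count each sequence once per model
    union = set(counts)
    consensus = {seq for seq, n in counts.items() if n >= min_models}
    return union, consensus


if __name__ == "__main__":
    # Toy sequences made up for illustration; not real predictions.
    predictions = {
        "osa": {"ACGUACGUACGUACGUACGU", "UUUUGGGGCCCCAAAAUUUU"},
        "universal_plants": {"ACGUACGUACGUACGUACGU"},
        "universal_animals": {"ACGUACGUACGUACGUACGU", "GGGGAAAACCCCUUUUGGGG"},
    }
    union, consensus = combine_predictions(predictions)
    print(len(union), "sequences in the union")
    print(sorted(consensus), "predicted by at least two models")
```

Setting `min_models` to the number of models gives the strict intersection; lower values trade specificity for sensitivity.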

forrestzhang commented 5 years ago

Thanks for your advice. However, tally seems to have a problem with the sequence names.

~/Software/mirnovo_pkg_linux_v1.0/bin/reaper-16-098/src/tally -i WT2_clean_total.fa.gz -o WT2_clean_total.fa.tallied.gz -tri 20 -l 16 -u 28 --fasta-out -format '>trn_t%T_i%I_x%C%n%R%n' 

    [tally] parse error at line 1 (remaining format string [@%I%#%R%n+%#%Q%n], buffer [>HISEQ:916:CCP03ANXX:3:1101:1384:19911:N:0:AACAACCA])
    [tally] data log size 28 (0.25G) hash log size 25 (0.50G) unit size 16
    [tally] parse error at line 1 (remaining format string [@%I%#%R%n+%#%Q%n], buffer [>HISEQ:916:CCP03ANXX:3:1101:1384:19911:N:0:AACAACCA])
    discarded_unmatched=0
    discarded_alien=0
    discarded_length=0
    discarded_trint=0
    nt_in=0
    nt_out=0
    passed_unique=0
    passed_total=0
    num_records=0
    [memusage] 805306368 bytes

--------Fasta file------

    >HISEQ:916:CCP03ANXX:3:1101:18421:21101:N:0:AACAACCA
    ACGAACGAGACCTCAGC
    >HISEQ:916:CCP03ANXX:3:1101:18323:21851:N:0:AACAACCA
    ATCACGAGAGGAACCG
    >HISEQ:916:CCP03ANXX:3:1101:18260:22241:N:0:AACAACCA
    GTGGAGCGATTTGTCTGGTTAATTCCGTTAAC
    >HISEQ:916:CCP03ANXX:3:1101:18355:22281:N:0:AACAACCA
    CCCAAGATGAGTGCTCTCTC
    >HISEQ:916:CCP03ANXX:3:1101:18302:22371:N:0:AACAACCA
    CAGCCGACTCAGAACTGGTA
    >HISEQ:916:CCP03ANXX:3:1101:18724:20131:N:0:AACAACCA
    NCGAACAGCCGACTCAGAACTG
    >HISEQ:916:CCP03ANXX:3:1101:18738:20941:N:0:AACAACCA
    AATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCC
    >HISEQ:916:CCP03ANXX:3:1101:18528:21621:N:0:AACAACCA
    GGAATTTCCGGTGGAGCGGTGAAATGCATTG

dvitsios commented 5 years ago

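You omitted the --fasta-in argument (see the command in my previous reply). This option tells tally to parse the input as FASTA instead of FASTQ, which is the default. That is why the parse error shows a FASTQ format string (starting with @) being matched against a header that starts with >.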
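
A quick check of the first record can catch this kind of mismatch before tally is run. The following is a minimal sketch, not part of mirnovo or tally: the helper names are hypothetical, and it simply assumes that FASTA records start with `>` and FASTQ records with `@`.

```python
import gzip


def format_from_first_line(line):
    """Guess the read format from the first non-empty line of a file."""
    if line.startswith(">"):
        return "fasta"
    if line.startswith("@"):
        return "fastq"
    return "unknown"


def sniff_read_format(path):
    """Open a plain or gzip-compressed file and guess its read format."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.strip():
                return format_from_first_line(line.strip())
    return "unknown"


def tally_input_flags(path):
    """Return the extra input flag that tally needs for FASTA input, if any."""
    return ["--fasta-in"] if sniff_read_format(path) == "fasta" else []


if __name__ == "__main__":
    header = ">HISEQ:916:CCP03ANXX:3:1101:1384:19911:N:0:AACAACCA"
    print(format_from_first_line(header))
```

For a file such as WT2_clean_total.fa.gz, `tally_input_flags` would return `["--fasta-in"]`, which is the flag missing from the command above.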

dvitsios commented 5 years ago

@forrestzhang

I have now integrated a fix into the mirnovo web server for the issue described above: https://github.com/dvitsios/mirnovo/issues/3#issuecomment-449373618

Basically, mirnovo can now also process FASTA files whose 3' adapters have already been removed. Until now, the pipeline was primarily geared towards FASTQ files (either raw or cleaned) or FASTA files that still include their 3' adapters.