dvitsios / mirnovo

Genome free discovery and classification of miRNAs from small RNA-Seq with random forests
MIT License

need multi threads for running the program #4

Open Kennyluo4 opened 5 years ago

Kennyluo4 commented 5 years ago

Hello, I'm using this software for my own species (no pre-built reference genome):

./mirnovo.pl -i S1.fasta.gz -g NA -t universal_plants -o S1_small

The file S1.fasta.gz is around 200Mb, and it seems to be running very slowly. I have been waiting for 3 days and it's still running. Can you add multi-processing to this software so that I can use more CPUs for this work? I checked the output files in the ./temp/ folder: it generated hundreds of Gb of files, which is insane. I really don't know what's going on in this software. Also, is there a way to build my own genome index for higher accuracy?

BTW, I also tried the web-based version. It finished successfully after two days, but the result page is empty. How did that happen?

dvitsios commented 5 years ago

Hi, mirnovo is already multi-threaded in multiple sub-modules. In this version, it supports up to 40 threads in parallel (currently the max. number of threads is fixed in the code; I'll add it as a parameter in the next version). That means it only falls short of full utilisation if you try to use more than 40 cores.

The main issue with your file seems to be the high complexity in the input file, i.e. there are too many unique (near identical) sequences that lead to too many clusters when using vsearch. To deal with that, you can tweak three parameters:

  1. --reduce-complexity -n [tally_threshold]: by setting e.g. tally_threshold=3, you will filter out any sequences in your input file that have total coverage < 3, before starting the initial clustering.
  2. Increase the min_variants (-m) parameter to select only clusters of sequences that contain at least min_variants unique sequences.
  3. Increase the min_read_depth (-d) parameter to select only clusters of sequences with total read depth >= min_read_depth.
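
To make the effect of these three filters concrete, here is a minimal Python sketch of the logic as described above. It is not mirnovo's actual code: the function names and the in-memory data layout (a dict mapping each unique sequence to its read count) are assumptions made purely for illustration.

```python
from collections import Counter


def tally_reads(reads):
    """Count how many times each distinct sequence occurs (its coverage)."""
    return Counter(reads)


def reduce_complexity(tally, tally_threshold):
    """Drop sequences whose coverage is below tally_threshold (the -n idea)."""
    return {seq: n for seq, n in tally.items() if n >= tally_threshold}


def keep_cluster(cluster, min_variants, min_read_depth):
    """Decide whether a cluster passes the -m and -d style filters.

    `cluster` is a hypothetical representation: unique sequence -> read count.
    """
    enough_variants = len(cluster) >= min_variants
    enough_depth = sum(cluster.values()) >= min_read_depth
    return enough_variants and enough_depth


# Toy input: four distinct sequences with coverage 5, 3, 2 and 1.
reads = ["ACGT"] * 5 + ["ACGA"] * 3 + ["TTTT"] * 2 + ["GGGG"]
tally = tally_reads(reads)
filtered = reduce_complexity(tally, tally_threshold=3)
print(filtered)
```

With tally_threshold=3, only the two sequences covered at least 3 times survive, so far fewer near-identical singletons reach the clustering step.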

I'd recommend that you start with something like this:

./mirnovo.pl -i [input_file] -g NA -t universal_animals --reduce-complexity -n 7 -m 5 -d 20

but eventually try running it with slightly less extreme parameters, like:

./mirnovo.pl -i [input_file] -g NA -t universal_animals --reduce-complexity -n 3 -m 5 -d 10

Finally, you can install your own reference genome in mirnovo. Instructions are available here: http://wwwdev.ebi.ac.uk/enright-dev/mirnovo-standalone-pkg/Genome-Annotation-1.0/README (section "Install a Reference Genome manually (e.g. Homo sapiens - hsa)", at the bottom).

Kennyluo4 commented 5 years ago

Thanks for the information. I'll try what you suggested. I saw those parameters discussed in the paper, but never knew how to set them correctly (e.g. whether to use --reduce-complexity or -r). It would be nice if you put the option descriptions in the software's help() or README in the future.

After the first try:

./mirnovo.pl -i S7H4_ah_mir.fasta.gz -g NA -t universal_plants --reduce-complexity -n 7 -m 5 -d 20 -o test

it printed a message like this for each read:

Argument ">CL100126290L1C012R006_414987/1" isn't numeric in numeric ge (>=) at ./ultra_fast_filter.pl line 53, line 54081692

I'm not sure whether this is a concern. Then the program stopped with the following error, which was also reported by others:

subprocess.CalledProcessError: Command 'python -u run_vsearch_clust_fast.py ../tmp/test3-xJUFTmsa/S7H4_ah_mir.fasta.gz 0.9 20 ../tmp/test3-xJUFTmsa/1/usearch_out 5 1 16 28' returned non-zero exit status 1

It seems something goes wrong when calling 'python -u run_vsearch_clust_fast.py'.
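
The CalledProcessError message above reports only the exit status, not the reason for it. One way to dig further is to re-run the same child command while capturing its stderr. The following is a minimal Python sketch and not part of mirnovo: `run_and_report` is a hypothetical helper, and the demo command is just a stand-in for the real `python -u run_vsearch_clust_fast.py ...` call from the traceback (which would need to be run from the directory mirnovo launches it in).

```python
import subprocess
import sys


def run_and_report(cmd, cwd=None):
    """Run cmd and return (exit_code, stderr_text) instead of raising."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode, result.stderr


# Stand-in for the failing call: a child process that writes to stderr
# and exits with status 1, like the command in the traceback did.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(1)",
]
code, err = run_and_report(demo_cmd)
print("exit status:", code)
print("stderr:", err)
```

Substituting the argument list from the traceback for `demo_cmd` should surface whatever the clustering script printed before it exited.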

According to your announcement in that issue comment:

mirnovo is now able to also process FASTA files which are already cleaned from their 3p-adapters. So far, the pipeline was primarily focused around FASTQ files (either raw or cleaned) or FASTA with their 3p adapters included.

Is this issue supposed to be fixed? Before spending more time playing around with tally to preprocess my reads, I simply used the fastq.gz file for another try. It has been running for a couple of hours now, and I hope it will be okay, even though the log file reports this issue:

Can't locate SeqComplex.pm in @INC (you may need to install the SeqComplex module)

Will this problem affect the final result or kill the program eventually?