aquaskyline / SOAPdenovo2

Next generation sequencing reads de novo assembler.
GNU General Public License v3.0

Error reading FASTA files #32

Closed abshah closed 7 years ago

abshah commented 7 years ago

Hi SOAPdenovo2 devs, I have just noticed a strange issue. Whenever I input FASTA files (using the f1,f2 flags in the configuration file), the program just goes into a loop of trying to process the reads. I am using version 2.40 (installed from bioconda) and my command was:

SOAPdenovo-63mer all -s /vol/assembly/config_Lmig_PE90_20K_insert.txt -K 43 -R -p 16 -o PE90_20K_ins 1> assembly_PE90.log 2> assembly_PE90.error

and my log file looks like this:

Version 2.04: released on July 13th, 2012
Compile Jul 12 2016     07:31:42

********************
Pregraph
********************

Parameters: pregraph -s /vol/assembly/config_Lmig_PE90_20K_insert.txt -K 43 -p 16 -R -o PE90_20K_ins

In /vol/assembly/config_Lmig_PE90_20K_insert.txt, 1 lib(s), maximum read length 90, maximum name length 256.

16 thread(s) initialized.
Import reads from file:
 /vol/libs/SRR764591_1.fasta
Import reads from file:
 /vol/libs/SRR764591_2.fasta
--- 100000000th reads.
--- 200000000th reads.
--- 300000000th reads.
--- 400000000th reads.
--- 500000000th reads.
--- 600000000th reads.
--- 700000000th reads.
--- 800000000th reads.
--- 900000000th reads.
--- 1000000000th reads.
--- 1100000000th reads.

........


--- 612500000000th reads.
--- 612600000000th reads.
--- 612700000000th reads.
--- 612800000000th reads.
--- 612900000000th reads.
--- 613000000000th reads.
--- 613100000000th reads.
--- 613200000000th reads.
--- 613300000000th reads.
--- 613400000000th reads.
--- 613500000000th reads.
--- 613600000000th reads.
--- 613700000000th reads.
--- 613800000000th reads.

However, when I switch to FASTQ input files, the problem disappears.

Best Wishes, Abhijeet

cchd0001 commented 7 years ago

@abshah Could you please try the latest code with those FASTA data? Would it be possible to send me the data for debugging?

abshah commented 7 years ago

Hi @cchd0001, I just recompiled SOAPdenovo2 with the latest version from GitHub. I believe the issue is with loading multi-line FASTA files. The SRA-toolkit usually writes FASTA reads in a multi-line format. You can download the data files using fastq-dump (from SRA-toolkit):

fastq-dump --split-spot --split-files --clip --fasta SRR764591

If you convert the same file to 2-line FASTA format, the issue appears to go away.
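The conversion to 2-line FASTA mentioned above can be sketched in a few lines of Python. This is only an illustration, not part of SOAPdenovo2 or the SRA-toolkit; the function names `unwrap_fasta` and `unwrap_fasta_file` are hypothetical, and the sketch assumes every record starts with a `>` header line.

```python
def unwrap_fasta(src, dst):
    """Copy FASTA records from src to dst with each sequence on a single line.

    src and dst are text file objects. Returns the number of records written.
    Hypothetical helper for illustration; assumes headers start with '>'.
    """
    header = None
    seq_parts = []
    count = 0
    for line in src:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                dst.write(header + "\n" + "".join(seq_parts) + "\n")
                count += 1
            header = line
            seq_parts = []
        elif line.strip() and header is not None:
            # Sequence line: collect it so wrapped lines can be joined later.
            seq_parts.append(line.strip())
    # Flush the last record.
    if header is not None:
        dst.write(header + "\n" + "".join(seq_parts) + "\n")
        count += 1
    return count


def unwrap_fasta_file(in_path, out_path):
    """Convenience wrapper that works on file paths instead of file objects."""
    with open(in_path) as src, open(out_path, "w") as dst:
        return unwrap_fasta(src, dst)
```

Calling `unwrap_fasta_file` on each of the two read files (with output paths of your choice) and pointing f1 and f2 at the results should give the assembler the 2-line layout it expects. Reads are streamed one record at a time, so memory use stays small even for large libraries.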

Best, Abhijeet

cchd0001 commented 7 years ago

Hi @abshah
Sorry for the delay. Something is wrong with my network and I can't download your data. However, I made a simulated multi-line FASTA data set and triggered the bug. It is true that SOAPdenovo2 doesn't support multi-line FASTA files. It assumes that a FASTA file has exactly 2 lines per read and a FASTQ file has exactly 4 lines per read. However, instead of entering an infinite loop, the latest code (which you have already tested) detects such input and exits with a warning that looks like this:

Import reads from file:
 MP4000_shuffled_origin.fa
readseqInLib return error! please make sure input file is correct fastq/fasta file 
invalid data left in buffer:
CACCAATTCTAAGCATTAAGCTTtttctttattttctttctttcttttccttctttc
tttctctctctctctctttctttctttctctctctctctttctttctctttcttcct
tccttccttccttccttccttccttctctccttaattgtgggaaaatataaataaac
taaaactcatcattttcacctt

There is no plan to support multi-line FASTA files. I hope this helps.

Best wishes, Lidong Guo
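Because the assembler expects exactly one sequence line per FASTA header, it can be worth checking a file before starting a long run. The following is a minimal sketch of such a check, not code from SOAPdenovo2; the function name `first_wrapped_record` is hypothetical.

```python
def first_wrapped_record(lines):
    """Return the header of the first FASTA record whose sequence spans
    more than one line, or None if every record uses a single sequence line.

    lines is any iterable of text lines, for example an open file.
    Hypothetical helper for illustration only. If the wrapped sequence
    appears before any header, None is also returned.
    """
    header = None
    seq_lines = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # New record: remember its header and reset the line counter.
            header = line
            seq_lines = 0
        elif line.strip():
            seq_lines += 1
            if seq_lines > 1:
                # Second sequence line within one record: the file is wrapped.
                return header
    return None
```

Passing an open file handle works directly, and the scan stops at the first wrapped record, so on SRA-toolkit style multi-line output it usually returns after reading only a few lines.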

abshah commented 7 years ago

Hi @cchd0001 Thanks for adding this in.

Best Wishes, Abhijeet