OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License

Fastq files with MS-DOS end-line break #236

Open Erlor opened 4 years ago

Erlor commented 4 years ago

Hello, I have been running fastp and it worked previously, but recently I ran a sample in which roughly 100K of 1M reads were trimmed before fastp stopped with the message:

ERROR: sequence and quality have different length:
@read name
Read Sequence

+

I dug a bit into the fastqreader code, and there seems to be an off-by-one offset here: the separator line is now empty, and the separator instead occupies the line for quality.

I found that the file has ^M$ at the end of each line, indicating that it was saved on a Windows machine. Judging by the getline function in fastqreader, it seems you tried to deal with this, but the problem has persisted.
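For illustration, a CRLF-tolerant reader could look like the sketch below. This is not fastp's getLine implementation, which works on an internal buffer; `readLineNoCr`, `FastqRecord`, and `readRecord` are hypothetical names, and the sketch assumes an uncompressed stream of four-line FASTQ records.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of fastp): read one line and drop a single
// trailing '\r', so LF and CRLF files yield the same line content.
bool readLineNoCr(std::istream& in, std::string& line) {
    if (!std::getline(in, line)) return false;
    if (!line.empty() && line.back() == '\r') line.pop_back();
    return true;
}

// Hypothetical container for one four-line FASTQ record.
struct FastqRecord {
    std::string name, sequence, strand, quality;
};

// Read one record. Returns false at end of input or when the record is
// malformed (wrong line prefixes, or sequence/quality lengths differ).
bool readRecord(std::istream& in, FastqRecord& rec) {
    if (!readLineNoCr(in, rec.name)) return false;
    if (!readLineNoCr(in, rec.sequence)) return false;
    if (!readLineNoCr(in, rec.strand)) return false;
    if (!readLineNoCr(in, rec.quality)) return false;
    return !rec.name.empty() && rec.name[0] == '@'
        && !rec.strand.empty() && rec.strand[0] == '+'
        && rec.sequence.size() == rec.quality.size();
}
```

Stripping the carriage return in one place, right after the line is read, keeps the separator and quality lines aligned regardless of the line-ending style.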

mschilli87 commented 4 years ago

Can you paste the output of od -c < your.fastq | head -n 100 (or zcat your.fastq.gz | od -c | head -n 100 in case your FASTQ file is compressed)?

Erlor commented 4 years ago

The output was quite lengthy. I also checked whether the error can be reproduced in another way, and it can: run any FASTQ file through unix2dos (which basically adds \r to each line ending).
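For anyone who wants to build a small test input without unix2dos, the conversion can be sketched in a few lines. `toCrlf` is a hypothetical helper, not part of fastp or unix2dos, and it only approximates what that tool does: every bare \n becomes \r\n, and existing \r\n pairs are left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper approximating unix2dos: insert '\r' before every '\n'
// that is not already preceded by '\r'. Applying it twice changes nothing.
std::string toCrlf(const std::string& text) {
    std::string out;
    out.reserve(text.size() + text.size() / 16);
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\n' && (i == 0 || text[i - 1] != '\r')) {
            out.push_back('\r');
        }
        out.push_back(text[i]);
    }
    return out;
}
```

Feeding a single converted record to the reader should be enough to check whether the length mismatch appears.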

0000000   @   S   R   R   8   5   6   1   4   1   3   .   1       1   /
0000020   1  \r  \n   A   T   G   G   C   G   G   C   G   G   C   G   G
0000040   G   C   C   T   G   G   C   G   G   A   A   C   T   G   C   T
0000060   G   G   G   C   G   G   A   A   G   C   C   C   G   A   C   G
0000100   C   A   G   G   T   G   T   G   C   A   T   C   G   C   G   G
0000120   C   T   G   A   A   A   T   C   G   G   C   A   T   G   G   A
0000140   A   C   A   T   A   A   C   C   T   T   G   G   T   T   A   A
0000160   C   C   T  \r  \n   +  \r  \n   C   C   D   A   C   B   6   ;
0000200   ;   7   ;   ;   ;   1   ;   6   ;   ;   6   ;   ;   6   ;   6
0000220   ;   2   :   :   2   2   2   *   2   2   .   2   .   2   ;   C
0000240   9   ;   ;   ;   ;   0   0   /   +   .   .   -   -   4   4   B
0000260   C   C   C   C   B   ?   >   >   C   D   >   C   C   C   A   C
0000300   D   F   F   >   >   ;   :   :   :   8   2   8   2   8   2   1
0000320   )   1   )   1   3   3   7   .  \r  \n   @   S   R   R   8   5
0000340   6   1   4   1   3   .   2       2   /   1  \r  \n   G   A   C
0000360   T   G   A   A   G   C   A   G   G   G   C   A   G   C   T   C
0000400   T   A   C   T   T   T   G   A   G   C   G   G   T   G   C   A
0000420   G   G   C   G   T   A   T   T   G   T   G   G   A   T   G   A
0000440   A   G   C   G   A   G   G   C   T   G   G   C   A   C   A   T
0000460   G   A   A   C   A   G   T   T   A   C   C   G   G   G   G   T
0000500   G   A   T   A   G   T   G   A   T   A   G   A   C   G   A   C
0000520   C   T   C   G   G   G   T   A   T   G   T   C   A   A   A   C
0000540   G   C   G   A   C   A   A   C   G   C   G   G   A   A   A   C
0000560   G   G   G   G   T   G   T   G   T   T   G   T   T   C   G   A
0000600   G   G   C   G   T   G   A   T   C   G   C   A   C   A   T   C
0000620   G   T   T   A   T   G   A   G   C   G   C   G   G   C   A   G
0000640   T   C   T   G   G   T   C   G   A   T   C   A   C   C   A   G
0000660   C   A   A   T   C   A   G   T   C   C   G   T   T   C   A   G
0000700   C   A   C   G   T   G   G   G   G   G   T   A   G   C   A   T
0000720   C   T   T   C   G   T   G   G   A   T   G   A   A   A   C   G
0000740   G   A   T   G   G   C   G   G   T   G   G   C   A   G   C   A
0000760   G   C   C   G   A   T   C   G   T   C   T   G   A   T   C   C
0001000   A   G   T   A   C   G   G   G   G   T   A   T   A   T   G   T
0001020   T   C   G   A   G   C   T   A   A   A   A   A   G   G   A   G
0001040   A   A   A   G   T   T   A   C   C   G   G   A   A   A   A   A
0001060   G   A   C   T   G   G   C   A   A   G   G   C   A   G   T   A
0001100   A   C   C   A   G   C   G   T   G   G  \r  \n   +  \r  \n   >
0001120   @   @   D   C   C   =   @   @   ;   ;   ;   1   5   4   ;   ;
0001140   ;   ;   @   ;   ;   :   9   *   :   B   B   ;   ;   7   ;   ;
0001160   <   /   /   *   -   .   .   ;   ;   8   ;   ;   ;   6   ;   @
0001200   @   A   >   C   >   ?   ;   A   =   @   :   ;   7   ;   ;   6
0001220   6   ;   ;   ;   >   C   ;   6   ;   7   ;   D   0   ;   5   ;
0001240   ,   5   4   C   C   C   D   C   C   C   C   ?   ;   ;   A   @
0001260   ;   ;   6   ;   :   :   :   /   9   B   @   :   2   2   2   :
0001300   /   9   A   :   :   :   :   :   <   :   :   :   B   4   :   9
0001320   /   2   :   :   :   *   2   :   9   :   @   >   :   8   5   /
0001340   /   -   /   '   -   -   -   -   -   3   9   9   9   ?   :   :
0001360   :   :   3   :   1   .   /   /   -   2   7   7   <   ?   9   >
0001400   ?   ?   ?   @   @   @   ?   ;   :   :   :   :   :   :   :   4
0001420   9   ?   8   8   2   :   @   >   ?   ?   B   ;   ?   ?   =   A
0001440   A   @   A   :   8   :   :   8   8   8   8   *   8   8   8   8
0001460   7   :   =   C   3   3   :   1   1   2   -   -   -   -   4   -
0001500   -   -   '   -   8   7   2   8   B   <   >   ?   ;   ?   ;   ;
0001520   ;   ;   ;   C   ?   C   9   9   0   0   0   >   8   <   7   0
0001540   0   *   0   /   /   /   /   /   /   /   (   8   7   1   1   1
0001560   1   1   )   1   1   1   1   1   1   1   3   3   3   0   ;   ;
0001600   @   @   @   D   1   :   :   =   B   B   =   A   =   A   C   C
0001620   C   3   :   0   0   0   0   ,   /   0   5   :   *   0   ;   :
0001640   <   =   7   <   9   ?   ?   <   7   7   7   )  \r  \n   @   S
0001660   R   R   8   5   6   1   4   1   3   .   3       3   /   1  \r
0001700  \n   T   C   C   C   T   T   C   A   T   A   C   T   G   C   A
0001720   C   G   T   A   G   A   G   C   T   G   C   C   G   C   A   G
0001740   T   T   C   A   T   C   G   G   C   A   T   A   A   G   C   C
0001760   T   G   C   A   G   G   A   A   T   T   C   C   G   G   G   G
0002000   T   A   A   A   G   A   C   G   T   C   A   C   G   A   C   G
0002020   G   T   G   C   T   G   C   A   A   C   C   A   G   C   G   C
0002040   T   G   C   T   G   C   T   C   G   G   C   T   T   C   G   T
0002060   C   C   A   G   C   G   T   G   C   C   G   G   G   G   A   A
0002100   A   T   T   A   C   G   C   G   C   C   C   G   A   T   A   G
0002120   T   T   G   A   A   C   A   G   C   A   G   T   T   T   C   T
0002140   C   G   A   T   G   C   G   T   T   T   A   T   C   G   G   C
0002160   A   A   A   C   G   T   G   A   G   G   T   C   C   A   G   C
0002200   G   C   G   G   G   C   A   G   A   T   T   G   C   G   T   G
0002220   C   T   C   G   G   T   C   T   C   C   A   G   C   A   G   G
0002240   A   T   T   T   T   C   A   T   C   G   C   C   G   C   G   C
0002260   G   G   T   C   C   G   C   A   T   C   G   C   T   G   A   A
0002300   G   A   A   A   C   C   A   T   T   G   T   A   G   A   G   C
0002320   T   G   G   G   T   G   T   C   G   A   C   G   T   T   T   T
0002340   C   C   G   A   C   G   G   C   A   C   A   A   A   A   G   G
0002360   C   T   C   C   G   C   C   T   C   G   G   C   A   A   A   A
0002400   G   T   C   G   C   C   A   C   G   A   C   T   T   G   C   G
0002420   C   G   T   A   C   C   G   T   G   C   G   G   T   A   T   C
0002440   G   C   G   C   A  \r  \n   +  \r  \n   0   6   6   ,   6   0
0002460   6   ;   ;   ;   ;   >   D   C   B   C   C   B   ;   ;   ;   :
0002500   :   :   :   :   4   :   :   :   :   :   4   :   @   =   =   ?
0002520   @   ?   C   D   G   A   D   C   D   @   >   <   A   A   7   ;
0002540   7   <   7   <   7   ;   ;   ;   /   ;   B   B   ;   B   B   ;
0002560   >   ?   ?   ?   >   C   D   B   @   ;   ;   ;   ;   ;   >   D
0002600   D   C   F   A   D   D   C   C   C   @   A   @   >   <   A   A
0002620   ;   ;   7   ;   @   =   ;   ;   >   B   ?   C   C   C   C   C
0002640   C   E   @   B   B   B   .   :   :   /   :   5   :   :   :   =
0002660   :   :   :   /   =   B   @   @   @   C   5   :   :   5   :   :
0002700   :   :   :   :   B   B   7   B   B   <   8   :   :   @   @   :
0002720   :   :   /   :   8   8   8   2   8   C   C   >   C   C   C   B
0002740   @   @   7   ;   ;   7   ;   ;   ;   ;   @   @   @   :   ?   >
0002760   >   >   D   A   C   C   C   D   C   C   C   C   C   @   C   C
0003000   B   C   <   @   :   :   :   :   5   :   8   8   8   *   8   =
0003020   <   <   <   B   5   ;   ;   ;   :   8   =   8   8   4   ;   :
0003040   :   B   B   >   ?   ?   ?   ?   ;   ?   9   9   3   9   3   7
0003060   7   *   0   0   0   0   0   0   8   8   :   :   -   0   :   :

mschilli87 commented 4 years ago

My best guess is that one of the following lines should match the other, as they seem to be supposed to do the same job but differ:

https://github.com/OpenGene/fastp/blob/e01e9402c3d5afded49b21c8303be51d7cbb2d27/src/fastqreader.cpp#L116-L118

https://github.com/OpenGene/fastp/blob/e01e9402c3d5afded49b21c8303be51d7cbb2d27/src/fastqreader.cpp#L145-L147

Right now I'm not in the state of mind to dig deeper, but it looks like the latter is the older version and the former was changed to fix https://github.com/OpenGene/fastp/issues/133 in https://github.com/OpenGene/fastp/commit/e01e9402c3d5afded49b21c8303be51d7cbb2d27.

Maybe this gives @sfchen an idea of what's happening, or maybe it helps someone else who has time to tackle this.

rocpengliu commented 3 years ago

Hi, I have a similar error.

My FASTQ file is in Windows format, with \r\n line endings. I removed every mBuf[end-1]=='\r' and mBuf[end]=='\r' check in getLine(), and it works well.
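An alternative that avoids patching fastp is to normalize the line endings before the file reaches it. The sketch below is a hypothetical stream filter (`crlfToLf` is not a fastp function) that does roughly what dos2unix does: it drops the \r of every \r\n pair and passes everything else through unchanged.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical filter: copy `in` to `out`, turning each "\r\n" into "\n".
// A '\r' that is not followed by '\n' is kept as-is.
void crlfToLf(std::istream& in, std::ostream& out) {
    char c;
    bool pendingCr = false;  // true when the previous byte was an unwritten '\r'
    while (in.get(c)) {
        if (pendingCr && c != '\n') out.put('\r');  // lone '\r': emit it after all
        pendingCr = (c == '\r');
        if (!pendingCr) out.put(c);
    }
    if (pendingCr) out.put('\r');  // input ended on a lone '\r'
}
```

Because the filter works byte by byte, it can sit between a decompressor and fastp in a pipe without holding the whole file in memory.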