BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License

ERROR: EOFError: read end of file #25

Closed godkin1211 closed 4 years ago

godkin1211 commented 4 years ago

Hi! I got an error when I tried to follow the content of fastq.md to read a FASTQ file.

Expected Behavior

I expect to be able to read my FASTQ file successfully when I use the sample code from the docs:

julia> reader = open(FASTQ.Reader, "mydata.fastq")
       record = FASTQ.Record()
       while !eof(reader)
           read!(reader, record)
       end

Current Behavior

I got an error

julia> open(FASTQ.Reader, "mydata.fastq") do reader
           record = FASTQ.Record()
           while !eof(reader)
               read!(reader, record)
           end
       end

ERROR: EOFError: read end of file
Stacktrace:
 [1] read!(::FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}, ::FASTX.FASTQ.Record) at /home/godkin/.julia/packages/FASTX/Uoya3/src/fastq/reader.jl:46
 [2] (::var"#7#8")(::FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}) at ./REPL[42]:4
 [3] open(::var"#7#8", ::Type{FASTX.FASTQ.Reader}, ::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/godkin/.julia/packages/BioGenerics/cCuGr/src/IO.jl:48
 [4] open(::Function, ::Type{FASTX.FASTQ.Reader}, ::String) at /home/godkin/.julia/packages/BioGenerics/cCuGr/src/IO.jl:46
 [5] top-level scope at REPL[42]:1

Possible Solution / Implementation

I tried using a for loop to read the FASTQ file instead, and that works.

julia> reader = open(FASTQ.Reader, "mydata.fastq")
julia> records = FASTQ.Record[]
0-element Array{FASTX.FASTQ.Record,1}

julia> for rec in reader
           push!(records, copy(rec))
       end

Your Environment

Installed Packages

(nanojulia) pkg> status
Status `~/Projects/nanojulia/Project.toml`
  [c7e460c6] ArgParse v1.1.0
  [336ed68f] CSV v0.7.6
  [a93c6f00] DataFrames v0.21.5
  [c2308a5c] FASTX v1.1.3
  [b98c9c47] Pipe v1.3.0

jakobnissen commented 4 years ago

@godkin1211 Would it be possible for you to share the end of the file - in particular the last two records? I'm blanking on what could cause this to happen.

Edit: When you do, please be mindful to give the exact end of your file including any whitespace.

godkin1211 commented 4 years ago

@jakobnissen Below are the last two records in my FASTQ file, from which I've removed all whitespace:

@fc5012f5-146b-40e0-9e6a-d95ae70ef6bbrunid=482e8b3504bc22862526cd3a2f2f2d2cef570478read=866ch=137start_time=2020-05-21T04:24:45Zflow_cell_id=FAN31322protocol_group_id=0521sample_id=3586
TGTTGTACTTCGTTCAGTTACGTATTGCTCACGTGTGAAAGAATTGGTGTATGCAGAGGTAAGTAGGTTCTAGTTTGTAGATTAACACACACGACTAGAGACTAGTGGCAATAAAACAAGAAGAAACAAACATTGTTCGTTTAGTTGTTAACAAGAACATCACTAGAAATAACAACTCTATTTGTTTCTCACCAATTATAAGGTCTACCTTTACTAAGAAGAGATAAAAATCATATCATTGATTTGACCTTCTTTTAAAAGACATAACAGCAGTACCCATAATTTGAAATTTACTCATGTCAAATAAGAATAGGAAGACAACTAAGTTGGTTTGTGATATAAAATATGTGAATTTTGCATGCACATGACATAACCATCTATTTGTTTCGCGTGGTTTGCCAAGATAGACATCAGTAAAAATGCTTCAGATGATGACGTTCACATTAGTAACAAGGCAATGCGTA
+
#$'(&,%$$$&0/33/9:5)&#$$$%+%,9897@/3/.;&)+0/1)+91&&'+/-/(%'-&')1&%,9782/68>:',,)-040.+($+&(',05/)'(2-1,*'4-.0011>@555),)/3*4<<;@@96<A@9D<@3JH>7<@E<>?59::;:<?;44<:<6;74LCA77>89B>>/'?>A5>=3-.+)$)233765((0-522+.-276-+5/120/55=B9HJ@@0(='84003214.58>?A>???95::.>?8+1>@5//2,2(,//.-..-)98:98.%'/6:=7/-.,-3,%8=CE@9:),3,(),/7;96/-)*'),-3&*065+-,%$$(8::<AE75-%'(37<?9@6++1..((*+*-(..2241:579:D?A<.6197B6:>@@122*+)2,((%*(%/0%1563'&%'--0@<<<;0/953(&(67/9..0,45114/1'5844&&*(%(
@7052050d-dade-401f-ae61-71b92f912682runid=482e8b3504bc22862526cd3a2f2f2d2cef570478read=967ch=125start_time=2020-05-21T04:24:44Zflow_cell_id=FAN31322protocol_group_id=0521sample_id=3586
CGGTGTACTTCGTTCAGTTACGTCTTCTTTCTCTGTTAACACTTCTGTAGGAAGTGTTCTCCCTCTAAAGAAGATAATTTGTGCCGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACACTTTCTGTACAATCCCCCTTTGAGTGCGTGACAAATGTTTCACCTAAATTCAAGACTTTAAAGTTTAAACTCCACCAATAATGATAGAAGTCCTTTACACAAAGCCAAAAATTTATTTCACAAGCTTAAGAAGTCTGAACACTCCCTTAATTTCCTTTGCACAGTGGTAAGATTTGATACCGACAATTTCACAAGCACAGGTTGAAGATAAATTTAACAATTTCCCAACCGTCTCGAAACTCTACACCTTCCACTTAAACTTCTCTTCAAGCCAATCAGGACGGGTTTGAGTTTTTCATAAACAATTATATGTTGGTTAGCCACTGTTATGAAAACAAACCCGTCTGGCGCCCCAGAAGGTATGAAACAAGTGTTAAATATTTTAATCGATGTGTCTAAGGCATTCTTCTTCTTTAATATACGGCTGTGTGACTTTGGTGGAGCTAAACAAAGCAATTTAGGTGTCACACTCAAGAGGATTATTTACAGAAAGTGTGTTAAATCGGAGAGAAACTATTTTGCATATACAAAAACTGAAATTATCTTCTTGAGGAAACTTGCTAAATGTTAACGAAGGCAAT
+
7&('#%((**(088*(17/%&&'&'(%&(2)7B=8=?3228----/1+'3($%)>8;53><@>1-0>?&9:1<>:()(('#'$$*).3)./553==81BE=49-735.1//)*.5??F?@5>6).)AA>/03/*)7;:@BBCA;<7632339;5,,//106@7**B>>?<>?..?@DB6;<2;>?C56*,-%577424-+591)+$86--45754644;:8,.',%*%%-%%%((564518<:@=.'**/1-==+%+-.2+(()8:',*%$,,,,38($'%3%%-3635<649<@984*)'6%&-$%$%)-,''&/000;=<=46;63347('&'%,>?;822-?<?>5;84$$(&(&05011679:7;=*,('++--6,.4=;>97%)&%*-797;=6648<+:68/4501443/?B>/0ACC=76C?=LJ;?6:AB7@),)726800=;(.2693331124,)$')%%%#'*-,59/*10&$&$'*)))3+0',',).-+)%$%&#*%())(&&&((((35230-,(%.-+,&')%+,-/,).((,#%%&&&$$$$&++(3)'%(+&%()&#$$&('&)+(*.000''&&+&)-(++6(.2)'%&//$&&%44-,+(350'*((-(',=8-=52%'+&4%*)(*%-.('/,,,--$''''&'&:;@.)&'&()%%.*1636;:-'$&($$%$$$$%$$&(.(-1))+)))'+2++$

Unfortunately, after removing the whitespace, I still got the same error.

godkin1211 commented 4 years ago

I solved this problem! The presence of '^M' symbols (carriage returns) in this FASTQ file causes the error, and FASTX works after I remove them.
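
For anyone hitting the same error, here is a minimal sketch of that workaround in plain Julia. It assumes the file is small enough to read into memory at once. The helper name `strip_carriage_returns`, the temporary file names, and the example record are all hypothetical and not part of FASTX.

```julia
# Hypothetical helper (not part of FASTX): copy a file while converting
# Windows line endings ("\r\n") to Unix line endings ("\n").
# Assumes the whole file fits in memory.
function strip_carriage_returns(src::AbstractString, dst::AbstractString)
    text = read(src, String)
    write(dst, replace(text, "\r\n" => "\n"))
    return dst
end

# Demonstration on a temporary file holding one made-up record.
src = tempname()
dst = tempname()
write(src, "@read1\r\nACGT\r\n+\r\nIIII\r\n")
strip_carriage_returns(src, dst)

cleaned = read(dst, String)
println(occursin("\r", cleaned))  # prints false
```

The cleaned copy can then be opened with `open(FASTQ.Reader, ...)` as in the examples above. For very large files, a streaming line-by-line rewrite would be a better fit than reading everything at once.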

jakobnissen commented 4 years ago

@godkin1211 Thank you for your help. This was indeed a bug in the FASTQ parser. Windows uses \r\n as its newline, whereas macOS and Linux use \n. It appears we had simply not tested that the parser worked correctly with Windows newlines.

The bug is fixed in #28, and I have added a test to make sure this does not happen in the future.
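
The newline difference described above can be illustrated in plain Julia. This is only a sketch with a made-up record; it does not reproduce the FASTX parser, it just shows where the extra `\r` byte ends up when code looks for `\n` alone.

```julia
# Made-up FASTQ record written with Windows line endings.
crlf_record = "@read1\r\nACGT\r\n+\r\nIIII\r\n"

# Splitting on '\n' alone leaves a trailing '\r' on every line.
naive_lines = split(crlf_record, '\n'; keepempty=false)

# Base.eachline strips both "\n" and "\r\n" by default.
clean_lines = collect(eachline(IOBuffer(crlf_record)))

println(repr(String(naive_lines[2])))  # prints "ACGT\r"
println(repr(clean_lines[2]))          # prints "ACGT"
```

Any parser that treats only `\n` as the record delimiter has to deal with that trailing `\r` explicitly, which is the kind of input a Windows-newline test case covers.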