BioJulia / Bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia
http://biojulia.dev
MIT License
261 stars 65 forks source link

ERROR: BGZFStreams.BGZFDataError("invalid gzip identifier") #384

Closed bdeonovic closed 7 years ago

bdeonovic commented 7 years ago
julia> reader = open(BAMReader, "/Users/bdeonovic/tgs/kallisto/alz.sgs.bam")
ERROR: BGZFStreams.BGZFDataError("invalid gzip identifier")
 in read_bgzf_block!(::IOStream, ::Array{UInt8,1}) at /Users/bdeonovic/.julia/v0.5/BGZFStreams/src/bgzfstream.jl:404
 in read_blocks!(::BGZFStreams.BGZFStream{IOStream}) at /Users/bdeonovic/.julia/v0.5/BGZFStreams/src/bgzfstream.jl:360
 in ensure_buffered_data at /Users/bdeonovic/.julia/v0.5/BGZFStreams/src/bgzfstream.jl:323 [inlined]
 in read(::BGZFStreams.BGZFStream{IOStream}, ::Type{UInt8}) at /Users/bdeonovic/.julia/v0.5/BGZFStreams/src/bgzfstream.jl:239
 in init_bam_reader(::IOStream) at /Users/bdeonovic/.julia/v0.5/Bio/src/align/hts/bam/reader.jl:77
 in #BAMReader#32(::Void, ::Type{T}, ::IOStream) at /Users/bdeonovic/.julia/v0.5/Bio/src/align/hts/bam/reader.jl:30
 in #open#2(::Array{Any,1}, ::Function, ::Type{Bio.Align.BAMReader}, ::String) at /Users/bdeonovic/.julia/v0.5/Bio/src/IO.jl:58
 in open(::Type{Bio.Align.BAMReader}, ::String) at /Users/bdeonovic/.julia/v0.5/Bio/src/IO.jl:58

First few lines using samtools view alz.sgs.bam

K00275:35:HG5YTBBXX:8:1101:1428:1156    77      *       0       0       *       *       0       0       NTCCTTCTCAAATTTTTCAATGGTTCTTTTGTCGANGCCACCGCATTTATAGATCAGATGGCCAGTAGTGGTGGACTTGCCCGAATCTACGTGTCCAAGGATGACAATGTGGCTAAGAGAGATCGGAAAAGGACACAGCGGAACACGAGNC #AAFAF-JFJJJJFFFJJFJJF<JFJFF-FJ-F77#<A-A<<-77-<-A-7-F-A<-<--<7F7-77AF-F-----7AFA<FJJJ7-<AFJ-F-777F-77-7-FJA-<-7<-77--7A-F<77--------7A----)-7)-7-7-<7#< YT:Z:UP
K00275:35:HG5YTBBXX:8:1101:1428:1156    141     *       0       0       *       *       0       0       NTCNTATCAACATTGTCGTCATTGGACACGTAGATTCGGGCAAGTCCACCACTACTGGCCATCTGATCTATAAATGCGGTGGCATCGACAAAAGAACCATTGAAAAATTTGANAAGGAGAGATCGGAAGAGCGTCGTGTAGGGGAAGAGTG #AA#AA7FJFFJFFJJJJJ<FJFFJJJ<J--FJAJJJFAFFAAJ<AAFFJA<JFJJJJ<AJFAFFF-7<-AJFJAFJ7J--AJFFJAJJJ<FJJJFJJAA-7AFFJFF-7-A#FF-<---)<A<---7)--7)))<<7----))7A)-7-- YT:Z:UP
K00275:35:HG5YTBBXX:8:1101:1631:1156    83      NM_001288716    3131    0       150M1S  =       3053    -229    ANTAGAAGGCTAAAGTGATCCTAAACATGACAGTGGGCAGCACTTTTCTAGCGTGAGCTGTAGAGTAACGAGAAGTGCTTTATACTGAACGTGGTTGATGGGAGGAGAGACGAGGNATTCGGGCCGGTGGGGCGTAAGTGTTATCGTTAAN A#77)---F7---<A---FJF<AFA-F-A-F7F-A-AA-FA-A-<77-7-F7JJJA<-AFJJJJFFJAAJAJFJJJFJJJF7<<-F<-7-JJJF-FAJA<FJJJFA<AF7-JFJF#<J7AJJJJJJJJFJJJJJFFJJJJJJJAFJFFAA# AS:i:-20        ZS:i:-20        XN:i:0  XM:i:6  XO:i:0  XG:i:0  NM:i:6  MD:Z:1G5A17T17C71C22G11 YS:i:-21        YT:Z:CP NH:i:3
K00275:35:HG5YTBBXX:8:1101:1631:1156    163     NM_001288716    3053    0       1S147M3S        =       3131    229     NGAAGGCTGTGATAACCAGTAAGGAAATATTAAGAGCTATTTTAGAAAGCTAAATGAATCGCGAGTTAACTTGGAAATCAGTAGAAAGCTAAAGTGATCCTAAATATGACAGNGGGCAGCACCTTTCGAGCGTAAGCGGTAGAGAAACCAA #AAAAJ-FJF<JAJJFJJFJJJJJJFFF-J-<FAJJJJJF7F<JJJJJJJFJJJFJJJJ7JJ-7AJ-JFJAFJJJJFJJ7FA<FJJF7F-AA<F-7F7F--AAF-<FAA--<#<JAF<<AAA))----<<A<--7-7--<))AA-777-7- AS:i:-21        ZS:i:-21        XN:i:0  XM:i:6  XO:i:0  XG:i:0  NM:i:6  MD:Z:61A49T14T5G3T6T3   YS:i:-20        YT:Z:CP NH:i:3
K00275:35:HG5YTBBXX:8:1101:1631:1156    339     NM_001288717    3496    0       150M1S  =       3418    -229    ANTAGAAGGCTAAAGTGATCCTAAACATGACAGTGGGCAGCACTTTTCTAGCGTGAGCTGTAGAGTAACGAGAAGTGCTTTATACTGAACGTGGTTGATGGGAGGAGAGACGAGGNATTCGGGCCGGTGGGGCGTAAGTGTTATCGTTAAN A#77)---F7---<A---FJF<AFA-F-A-F7F-A-AA-FA-A-<77-7-F7JJJA<-AFJJJJFFJAAJAJFJJJFJJJF7<<-F<-7-JJJF-FAJA<FJJJFA<AF7-JFJF#<J7AJJJJJJJJFJJJJJFFJJJJJJJAFJFFAA# AS:i:-20        ZS:i:-20        XN:i:0  XM:i:6  XO:i:0  XG:i:0  NM:i:6  MD:Z:1G5A17T17C71C22G11 YS:i:-21        YT:Z:CP NH:i:3

When I truncated my BAM to just be the first few hundred reads I don't have this error. So it must occur somewhere deeper in the BAM. Any idea of how to handle this?

bicycle1885 commented 7 years ago

The exception is thrown here (https://github.com/BioJulia/BGZFStreams.jl/blob/7c65d67451485e48a076c1907eb996a169295ba1/src/bgzfstream.jl#L403-L404), which checks the magic bytes of gzip-compressed data stream. This may be caused by a corrupted file or by a bug of the BAM/BGZF reader. I'd like to check the file if possible.

bdeonovic commented 7 years ago

I'm not sure how to provide you with the file, since the BAM file is over 100gb. When I made a small truncated version of the BAM file I did not have any issues with opening a reader.

bicycle1885 commented 7 years ago

Ugh, okay. Is that file is the only one in which you see this problem or does this happen often for such large files?

bdeonovic commented 7 years ago

I've tried BAMReader with only 3 BAMs so far. The first one which was small (2.8GB) worked fine. The other two (138GB and 220GB) had the same error above.

The first came from GMAP, the other two came from hisat2.

bicycle1885 commented 7 years ago

I've never tried such large files (>100GB) before. I'll investigate them soon.

bdeonovic commented 7 years ago

I downsampled (using samtools view -b -s 0.1) the 138GB file down to 2.3GB. I was able to open it then. Perhaps issue with very large files?

bdeonovic commented 7 years ago

Also tried on a 29GB file and it didn't work.

bdeonovic commented 7 years ago

I'd really like to get this working, but I am not sure how to help. Anybody have an idea?

bicycle1885 commented 7 years ago

Sorry for being late. I applied BAMReader to a large file of 43GB (~1.5 billion records, the largest file I can access immediately) but I couldn't reproduce the problem. Just scanning a large file is not the cause of this problem I think. Can you paste a small reproducible code for me?

bdeonovic commented 7 years ago

So I was a bit skeptical about the file size so I did Samtools view -b large.bam > large2.bam and the resulting file was magnitudes smaller! The one I tested just went from 29GB to 1.6GB. What happened? Also I can access it now. Perhaps it got sorted which reduced file size after compression?

bicycle1885 commented 7 years ago

I cannot believe such size reduction is achievable. Are your BAM files really BAM? Can you show me the head of hexdump like below?

~/.j/v/Bio (masked-seq↑2|…) $ hexdump -C test/BioFmtSpecimens/BAM/bam1.bam | head
00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  5e 00 73 72 f4 65 34 66  60 60 70 08 0e e4 0c f6  |^.sr.e4f``p.....|
00000020  b3 32 e4 f4 f1 b3 32 32  b6 b4 34 31 e0 72 08 70  |.2....22..41.r.p|
00000030  e7 f4 74 b1 4a 2a 4f e4  0c f0 03 53 61 7e 56 06  |..t.J*O....Sa~V.|
00000040  7a 66 7a 46 ba 45 86 46  66 5c 8c 40 6d 4c 40 6c  |zfzF.E.Ff\.@mL@l|
00000050  c8 e0 b2 92 99 01 00 92  e2 da 4d 49 00 00 00 1f  |..........MI....|
00000060  8b 08 04 00 00 00 00 00  ff 06 00 42 43 02 00 9e  |...........BC...|
00000070  44 bd bd cb 92 dc 48 b6  2d 16 5d 66 75 c8 66 9d  |D.....H.-.]fu.f.|
00000080  6a 26 3c 9d 4c 8f 07 1e  ee 70 00 9e 03 49 4c 66  |j&<.L....p...ILf|
00000090  92 4c b6 69 50 40 43 76  70 4d b7 25 dd 87 a4 36  |.L.iP@CvpM.%...6|
bdeonovic commented 7 years ago

The 29GB file:

00000000  40 48 44 09 56 4e 3a 31  2e 30 09 53 4f 3a 75 6e  |@HD.VN:1.0.SO:un|
00000010  73 6f 72 74 65 64 0a 40  53 51 09 53 4e 3a 45 4e  |sorted.@SQ.SN:EN|
00000020  53 54 30 30 30 30 30 34  35 36 33 32 38 2e 32 7c  |ST00000456328.2||
00000030  45 4e 53 47 30 30 30 30  30 32 32 33 39 37 32 2e  |ENSG00000223972.|
00000040  35 7c 4f 54 54 48 55 4d  47 30 30 30 30 30 30 30  |5|OTTHUMG0000000|
00000050  30 39 36 31 2e 32 7c 4f  54 54 48 55 4d 54 30 30  |0961.2|OTTHUMT00|
00000060  30 30 30 33 36 32 37 35  31 2e 31 7c 44 44 58 31  |000362751.1|DDX1|
00000070  31 4c 31 2d 30 30 32 7c  44 44 58 31 31 4c 31 7c  |1L1-002|DDX11L1||
00000080  31 36 35 37 7c 70 72 6f  63 65 73 73 65 64 5f 74  |1657|processed_t|
00000090  72 61 6e 73 63 72 69 70  74 7c 09 4c 4e 3a 31 36  |ranscript|.LN:16|

The 1.6GB file

00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  f0 2d a5 7d db ae 1d c7  91 25 fa d1 f3 2f 87 c8  |.-.}.....%.../..|
00000020  8c c8 ab 9f fa 90 94 6c  b5 49 ea 0c 49 d9 fd 66  |.......l.I..I..f|
00000030  78 24 4d 43 c0 98 32 24  f5 c3 00 f5 0b f3 cf 13  |x$MC..2$........|
00000040  91 75 cb cb da 5b bb aa  6c 08 a8 cb e1 ce 55 91  |.u...[..l.....U.|
00000050  91 91 71 cf d7 cf ef ff  ed 0f d3 ff fb b7 7f ff  |..q.............|
00000060  f3 db 3f fc f5 c3 1f ed  2b f3 87 4f df fe f1 bf  |..?.....+..O....|
00000070  bf fc fa f3 2f bf fd f8  c3 ff f8 f7 4f ff f3 0f  |..../.......O...|
00000080  9f 3e fc f1 ab 0f 9f 3e  1b fd 9f f3 81 29 bd a2  |.>.....>.....)..|
00000090  49 9e fc a9 3c 21 e2 1c  e9 95 9f be fd fc f9 cf  |I...<!..........|

and using samtools view [bamfile] | wc I got the same number of lines in each file. The reference names are very long ugly names (I aligned to a transcriptome):

SRSIM10003      83      ENST00000492467.5|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000257359.5|ATXN2-009|ATXN2|887|protein_coding|      235     1       150M    =       235     -150    GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII     AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150        YS:i:0  YT:Z:CP NH:i:2
SRSIM10003      163     ENST00000492467.5|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000257359.5|ATXN2-009|ATXN2|887|protein_coding|      235     1       150M    =       235     -150    GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII     AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150        YS:i:0  YT:Z:CP NH:i:2
SRSIM10003      339     ENST00000550236.1|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000404988.3|ATXN2-019|ATXN2|704|protein_coding|      490     1       150M    =       490     -150    GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII     AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150        YS:i:0  YT:Z:CP NH:i:2
SRSIM10003      419     ENST00000550236.1|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000404988.3|ATXN2-019|ATXN2|704|protein_coding|      490     1       150M    =       490     -150    GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII     AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150        YS:i:0  YT:Z:CP NH:i:2
bicycle1885 commented 7 years ago

The former is SAM, not BAM. That's the reason why you couldn't read it. I guess you accidentally created it with the .bam extension.

bdeonovic commented 7 years ago

AHHH! Thank you! totally makes sense now