Closed bdeonovic closed 7 years ago
The exception is thrown here (https://github.com/BioJulia/BGZFStreams.jl/blob/7c65d67451485e48a076c1907eb996a169295ba1/src/bgzfstream.jl#L403-L404), which checks the magic bytes of gzip-compressed data stream. This may be caused by a corrupted file or by a bug of the BAM/BGZF reader. I'd like to check the file if possible.
I'm not sure how to provide you with the file, since the BAM file is over 100gb. When I made a small truncated version of the BAM file I did not have any issues with opening a reader.
Ugh, okay. Is that file is the only one in which you see this problem or does this happen often for such large files?
I've tried BAMReader with only 3 BAMs so far. The first one which was small (2.8GB) worked fine. The other two (138GB and 220GB) had the same error above.
The first came from GMAP, the other two came from hisat2.
I've never tried such large files (>100GB) before. I'll investigate them soon.
I downsampled (using samtools view -b -s 0.1
) the 138GB file down to 2.3GB. I was able to open it then. Perhaps issue with very large files?
Also tried on a 29GB file and it didn't work.
I'd really like to get this working, but I am not sure how to help. Anybody have an idea?
Sorry for being late. I applied BAMReader to a large file of 43GB (~1.5 billion records, the largest file I can access immediately) but I couldn't reproduce the problem. Just scanning a large file is not the cause of this problem I think. Can you paste a small reproducible code for me?
So I was a bit skeptical about the file size so I did Samtools view -b large.bam > large2.bam
and the resulting file was magnitudes smaller! The one I tested just went from 29GB to 1.6GB. What happened? Also I can access it now. Perhaps it got sorted which reduced file size after compression?
I cannot believe such size reduction is achievable. Are your BAM files really BAM? Can you show me the head of hexdump like below?
~/.j/v/Bio (masked-seq↑2|…) $ hexdump -C test/BioFmtSpecimens/BAM/bam1.bam | head
00000000 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 |............BC..|
00000010 5e 00 73 72 f4 65 34 66 60 60 70 08 0e e4 0c f6 |^.sr.e4f``p.....|
00000020 b3 32 e4 f4 f1 b3 32 32 b6 b4 34 31 e0 72 08 70 |.2....22..41.r.p|
00000030 e7 f4 74 b1 4a 2a 4f e4 0c f0 03 53 61 7e 56 06 |..t.J*O....Sa~V.|
00000040 7a 66 7a 46 ba 45 86 46 66 5c 8c 40 6d 4c 40 6c |zfzF.E.Ff\.@mL@l|
00000050 c8 e0 b2 92 99 01 00 92 e2 da 4d 49 00 00 00 1f |..........MI....|
00000060 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 9e |...........BC...|
00000070 44 bd bd cb 92 dc 48 b6 2d 16 5d 66 75 c8 66 9d |D.....H.-.]fu.f.|
00000080 6a 26 3c 9d 4c 8f 07 1e ee 70 00 9e 03 49 4c 66 |j&<.L....p...ILf|
00000090 92 4c b6 69 50 40 43 76 70 4d b7 25 dd 87 a4 36 |.L.iP@CvpM.%...6|
The 29GB file:
00000000 40 48 44 09 56 4e 3a 31 2e 30 09 53 4f 3a 75 6e |@HD.VN:1.0.SO:un|
00000010 73 6f 72 74 65 64 0a 40 53 51 09 53 4e 3a 45 4e |sorted.@SQ.SN:EN|
00000020 53 54 30 30 30 30 30 34 35 36 33 32 38 2e 32 7c |ST00000456328.2||
00000030 45 4e 53 47 30 30 30 30 30 32 32 33 39 37 32 2e |ENSG00000223972.|
00000040 35 7c 4f 54 54 48 55 4d 47 30 30 30 30 30 30 30 |5|OTTHUMG0000000|
00000050 30 39 36 31 2e 32 7c 4f 54 54 48 55 4d 54 30 30 |0961.2|OTTHUMT00|
00000060 30 30 30 33 36 32 37 35 31 2e 31 7c 44 44 58 31 |000362751.1|DDX1|
00000070 31 4c 31 2d 30 30 32 7c 44 44 58 31 31 4c 31 7c |1L1-002|DDX11L1||
00000080 31 36 35 37 7c 70 72 6f 63 65 73 73 65 64 5f 74 |1657|processed_t|
00000090 72 61 6e 73 63 72 69 70 74 7c 09 4c 4e 3a 31 36 |ranscript|.LN:16|
The 1.6GB file
00000000 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 |............BC..|
00000010 f0 2d a5 7d db ae 1d c7 91 25 fa d1 f3 2f 87 c8 |.-.}.....%.../..|
00000020 8c c8 ab 9f fa 90 94 6c b5 49 ea 0c 49 d9 fd 66 |.......l.I..I..f|
00000030 78 24 4d 43 c0 98 32 24 f5 c3 00 f5 0b f3 cf 13 |x$MC..2$........|
00000040 91 75 cb cb da 5b bb aa 6c 08 a8 cb e1 ce 55 91 |.u...[..l.....U.|
00000050 91 91 71 cf d7 cf ef ff ed 0f d3 ff fb b7 7f ff |..q.............|
00000060 f3 db 3f fc f5 c3 1f ed 2b f3 87 4f df fe f1 bf |..?.....+..O....|
00000070 bf fc fa f3 2f bf fd f8 c3 ff f8 f7 4f ff f3 0f |..../.......O...|
00000080 9f 3e fc f1 ab 0f 9f 3e 1b fd 9f f3 81 29 bd a2 |.>.....>.....)..|
00000090 49 9e fc a9 3c 21 e2 1c e9 95 9f be fd fc f9 cf |I...<!..........|
and using samtools view [bamfile] | wc
I got the same number of lines in each file. The reference names are very long ugly names (I aligned to a transcriptome):
SRSIM10003 83 ENST00000492467.5|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000257359.5|ATXN2-009|ATXN2|887|protein_coding| 235 1 150M = 235 -150 GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP NH:i:2
SRSIM10003 163 ENST00000492467.5|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000257359.5|ATXN2-009|ATXN2|887|protein_coding| 235 1 150M = 235 -150 GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP NH:i:2
SRSIM10003 339 ENST00000550236.1|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000404988.3|ATXN2-019|ATXN2|704|protein_coding| 490 1 150M = 490 -150 GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP NH:i:2
SRSIM10003 419 ENST00000550236.1|ENSG00000204842.14|OTTHUMG00000133475.8|OTTHUMT00000404988.3|ATXN2-019|ATXN2|704|protein_coding| 490 1 150M = 490 -150 GAGTTTCTGCTGGGAGGGGTTCCATATCCAGTGGCCTAGAATTTGTATCCCACAACCCACCCAGTGAAGCAGCTACTCCTCCAGTAGCAAGGACCAGTCCCTCGGGGGGAACGTGGTCATCAGTGGTCAGTGGGGCTAAAGATTCCAGGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP NH:i:2
The former is SAM, not BAM. That's the reason why you couldn't read it. I guess you accidentally created it with the .bam extension.
AHHH! Thank you! totally makes sense now
First few lines using
samtools view alz.sgs.bam
When I truncated my BAM to just be the first few hundred reads I don't have this error. So it must occur somewhere deeper in the BAM. Any idea of how to handle this?