enasequence / cramtools

CRAM format specification and java API for read data.
http://www.ebi.ac.uk/ena/about/cram_toolkit
Apache License 2.0
58 stars 21 forks source link

>= 129 auxiliary tags #29

Closed jkbonfield closed 9 years ago

jkbonfield commented 9 years ago

The htslib/test/xx#large_aux.sam has some sequences containing lots of auxiliary tags. If I cut it down to just the first read with 129 tags it fails, but at 128 it passes.

An example SAM file with 129 ??:i:1 entries is:

@SQ     SN:xx   LN:20
a1      16      xx      1       1       10M     *       0       0       AAAAAAAAAA      *     aa:i:1   ab:i:1  ac:i:1  ad:i:1  ae:i:1  af:i:1  ag:i:1  ah:i:1  ai:i:1  aj:i:1  ak:i:1  al:i:1am:i:1   an:i:1  ao:i:1  ap:i:1  aq:i:1  ar:i:1  as:i:1  at:i:1  au:i:1  av:i:1  aw:i:1  ax:i:1ay:i:1   az:i:1  ba:i:1  bb:i:1  bc:i:1  bd:i:1  be:i:1  bf:i:1  bg:i:1  bh:i:1  bi:i:1  bj:i:1bk:i:1   bl:i:1  bm:i:1  bn:i:1  bo:i:1  bp:i:1  bq:i:1  br:i:1  bs:i:1  bt:i:1  bu:i:1  bv:i:1bw:i:1   bx:i:1  by:i:1  bz:i:1  ca:i:1  cb:i:1  cc:i:1  cd:i:1  ce:i:1  cf:i:1  cg:i:1  ch:i:1ci:i:1   cj:i:1  ck:i:1  cl:i:1  cm:i:1  cn:i:1  co:i:1  cp:i:1  cq:i:1  cr:i:1  cs:i:1  ct:i:1cu:i:1   cv:i:1  cw:i:1  cx:i:1  cy:i:1  cz:i:1  da:i:1  db:i:1  dc:i:1  dd:i:1  de:i:1  df:i:1dg:i:1   dh:i:1  di:i:1  dj:i:1  dk:i:1  dl:i:1  dm:i:1  dn:i:1  do:i:1  dp:i:1  dq:i:1  dr:i:1ds:i:1   dt:i:1  du:i:1  dv:i:1  dw:i:1  dx:i:1  dy:i:1  dz:i:1  ea:i:1  eb:i:1  ec:i:1  ed:i:1ee:i:1   ef:i:1  eg:i:1  eh:i:1  ei:i:1  ej:i:1  ek:i:1  el:i:1  em:i:1  en:i:1  eo:i:1  ep:i:1eq:i:1   er:i:1  es:i:1  et:i:1  eu:i:1  ev:i:1  ew:i:1  ex:i:1  ey:i:1
jkb@seq3a[htslib/test] java -Xmx4000m -jar ~/work/cram/cramtools/cramtools-3.0.jar cram -R xx.fa -n -Q --capture-all-tags -I a.sam -O _.cram -l debug
DEBUG   2015-04-09 16:50:40     Bam2Cram        Creating tracks for reference: name=xx, length=20.

INFO    2015-04-09 16:50:40     FixBAMFileHeader        Reference sequence MD5 not found, calculating: xx
DEBUG   2015-04-09 16:50:40     Bam2Cram        Writing 1 records for sequence 0, xx
INFO    2015-04-09 16:50:40     Bam2Cram        create: tracks 0ms, records 8ms, mating 0ms.
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to bases: 0
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to quality scores: 1
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to read names: 2
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Assigned external id to mate info: 3
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for NF_RecordsToNextFragment: GAMMA[1]
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for read feature position: GAMMA[1]
DEBUG   2015-04-09 16:50:40     SubstitutionMatrix      Subs matrix: [27, 27, 27, 27, 27]: 0001101100011011000110110001101100011011
DEBUG   2015-04-09 16:50:40     SubstitutionMatrix      Subs matrix decoded: A:CGTN     C:AGTNG:ACTN   T:ACGN  N:ACGT
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for HC_HardClip: GAMMA[0]
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        Best encoding for PD_padding: GAMMA[0]
DEBUG   2015-04-09 16:50:40     CompressionHeaderFactory        NS: HUFFMAN:[0x01 0xff 0xff 0xff 0xff 0xff 0x01 0x00]
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Buffer.java:492)
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:135)
        at net.sf.cram.io.ByteBufferUtils.readUnsignedITF8(ByteBufferUtils.java:305)
        at net.sf.cram.encoding.HuffmanByteEncoding.fromByteArray(HuffmanByteEncoding.java:72)
        at net.sf.cram.encoding.writer.DataWriterFactory.createWriter(DataWriterFactory.java:87)
        at net.sf.cram.encoding.writer.DataWriterFactory.buildWriter(DataWriterFactory.java:54)
        at net.sf.cram.build.ContainerFactory.buildSlice(ContainerFactory.java:194)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:89)
        at net.sf.cram.build.ContainerFactory.buildContainer(ContainerFactory.java:58)
        at net.sf.cram.Bam2Cram.main(Bam2Cram.java:369)
        ... 6 more

This is strictly an encoder issue as cramtools can decode the data generated from htslib.

vadimzalunin commented 9 years ago

Sounds like signed bytes in java. ... Found a not used TC_tagCount encoding key, probably from old times. Even though it had no effect the CHF tried to set some encoding params to it, failing because TC was typed as byte... fix d32e1125b7a8dbd86f8fe7dd9f06358d4b0ad649

jkbonfield commented 9 years ago

On Fri, Apr 17, 2015 at 03:28:11AM -0700, Vadim Zalunin wrote:

Sounds like signed bytes in java. ... Found a not used TC_tagCount encoding key, probably from old times. Even though it had no effect the CHF tried to set some encoding params to it, failing because TC was typed as byte... fix d32e1125b7a8dbd86f8fe7dd9f06358d4b0ad649

Unfortunately the fix only partially works. It fixes the xx#large_aux.sam file correctly, but in doing so breaks compatibility with Cramtools 2.1 outputs:

jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-2.1.jar cram -R xx.fa -I xx#rg.sam -O _tmp.cram -n -Q --capture-all-tags jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar bam -R xx.fa -I _tmp.cram -O _tmp.sam Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.sf.cram.CramTools.invoke(CramTools.java:93) at net.sf.cram.CramTools.main(CramTools.java:123) Caused by: java.lang.RuntimeException: Unknown encoding key: TC at net.sf.cram.structure.CompressionHeader.read(CompressionHeader.java:173) at net.sf.cram.structure.CompressionHeader.read(CompressionHeader.java:120) at net.sf.cram.structure.CompressionHeaderBLock.(CompressionHeaderBLock.java:50) at net.sf.cram.build.CramIO.readContainer(CramIO.java:456) at net.sf.cram.build.CramIO.readContainer(CramIO.java:424) at net.sf.cram.build.CramIO.readContainer(CramIO.java:245) at net.sf.cram.Cram2Bam.main(Cram2Bam.java:213) ... 6 more

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

vadimzalunin commented 9 years ago

of course, it creeped into the 2.1 files already... The exception can be ignored i guess, what does scramble do if it encounters an unknown encoding key?

jkbonfield commented 9 years ago

On Fri, Apr 17, 2015 at 05:01:40AM -0700, Vadim Zalunin wrote:

of course, it creeped into the 2.1 files already... The exception can be ignored i guess, what does scramble do if it encounters an unknown encoding key?

It appears so yes! I agree it's probably fair to just ignore them. It's similar to having an extra auxiliary tag in SAM that you don't understand.

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

vadimzalunin commented 9 years ago

done: bd4c5fd202fe669d4b78837806f9fe581f070712

jkbonfield commented 9 years ago

Sorry Vadim, it doesn't seem to work still:

jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-2.1.jar cram -R xx.fa -I xx#rg.sam -O _tmp.cram -n -Q --capture-all-tags
jkb@seq3a[htslib/test] /software/bin/java -Xmx4000m -jar /nfs/users/nfs_j/jkb/work/cram/cramtools/cramtools-3.0.jar bam -R xx.fa -I _tmp.cram
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.sf.cram.CramTools.invoke(CramTools.java:93)
        at net.sf.cram.CramTools.main(CramTools.java:123)
Caused by: net.sf.samtools.SAMFormatException: Unrecognized tag type:
        at net.sf.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:403)
        at net.sf.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:325)
        at net.sf.cram.structure.SliceIO.parseSliceHeaderBlock(SliceIO.java:59)
        at net.sf.cram.structure.SliceIO.readSliceHeadBlock(SliceIO.java:39)
        at net.sf.cram.build.CramIO.readContainer(CramIO.java:467)
        at net.sf.cram.build.CramIO.readContainer(CramIO.java:424)
        at net.sf.cram.build.CramIO.readContainer(CramIO.java:245)
        at net.sf.cram.Cram2Bam.main(Cram2Bam.java:213)
        ... 6 more

If it's any commiseration, Rob's turn the bug-hose onto io_lib with a fuzz tester and is identifying lots of crashes!

vadimzalunin commented 9 years ago

seems to work for xx#rg.full.sam, is xx#rg.sam any different?

vadimzalunin commented 9 years ago

ah, you mean between versions...

vadimzalunin commented 9 years ago

fixed in cram3 branch. I think there is a bug in 2.1 which causes 3 extra bytes in the slice header, looks like int vs itf8.

jkbonfield commented 9 years ago

Ah sorry that was a cut and paste fail! Yes it works in version 3. I'll close this then. Thanks.