BioJulia / XAM.jl

Parse and process SAM and BAM formatted files
MIT License
27 stars 13 forks source link

[WIP] convert SAM to BAM #61

Closed jonathanBieler closed 6 months ago

jonathanBieler commented 1 year ago

I'm trying to convert SAM record to a BAM one. I think I got the data layout roughly correct but the cigar and sequences need to be encoded differently or something. For context I now have paired read alignement working in BurrowsWheelerAligner.jl, so it would be possible to do a FASTQ to BAM & results directly in Julia. In general it would be nice to have more support for creating BAM alignment from scratch, modify them, etc. (https://discourse.julialang.org/t/change-record-sequence-in-a-bam-file-using-xam/98141/2) and I think converting SAM to BAM is a good starting point.

Any help would be appreciated.

sam = "seq1\t81\tPhiX\t1051\t60\t70M\t=\t1821\t702\tTCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC\t*\tNM:i:0\tMD:Z:70\tMC:Z:70M\tAS:i:70\tXS:i:0"

r = XAM.SAM.Record(sam)

cigar_ops = ('M','I','D','N','S','H','P','=','X')

br = BAM.Record()

# fixed-length fields (see BMA specs for the details)
br.refid = 1 #TODO : get from header
br.pos = SAM.position(r)
br.l_read_name = SAM.tempname(r) |> x -> length(x)+1 #why do I need + 1 ?
br.mapq = SAM.mappingquality(r)
#br.bin
br.n_cigar_op = sum(x ∈ cigar_ops for x in SAM.cigar(r))
br.flag = SAM.flag(r)
br.l_seq = SAM.seqlength(r)
br.next_refid = 1 #TODO : get from header
br.next_pos = SAM.nextposition(r)
br.tlen = SAM.templength(r)

# variable length data

t = x -> unsafe_wrap(Vector{UInt8}, x)

br.data = vcat(
    SAM.tempname(r) |> t,
    SAM.cigar(r) |> t,
    SAM.sequence(String, r) |> t,
    fill(0x41, length(SAM.sequence(r))),
    #SAM.quality(r),
    reduce(vcat, r.data[f] for f in r.fields) # aux data
)
br.block_size = length(br.data)

br
XAM.BAM.Record:
      template name: seq1
               flag: 81
       reference ID: 2
           position: 1052
    mapping quality: 60
              CIGAR: 70599891M
  next reference ID: 2
      next position: 1822
    template length: 702
           sequence: RGRGGVGVGMRGRGGMGMRGRGGVGMRGGVGVRGGMGAGVGARGRGGVGVRGGMGVRGGMRGRGGARGRG
       base quality: UInt8[0x41, 0x43, 0x43, 0x41, 0x54, 0x54, 0x54, 0x43, 0x41, 0x41  …  0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41]
     auxiliary data: AA=A AA=A AA=A AA=A AA=A AA=A AA=A AA=AERROR: invalid type tag: 'M'
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] auxtype
    @ ~/.julia/packages/XAM/l6LT2/src/bam/auxdata.jl:63 [inlined]
  [3] loadauxtype
    @ ~/.julia/packages/XAM/l6LT2/src/bam/auxdata.jl:82 [inlined]
kescobo commented 1 year ago

Just to make sure this doesn't get lost in the slackhole, @jakobnissen said:

You need to encode the sequences differently. BAM uses the same sequence encoding as LongDNA{4} from BioSequences.jl

But yes, I've been wanting to do a rehaul of XAM.jl for some time now. It's a little tricky to use correctly.

A few more points:

  • l_read_name has +1, because the read name needs to be NULL-terminated in the BAM record
  • Your CIGAR function does not work: 1M1X1M is different from 2M1X .
  • block_size also includes the length of the fields in the BAM record itself, not only the vector