BioJulia / XAM.jl

Parse and process SAM and BAM formatted files
MIT License
27 stars 13 forks source link

Add new interfaces for manipulating auxilary BAM fields #17

Open jakobnissen opened 5 years ago

jakobnissen commented 5 years ago

Add new interfaces for manipulating auxilary BAM fields

I've been working a bit with BAM files using BioAlignments the past days and have been looking through the source code. During the work, there was a few function I found that I missed from BioAlignments which would have my work easier. They would all be quite easy to implement, and I'd love to pitch a PR with the new interfaces + tests, but being interfaces, they're hard to change once released and very annoying to get wrong. So, let's discuss them first!

Phase out AuxData, or make it view-based.

Currently, an AuxData object can be created from the auxiliary data of a BAM record. In the current implementation It necessitates a copy of the data from the record, meaning that the user can manipulate the AuxData and seemingly succeeding, but without the record actually changing. Furthermore, most interfaces (all except iterate, AFAIK) for aux fields is currently implemented to work on Vector{UInt8} or on the records directly, making the AuxData type mostly obsolete. I suggest removing the object alltogether to avoid code duplication of functions acting on AuxData and Records and confusion about what the user is mutating. If this is not acceptable, then making the AuxData view-based.

get(record::BAM.Record, key::AbstractString, default)

Records work sort of like AbstractDict. In fact, the implemented AuxData type is a subtype of AbstractDict. But asking for a value with a default is very awkward for the user. I suggest adding this method, which is to completely mirror how it works for Dicts.

delete!(record::BAM.Record, key::AbstractString)

Actually changing the auxiliary fields of BAM.Record objects is difficult. Again, this one should simply mirror the similar method for Dicts. One could also think about adding pop!.

setindex!(record::BAM.Record, key::AbstractString, value)

This would also be useful to have. The value would be encoded and appended to the end of the data array. Probably a few different methods would have to be written, since the type of Value should only be an UInt8, Int32, Float32, String, or Vector{T} where T can be UInt8, Int8, UInt16, Int16, UInt32, Int32 or Float32.

Some new interface for making headers more easy

People can make all kinds of custom headers, but usually they just want the @SQ headers updated plus whatever non-SQ headers the program created. It's worth noting that BioAlignments make no checks for the header and records matching together, gladly writing a healthy-looking but unparsable (by htslib) BAM.file, so making it easy to not mess up the headers is rather important. I suggest making it easy to update the SQ headers. I'd like any suggestion here. I can imagine two new constructors for SAM.Header, namely SAM.Header(h::Sam.Header, records::Vector{BAM.Record}) and SAM.Header(records::Vector{BAM.Record}), the latter creating a header with the appropriate "SQ" headers, the former with the same, but keeping any non-SQ tags already present.

jakobnissen commented 1 week ago

Once Julia 1.10 support is dropped (potentially years from now), we should use https://github.com/BioJulia/XAMAuxData.jl for this.

kescobo commented 1 week ago

I think if there's a reasonable case to be made, we don't need to go on supporting LTS with future versions.

If there's someone that comes saying "I absolutely need LTS", they can make the case, but I don't think it's worth accommodating that hypothetical user if there are concrete benefits to actual users right now.