Open jakobnissen opened 4 years ago
Changed printing of records: Before, empty record:
XAM.BAM.Record:
template name: nothing
flag: 4
reference ID: nothing
position: nothing
mapping quality: nothing
CIGAR:
next reference ID: nothing
next position: nothing
template length: nothing
sequence: nothing
base quality: nothing
auxiliary data:
After, empty record
XAM.BAM.Record:
template name:
flag: 0x0004
reference ID:
position:
mapping quality:
CIGAR:
next reference ID:
next position:
template length:
sequence:
base quality:
auxiliary data:
Before, filled record:
XAM.BAM.Record:
template name: HWI-1KL120:88:D0LRBACXX:1:1101:2205:2204
flag: 83
reference ID: 1
position: 132616
mapping quality: 60
CIGAR: 101M
next reference ID: 1
next position: 132491
template length: 226
sequence: GGTCCCACCTTGTCCTCCTCCTACACATACTCGGATGCTTCCTCCTCAACCTTGGCACCCACCTCCTTCTTACTGGGCCCAGGAGCCTTCAATGCCCAGGA
base quality: UInt8[0x21, 0x21, 0x21, 0x1f, 0x1e, 0x22, 0x1d, 0x1e, 0x1e, 0x1e … 0x27, 0x25, 0x25, 0x25, 0x25, 0x25, 0x25, 0x22, 0x22, 0x22]
auxiliary data: XT=U NM=3 SM=37 AM=37 X0=1 X1=0 XM=3 XO=0 XG=0 MD=4A4C82A8
After, filled record:
XAM.BAM.Record:
template name: "HWI-1KL120:88:D0LRBACXX:1:1101:2205:2204"
flag: 0x0053
reference ID: 1
position: 132616
mapping quality: 0x3c
CIGAR: "101M"
next reference ID: 1
next position: 132491
template length: 226
sequence: GGTCCCACCTTGTCCTCCTCCTACACAT…CTGGGCCCAGGAGCCTTCAATGCCCAGGA
base quality: ▆▆▆▆▆▆▄▆▆▆▆▆▆▇▇▇▇▇▇▇▆▆▄▆▄▄▆▆…▇▇██████████████▇▇▇▇▇▇▇▇▇▇▆▆▆
auxiliary data: XT='U' NM=0x03 SM=0x25 AM=0x25 X0=0x01 X1=0x00 XM=0x03 XO=0x00 XG=0x00 MD="4A4C82A8"
Should we even have has-functions, or just return a specific sentinel value when the record doesn't have the information available, e.g. nothing? The latter may be neater.
I think the has and is functions that provide an interpretation of the data/record are useful; the clear case being the flag field. However, the has functions are not necessary for checking the presence of a value in a mandatory field.
Would it be accurate to say that we're attempting to balance out the issue of consistency between SAM and BAM: whether it is useful to have consistency at the data-level or whether it is sufficient to have consistency at the print-level? Suppose we're to have better consistency between SAM and BAM, some default null values encountered need to be translated. If these translations occur at the data-level, they can introduce a form of double-checking (#26), which I'm fine with if we're clear that we want consistency between SAM and BAM at the data-level where possible.
In any case, I generally think accessors should assume the presence of value.
Also, once the raw byte data of a record has been interpreted, I'm not wedded to the idea that the record must maintain an exact byte structure of the original.
Which sentinel value? I'm leaning towards nothing (because it so qucikly errors if you dont handle it correctly, that leads to fewer bugs for the end user), but if you want missing, that's fine as well.
I would now lean towards nothing too -- the use of missing was annoying when chaining field comparisons.
BTW, I like that base quality print out.
I think it's useful to think in terms of three levels:
The issue with the API level is that there may be a conflict between having an easy API and having a performant one. In some circumstances it may be too difficult to NOT expose some of the underlying data, even though it shouldnt be relevant at the API level.
Actually I think the best thing may be to just make a new branch and play around a bit with it. Then I'll try to use the package in some small test code and get a feel for it.
You present the levels well.
I wonder whether we gain anything by not locking ourselves into the spec at the data-level?
To elaborate, there are the raw data files as per BAM and SAM specs, and our interpretation of the raw data which we keep in our struct. We could translate the raw data (null defaults) as they are loaded into our struct and then again if they are written to file.
If we can rule out translation at that level, then the remaining spot for translation is in the accessors and setters.
I don't think there is anything to gain by pushing the translation to the bottom layer. It increases complexity and introduces a performance penalty in that the fields would be translated unnecessarily for records that after a primary interrogation of the flag field would be discarded. It makes sense to defer translation to if and when you require the field data. I'm not particularly bothered by the possibility of repeated translation with multiple field access, as these fields in my experience are single-use, wouldn't you agree @jakobnissen?
Yes, I agree. There's also something rather nice about having the layout of a BAM struct be exactly equal to its representation in a file.
See also #24 and #25 .
This is a work in progress. In particular, I will first change the BAM code and get that right, and then afterwards change the SAM code to behave identically. Ideally, I would like SAM and BAM records to behave almost exactly identical, but that might not be feasible, let's see.
Changes: (updated as I go along):
hasposition
to always return a Boolisfilled
andhasflag
has been removedThings to mull over
has
-functions, or just return a specific sentinel value when the record doesn't have the information available, e.g.nothing
? The latter may be neater.nothing
(because it so qucikly errors if you dont handle it correctly, that leads to fewer bugs for the end user), but if you wantmissing
, that's fine as well.