Open gszep opened 2 years ago
Dear @gszep
Interesting idea. I'm sure CenteredSequence
would be practical in some scenarios like the one you describe. However, from a practical point of view, I am nearly certain implementing CenteredSequence
would be an utter nightmare, and an unending source of bugs. This could be justified if CenteredSequence
was absolutely critical for BioSequences.jl
, but it's not. Therefore, I am against the proposal.
I don't like shooting down reasonable and well thought out proposals from users, so let me justify why I think CenteredSequence
would be hard to get right.
The issue is twofold. Most importantly, it violates the very first point of the BioSequence
interface, namely:
Its [i.e. BioSequences] subtypes are characterized by:
• Being a linear container type with random access and indices Base.OneTo(length(x)).
When we wrote this interface, we did think about whether it was a reasonable assumption. In particular, we thought of circular sequences like plasmids. However, we ended up accepting this limitation, because assuming fixed linear indices makes everything much simpler.
More fundamentally, when implementing any kind of abstract type, there must be a concept of an interface, which makes a set of assumptions that are not violated. Julia has historically been very bad at defining and enforcing these interfaces, which is part of the reason the language is so unstable. We need to do better than what we do already. You mention OffsetArrays, which has recently gained infamy due to its buggy behaviour - see https://yuri.is/not-julia/ , precisely because most AbstractArray
code assumes its indices are 1:length(x)
. It'd be even worse for CenteredSequence
, since this assumption is written down in the docstring of BioSequence
!
The second reason is, even if 1-based indexing was not part of the BioSequence interface, it's assumptions like that that enables generic code in the first place. Right now, we can implement most methods on BioSequence
, because its behaviour is reasonably well defined. If we could not assume sequences length, or that the length is even finite, or how boundschecking works, so much generic code goes out the window.
Now, sometimes, generic code can be re-introduced by introducing new abstractions (similar to Base's
nextind
, prevind
etc), which themselves make assumptions about behaviour. And sometimes, these new abstractions are worth it! BioSequences uses lots and lots of these internally. But every abstraction has the risk of being leaky, of being inefficient, or of not covering edge cases. For example, how do you get an iterator from the 10th element to the 5th to last element in an array where you don't know the indexing scheme? If you know it uses conventional indexing, it's trivial: 10:lastindex(x)-5
.
And I think it would be particularly hard to come up with good abstractions for a BioSequence type which also includes CenteredSequence
.
Again, I think your proposal is eminently reasonable, and I get its advantage when working with peptides presented by MHC and such. I just don't think we can actually implement it in a fashion that doesn't cause a lot of grief and pain and bugs down the road.
I'll keep this open in case some other devs want to chime in.
By the way, the unexpected behaviour of unique
is due to a bug I discovered a few days ago, #243 . It's fixed in the newly-released version 3.0.2.
Edit: Ah no, sorry, that bug is simply due to not having a fallback of hash
for BioSequence until v3.0.2.
Thank you for your thorough answer! You've pointed me to issues on correctness in the Julia ecosystem that I did not know existed 😨 I completely agree with you about the can of worms CenteredSequence
type might open
Ah, a fellow immunologist! Hello :wave: (I don't actually do much immunology anymore, but it's close to my heart).
One thing that @jakobnissen did not mention is that one of the things that's great about julia is that you can easily make a CenteredSequences.jl
package that implements your proposed interface. You can even subtype off of BioSequence{A}
as you did, but I think it's important to be well aware of the caveats that Jakob mentions. And I think I agree that it doesn't make sense to include it in the main BioSequences.jl
packages because of how difficult it would be to maintain. But if this is useful for your research, I'd encourage you to do it!
I think the correctness issues are not really as bad as that post makes out for the ecosystem as a whole (see this thoughtful but novel-length discussion), though when you're pushing the boundaries of functionality, it's important to keep in mind. Extensive testing is key!
I have a current need for this functionally.
@jakobnissen, @kescobo, and @SabrinaJaye, I wonder whether it is more reasonable to flip the responsibility of @gszep proposal and simply allow something like the following to work?
using BioSequences
using OffsetArrays
offset = -4000:4000
seq = randdnaseq(length(offset))
oa_seq = OffsetVector(seq, offset)
Currently, with BioSequences v3.1 the code errors with the following.
ERROR: MethodError: no method matching (OffsetVector{T} where T)(::LongSequence{DNAAlphabet{4}}, ::UnitRange{Int64})
Closest candidates are:
(OffsetVector{T} where T)(::AbstractArray, ::Any...; kw...) at ~/.julia/packages/OffsetArrays/80Lkc/src/OffsetArrays.jl:207
Stacktrace:
[1] top-level scope
@ REPL[7]:1
So, for this functionality to work, BioSequence
would need to subtype AbstractArray
. But, I last recall there was some discussion as to whether BioSequence
should subtype from AbstractString
or AbstractArray
. Now that there appears to be a case to subtype from AbstractArray
, what are people's thoughts?
OffsetBioSequences.jl
or something. I'm quite wary of bringing in the added maintenance burden, when a separate package (that can be independently developed or bitrot as needed) that can add the OffsetVector(seq...)
method. It's maybe a bit of type piracy, but :shrug:And there's probably a better way that doesn't do type piracy but can freely convert back and forth. Anyway, I say I'm concerned about maintenance burden, but I'm not the primary maintainer here, so please hold my objection lightly.
Right, so I still stand by my arguments in #173 - subtyping AbstractVector
won't do us any favors in the long run. And to be honest, I think that OffsetArrays
is an excellent warning against what happens if you create a subtype which violates fundamental assumptions about its supertype.
To implement CenteredSequence
, the way I recommend would be:
BioSequences
BioSequences
which throws / is buggy on CenteredSequence
.There are some design issues that you might want to consider. For example, should eachindex
be supported? If so, what should it return, if there are infinite valid indices? Same with firstindex
, lastindex
. What should iterating over the sequence return? It can't return the elements in order since they "begin" with an infinite series of gaps.
Oh darn, I misspoke. I don't need the gap padding, just the offset - so no violation of assumptions as suggested.
As an aside, with the responsibility inverted, the CenteredSequence
functionality might be composed in the following way.
using BioSequences
using BioSymbols
using OffsetArrays
"Unoptimsed PaddedSequence"
struct PaddedSequence
seq
end
"Unoptimsed getindex method for PaddedSequence"
function Base.getindex(padded::PaddedSequence, i::Int)
if firstindex(padded.seq) ≤ i ≤ lastindex(padded.seq)
return getindex(padded.seq, i)
end
return gap(eltype(padded.seq))
end
"Unoptimsed getindex method for PaddedSequence"
function Base.getindex(padded::PaddedSequence, r::UnitRange{Int})
f = Base.Fix1(getindex, padded)
return f.(r)
end
Checking the padded sequence.
julia> seq = dna"GATC" |> collect
4-element Vector{DNA}:
DNA_G
DNA_A
DNA_T
DNA_C
julia> pseq = PaddedSequence(seq)
PaddedSequence(DNA[DNA_G, DNA_A, DNA_T, DNA_C])
julia> pseq[-1:6]
8-element Vector{DNA}:
DNA_Gap
DNA_Gap
DNA_G
DNA_A
DNA_T
DNA_C
DNA_Gap
DNA_Gap
Now using Vector{DNA} as a standing for BioSequences
subtyped from AbstractArray
.
seq = collect(seq)
4-element Vector{DNA}:
DNA_G
DNA_A
DNA_T
DNA_C
julia> offset = -1:2
-1:2
julia> oa_seq = OffsetVector(seq, offset)
4-element OffsetArray(::Vector{DNA}, -1:2) with eltype DNA with indices -1:2:
DNA_G
DNA_A
DNA_T
DNA_C
julia> oa_pseq = PaddedSequence(oa_seq)
PaddedSequence(DNA[DNA_G, DNA_A, DNA_T, DNA_C])
julia> oa_pseq[-4:4]
9-element Vector{DNA}:
DNA_Gap
DNA_Gap
DNA_Gap
DNA_G
DNA_A
DNA_T
DNA_C
DNA_Gap
DNA_Gap
I don't think eachindex
, firstindex
, or lastindex
make much sense for a PaddedSequence
.
Sometimes it makes sense to begin indexing a sequence from its center. For example with hyper-variable loop regions for T cell / B cell receptors or peptides presented by antigen presenting cells. These are short amino acid sequences (lengths<20) whose start and end sequences are structured and center sequences are highly variable. Therefore I propose a
CenteredSequence{A}
typeExpected Behavior
AA_Gap
instead of out of bounds error, making it easier to align sequences of variable lengthsPossible Implementation
However there seem to be issues with this implementationEdit: fixed in BioSequence v3.0.2