BioJulia / BioSequences.jl

Biological sequences for the julia language
http://biojulia.dev/BioSequences.jl
MIT License
150 stars 47 forks source link

Check interfaces, remove NotImplemented errors #140

Closed jakobnissen closed 2 years ago

jakobnissen commented 3 years ago

Errors such as e.g. https://github.com/BioJulia/BioSequences.jl/blob/0ba3d4b2e1dab04b77e244141501e6b33d53b975/src/biosequence/biosequence.jl#L21-L30

It's not more useful than a MethodError, and it hampers static analysis from e.g. JET, which will probably be a reality before the end of this year. Much better to simply remove it. (also see this blogpost)

Instead, we should go through and verify we have covered and documented interfaces for:

Also, the tests should include custom biosequence and alphabet types to make sure some basic behaviour is covered by fallbacks.

kescobo commented 3 years ago

This seems like a good idea, should we spitball what those interfaces actually are?

jakobnissen commented 3 years ago

Okay, so I decided to not make another issue. Here is my proposal for the interface for BioSequence and Alphabet. There are already inferfaces stated in the documentation, but if you actually try to implement a new sequence/alphabet type, you will find that they are insufficient (cc @BenJWard )

The following types are assumed to have the listed properties. You may create subtypes of that violate these properties, but then any fallback functions may be excessively slow or wrong.

What is an Alphabet?

What to implement?

Every subtype A of Alphabet must implement:

If you want interoperation with existing subtypes of BioSequence, the element type E must be UInt, and you must also implement:

Minimal example

In this example, the element type is AminoAcid, and the encoded data type is UInt.

struct ReducedAAAlphabet <: Alphabet end
Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid
BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()
function BioSequences.symbols(::ReducedAAAlphabet)
    (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,
     AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)
end

const (ENC_LUT, DEC_LUT) = let
    enc_lut = fill(0xff, length(alphabet(AminoAcid)))
    dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))
    for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))
        enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1
        dec_lut[i] = aa
    end
    (Tuple(enc_lut), Tuple(dec_lut))
end

function encode(::ReducedAAAlphabet, aa::AminoAcid)
    i = reinterpret(UInt8, aa) + 0x01
    (i ≥ length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))
    (@inbounds ENC_LUT[i]) % UInt
end

function decode(::ReducedAAAlphabet, x::UInt)
    x ≥ length(DEC_LUT) && throw(DomainError(aa))
    @inbounds DEC_LUT[x + UInt(1)]
end

What is a BioSequence?

What to implement?

Every subtype T of BioSequence{A} must implement the following methods, where x in an instance of the subtype:

Furthermore, mutable sequences should implement

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

Minimal example

struct Codon <: BioSequence{RNAAlphabet{2}}
    x::UInt8
end

function Codon(it)
    length(it) == 3 || error("Must have length 3")
    x = zero(UInt)
    for (i, nt) in enumerate(it)
        x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)
    end
    Codon(x)
end 

Base.length(::Codon) = 3
BioSequences.encoded_data_eltype(::Type{Codon}) = UInt
function BioSequences.extract_encoded_element(x::Codon, i::Int)
    (x.x >>> ((i-1) & 7)) % UInt
end
TransGirlCodes commented 3 years ago

@jakobnissen So these two examples have made their way into the tests, and I can include them in the manual as example custom types if necessary. Can this close?

jakobnissen commented 3 years ago

Yes! Do they actually work? 😅 I kind of forgot if the custom types were thoroughly checked. Should I have a look tomorrow, or have you already tested the interfaces? If so, please do close this!

On Wed, Jun 23, 2021, 20:35 Ben J. Ward @.***> wrote:

@jakobnissen https://github.com/jakobnissen So these two examples have made their way into the tests, and I can include them in the manual as example custom types if necessary. Can this close?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/BioJulia/BioSequences.jl/issues/140#issuecomment-867069028, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFQ6SXW5RX4ZU7RM3OEDC7TTUISPFANCNFSM4Y76QFGQ .

TransGirlCodes commented 3 years ago

The tests all pass. But we should probably, do like in the blogpost suggests about anti-patterns, where there's a function that also other users can use, that tests any type conforms to the interface.

jakobnissen commented 2 years ago

I basically think this is done now, as #202 is merged and I've gone through the tests. It's not true that all tests also include an operation on a custom alphabet, but code coverage of the generic code paths should be good enough.