Record type inference - Githubissues

cjprybol commented 1 year ago

Is there anything available for inferring the FASTA record type from the sequence?

In earlier versions of FASTX I think this was done by default, and all of the records read in by Readers were returned as variants of LongSequence rather than strings. Now the same functionality is available optionally if you specify the return type when calling FASTX.sequence({desired_return_type}, record)

What I'm looking for is something along the lines of

FASTX.sequence(BioSequences.LongSequence, record) -> Union{LongDNA, LongRNA, LongAA}

#or BioSequences would be a better home for the inference
BioSequences.LongSequence(FASTX.sequence(record)) -> Union{LongDNA, LongRNA, LongAA}

Expected Behavior

Ambiguous interpretations lead to errors

julia> x = FASTX.FASTA.Record("identifier" , "AAAA")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAA"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
4nt DNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
4nt RNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongAA, x)
4aa Amino Acid Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongSequence, x)
Error -> ambiguous sequence type

unambiguous interpretations lead to auto-inferred sequence types

julia> x = FASTX.FASTA.Record("identifier" , "AAAAJ")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAAJ"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.DNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[21]:1

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.RNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.RNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[22]:1

julia> FASTX.sequence(BioSequences.LongAA, x)
5aa Amino Acid Sequence:
AAAAJ

julia> FASTX.sequence(BioSequences.LongSequence, x) #can only be an AA seq, per the available alphabets
5aa Amino Acid Sequence:
AAAAJ

Current Behavior

Can't use a generic LongSequence for any record

julia> FASTX.sequence(BioSequences.LongSequence, x)
ERROR: MethodError: no method matching BioSequences.LongSequence(::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] sequence(::Type{BioSequences.LongSequence}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [2] top-level scope
   @ REPL[19]:1

julia> BioSequences.LongSequence(FASTX.sequence(x))
ERROR: MethodError: no method matching BioSequences.LongSequence(::StringViews.StringView{SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[26]:1

Context

In addition to validating whether a FASTA is valid https://biojulia.github.io/FASTX.jl/latest/fasta/#FASTX.FASTA.validate_fasta it would be useful to have functionality to auto-infer the type of records in the FASTA

I'd need to think through the most logical way to check, but I think an order of operations to infer the best alphabet match might be like

DNA{2} -> RNA{2} -> DNA{4} -> RNA{4} -> AA

The AA alphabet (letter codes, not molecules) seems to be a superset of DNA/RNA alphabet, and the T/U difference I think is enough to differentiate between DNA/RNA

link to codes https://www.ddbj.nig.ac.jp/ddbj/code-e.html

cjprybol commented 1 year ago

jakobnissen commented 1 year ago

This is a good idea, and I think we should have this feature, with a few caveats.

First, IMO, it belongs in BioSequences, not FASTX (hence, I transferred the issue to this repository). The reason is principal: The biological sequence type is not a feature of the FASTA format, in which the sequences really are just text that can contain anything - indeed, they often contain non-standard symbols. This is also the motivation why FASTA.sequence returns a string in FASTX v2, where it returned a BioSequence in v1 - all records really contain are strings, and interpreting them to a specific biological alphabet is a distinct process, which is done by BioSequences.

The second caveat is that autodetection of sequence type is bound to be both flaky and inefficient, no matter how we do it. That's actually why we removed autodetection for v2 (see https://github.com/BioJulia/FASTX.jl/issues/59). As one of the goals of BioJulia more broadly is to allow people to use robust software, we should be wary of adding flaky functions that users might accidentally rely on, and as a result, produce unreliable software.

That doesn't mean we can't have it, but it should just be named something like guess_parse or guess_seq, which makes it clear that that's all it's doing. I think it's worth having, for convenient REPL work where it's fine if it's a little unreliable and slow. I agree in that case, we should just check each of the predefined alphabets in order, and error if it's ambiguous between RNA and DNA.

We might also want to remove the method for kmers you linked to, before Kmers.jl is released, for the same reasons.

jakobnissen commented 1 year ago

A few ideas for implementing this:

Using tryparse (see #224), simply trying each alphabet in order
Make a length NTuple{256, UInt8} lookup table (can also fit in a length-128 table) where each byte encodes four bits: If it contains valid DNA, valid RNA, IUPAC ambiguous nucleotides, or valid AA. Then we simply check each byte and AND the bitmask together before making the guess. If no bits are set, we error. If only AA bit is set, we pick amino acid alphabet. If both RNA and DNA alphabets is set, we error. Else we pick DNA/RNA, and 2/4 bits based on the ambiguous nucleotide bit.

BioJulia / BioSequences.jl

Record type inference #268

Expected Behavior

Current Behavior

Context