Open cjprybol opened 1 year ago
This is a good idea, and I think we should have this feature, with a few caveats.
First, IMO, it belongs in BioSequences, not FASTX (hence, I transferred the issue to this repository). The reason is one of principle: the biological sequence type is not a feature of the FASTA format, in which the sequences really are just text that can contain anything - indeed, they often contain non-standard symbols. This is also why `FASTA.sequence` returns a string in FASTX v2, where it returned a BioSequence in v1 - all that records really contain are strings, and interpreting them under a specific biological alphabet is a distinct process, which is done by BioSequences.
The second caveat is that autodetection of sequence type is bound to be both flaky and inefficient, no matter how we do it. That's actually why we removed autodetection for v2 (see https://github.com/BioJulia/FASTX.jl/issues/59). As one of the goals of BioJulia more broadly is to allow people to use robust software, we should be wary of adding flaky functions that users might accidentally rely on, and as a result, produce unreliable software.
That doesn't mean we can't have it, but it should be named something like `guess_parse` or `guess_seq`, which makes it clear that that's all it's doing. I think it's worth having, for convenient REPL work where it's fine if it's a little unreliable and slow. I agree that in that case, we should just check each of the predefined alphabets in order, and error if it's ambiguous between RNA and DNA.
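As a rough illustration of the "check each alphabet in order" idea, here is a minimal sketch in Python rather than Julia. It is not BioSequences API: the function borrows the `guess_seq` name proposed above purely as a placeholder, and the letter sets are simplified assumptions (IUPAC nucleotide codes and an extended amino acid code list, with no gap symbols). Nucleotide alphabets are tried before amino acids because, in these sets, every nucleotide letter is also an amino acid letter.

```python
# Hypothetical, simplified letter sets (assumptions, not BioSequences' alphabets).
DNA_LETTERS = frozenset("ACGTMRWSYKVHDBN")
RNA_LETTERS = frozenset("ACGUMRWSYKVHDBN")
AA_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWYBJZXOU*")


def guess_seq(text: str) -> str:
    """Guess which alphabet `text` belongs to: "DNA", "RNA" or "AA".

    Raises ValueError if the text fits both DNA and RNA (no T or U to
    tell them apart) or fits none of the alphabets.
    """
    letters = set(text.upper())
    fits_dna = letters <= DNA_LETTERS
    fits_rna = letters <= RNA_LETTERS
    if fits_dna and fits_rna:
        raise ValueError("ambiguous: sequence is valid as both DNA and RNA")
    if fits_dna:
        return "DNA"
    if fits_rna:
        return "RNA"
    # Only fall back to amino acids once both nucleotide alphabets are ruled out.
    if letters <= AA_LETTERS:
        return "AA"
    raise ValueError("sequence does not fit any known alphabet")


print(guess_seq("ACGT"))   # DNA
print(guess_seq("acgu"))   # RNA
print(guess_seq("MKVLE"))  # AA
```

Note that a short peptide made only of letters such as A, C, G and T would be guessed as DNA, which is exactly the kind of unreliability that the `guess_` prefix is meant to signal.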
We might also want to remove the method for kmers you linked to, before Kmers.jl is released, for the same reasons.
A few ideas for implementing this:
Use an `NTuple{256, UInt8}` lookup table (it can also fit in a length-128 table) where each byte encodes four bits: whether it is valid DNA, valid RNA, an IUPAC ambiguous nucleotide, or a valid amino acid (AA). Then we simply look up each byte of the sequence and AND the per-byte bitmasks together before making the guess. If no bits are set, we error. If only the AA bit is set, we pick the amino acid alphabet. If both the RNA and DNA bits are set, we error. Otherwise we pick DNA or RNA, and a 2-bit or 4-bit encoding based on the ambiguous-nucleotide bit.
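The lookup-table idea can be sketched as follows, in Python rather than Julia and without touching any BioSequences API. All names, the letter sets and the returned labels are assumptions made for illustration. One deliberate deviation from the description above: the third bit marks *unambiguous* nucleotides (A, C, G, T, U) instead of ambiguous ones, so that it survives AND-ing only when every byte fits a 2-bit encoding.

```python
# Bit flags stored in each table entry (hypothetical layout).
DNA, RNA, UNAMBIG, AA = 1, 2, 4, 8


def _build_table() -> tuple:
    """Build a 256-entry table mapping each byte to its alphabet bitmask."""
    table = [0] * 256

    def mark(chars: str, bit: int) -> None:
        for ch in chars:
            table[ord(ch)] |= bit
            table[ord(ch.lower())] |= bit  # accept lowercase input too

    mark("ACGTMRWSYKVHDBN", DNA)              # IUPAC DNA codes (assumed set)
    mark("ACGUMRWSYKVHDBN", RNA)              # IUPAC RNA codes (assumed set)
    mark("ACGTU", UNAMBIG)                    # fits a 2-bit nucleotide encoding
    mark("ACDEFGHIKLMNPQRSTVWYBJZXOU*", AA)   # amino acid codes (assumed set)
    return tuple(table)


TABLE = _build_table()


def guess_alphabet(seq: bytes) -> str:
    """AND the per-byte masks together, then decide from the surviving bits."""
    mask = DNA | RNA | UNAMBIG | AA
    for byte in seq:
        mask &= TABLE[byte]
    is_dna, is_rna = bool(mask & DNA), bool(mask & RNA)
    if is_dna and is_rna:
        raise ValueError("ambiguous: sequence is valid as both DNA and RNA")
    if is_dna or is_rna:
        kind = "DNA" if is_dna else "RNA"
        bits = 2 if mask & UNAMBIG else 4
        return f"{kind}-{bits}bit"
    if mask & AA:
        return "AA"
    raise ValueError("sequence does not fit any known alphabet")


print(guess_alphabet(b"ACGT"))   # DNA-2bit
print(guess_alphabet(b"ACGUN"))  # RNA-4bit
print(guess_alphabet(b"MKVLE"))  # AA
```

The sequence is scanned once with a single table lookup and one AND per byte, whereas the ordered check may scan it once per alphabet. An empty input leaves every bit set and is therefore reported as ambiguous in this sketch.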
Is there anything available for inferring the FASTA record type from the sequence?
In earlier versions of FASTX I think this was done by default, and all of the records read in by `Readers` were returned as variants of `LongSequence` rather than strings. Now the same functionality is available optionally if you specify the return type when calling `FASTX.sequence({desired_return_type}, record)`.
What I'm looking for is something along the lines of
Expected Behavior
Ambiguous interpretations lead to errors
Unambiguous interpretations lead to auto-inferred sequence types
Current Behavior
Can't use a generic LongSequence for any record
Context
In addition to checking whether a FASTA file is valid (https://biojulia.github.io/FASTX.jl/latest/fasta/#FASTX.FASTA.validate_fasta), it would be useful to have functionality to auto-infer the type of the records in the FASTA.
I'd need to think through the most logical way to check, but I think an order of operations to infer the best alphabet match might be like
The AA alphabet (letter codes, not molecules) seems to be a superset of the DNA/RNA alphabets, and I think the T/U difference is enough to differentiate between DNA and RNA.
Link to codes: https://www.ddbj.nig.ac.jp/ddbj/code-e.html