FASTX.jl has an issue with missingness. Accessing a missing record throws an error, accessing a missing sequence return an empty sequence, and accessing a missing identifier returns nothing. We should strive to unify these. But unify them to what?
It is clear to me that a missing sequence returning an empty sequence is plain wrong. Not only is it wrong, it's also obnoxiously dangerous - it presents an annoying edge case. So agrees @SabrinaJaye. That leaves two options: Throwing an error, or returning a sentinel value like nothing.
The argument in favor of error-throwing is that you are guaranteed to only get either the kind of object or expect.
The argument in favor of nothing is that it's much easier to recover from (check for nothing) than from an error (use try catch).
However, quite often, perhaps most of the time, the user is put in a situation where they don't really know whether e.g. the header is actually filled in or not. In those cases, the argument that the user is guaranteed to always get an AbstractString feels very academic, because their program could instead crash. Worse, this cannot easily be checked for statically.
The exception is when attempting to do something that is actually not allowed, like attempting to construct an invalid Record.
So what I propose is:
We remove the concept of a "filled" record. For backwards compatibility, we just let isfilled always return true. Attempting to construct an empty record throws an error
Accessing any optional field return nothing if they are not present. Only: FASTA.description and FASTQ.description may be missing.
FASTA
FASTA files always have an identifier, but it may be the empty string.
FASTA files may or may not have a description. It has a description if there is whitespace AND non-whitespace after the identifier. If there is whitespace immediately after the > symbol, then non-whitespace, the identifier is empty and there is a description.
Checklist
[x] - FASTA/Q header and identifier always return a StringView
[x] - FASTA/Q descritption returns nothing or StringView
[x] - FASTA/Q sequence is always non-empty and always return a sequence
[x] - FASTQ quality is always non-empty and returns a an AbstractVector
[x] - Constructors check for non-empty mandatory fields. Make isfilled always true.
To make an old discussion [1] more concrete:
FASTX.jl has an issue with missingness. Accessing a missing record throws an error, accessing a missing sequence return an empty sequence, and accessing a missing identifier returns
nothing
. We should strive to unify these. But unify them to what?It is clear to me that a missing sequence returning an empty sequence is plain wrong. Not only is it wrong, it's also obnoxiously dangerous - it presents an annoying edge case. So agrees @SabrinaJaye. That leaves two options: Throwing an error, or returning a sentinel value like
nothing
.The argument in favor of error-throwing is that you are guaranteed to only get either the kind of object or expect. The argument in favor of
nothing
is that it's much easier to recover from (check fornothing
) than from an error (usetry catch
).However, quite often, perhaps most of the time, the user is put in a situation where they don't really know whether e.g. the header is actually filled in or not. In those cases, the argument that the user is guaranteed to always get an
AbstractString
feels very academic, because their program could instead crash. Worse, this cannot easily be checked for statically.The exception is when attempting to do something that is actually not allowed, like attempting to construct an invalid Record.
So what I propose is:
isfilled
always returntrue
. Attempting to construct an empty record throws an errornothing
if they are not present. Only:FASTA.description
andFASTQ.description
may be missing.FASTA
Checklist
isfilled
always true.I'll make a PR with these changes soon.
1: https://github.com/orgs/BioJulia/teams/developers/discussions/2