BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License

Deal better with missingness #63

Closed jakobnissen closed 2 years ago

jakobnissen commented 2 years ago

To make an old discussion [1] more concrete:

FASTX.jl has an issue with missingness. Accessing a missing record throws an error, accessing a missing sequence returns an empty sequence, and accessing a missing identifier returns nothing. We should strive to unify these. But unify them to what?

It is clear to me that a missing sequence returning an empty sequence is plain wrong. Not only is it wrong, it's also obnoxiously dangerous: it presents an annoying edge case. So agrees @SabrinaJaye. That leaves two options: throwing an error, or returning a sentinel value like nothing.

The argument in favor of error-throwing is that you are guaranteed to get only the kind of object you expect. The argument in favor of nothing is that it's much easier to recover from (check for nothing) than from an error (use try/catch).
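To make the two recovery styles concrete, here is a minimal sketch. It does not use FASTX.jl's actual API: `ToyRecord`, `identifier_or_throw` and `identifier_or_nothing` are hypothetical names invented for illustration, and only `something` and `rethrow` are real Base functions.

```julia
# Hypothetical toy type for illustration; NOT FASTX.jl's Record.
struct ToyRecord
    identifier::Union{String, Nothing}
end

# Option 1: throw an error when the identifier is missing.
function identifier_or_throw(rec::ToyRecord)
    rec.identifier === nothing && throw(ArgumentError("record has no identifier"))
    return rec.identifier
end

# Option 2: return the sentinel value `nothing`.
identifier_or_nothing(rec::ToyRecord) = rec.identifier

rec = ToyRecord(nothing)

# Recovering from the error needs try/catch, and care not to
# swallow unrelated exceptions.
name_a = try
    identifier_or_throw(rec)
catch err
    err isa ArgumentError || rethrow()
    "unnamed"
end

# Recovering from the sentinel is a plain check...
id = identifier_or_nothing(rec)
name_b = id === nothing ? "unnamed" : id

# ...or a one-liner using Base's `something`, which returns
# its first argument that is not `nothing`.
name_c = something(identifier_or_nothing(rec), "unnamed")
```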

However, quite often, perhaps most of the time, the user is put in a situation where they don't really know whether e.g. the header is actually filled in or not. In those cases, the argument that the user is guaranteed to always get an AbstractString feels very academic, because their program could instead crash. Worse, this cannot easily be checked for statically.

The exception is when attempting to do something that is actually not allowed, like attempting to construct an invalid Record.

So what I propose is:

Checklist

I'll make a PR with these changes soon.

[1]: https://github.com/orgs/BioJulia/teams/developers/discussions/2