Open theosanderson opened 4 months ago
I'll write a primer on Fasta header specs/grammar, so that whoever picks this up doesn't need to do the research themself.
Per the BioPython docs, it appears that id
is the beginning of the header string until the first space. The rest is parsed into description
. name
seems to not be present in fasta files:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
IUPAC.protein),
id="YP_025292.1", name="HokC",
description="toxic membrane protein")
print(record.format("fasta"))
>YP_025292.1 toxic membrane protein
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
From https://biopython.org/wiki/SeqRecord
The most authoritative "reference implementation" seems to be Bio::SeqIO::fasta
: https://metacpan.org/pod/Bio::SeqIO::fasta
Relevant implementation: https://github.com/bioperl/bioperl-live/blob/9ce0d304f42ef3a6808e7e94be42cb56bb51068d/lib/Bio/SeqIO/fasta.pm#L281-L307
This is the grammar I (with the help of ChatGPT, but verifying correctness) deduced for fasta headers:
fasta_header ::= ">" identifier ( " " description )? "\n"
identifier ::= ( printable_char_no_space )+
description ::= ( printable_char - "\n" )*
printable_char_no_space ::= ? all printable ASCII characters except space ?
printable_char ::= ? all printable ASCII characters ?
Biopython uses the same grammar, as far as I can tell: https://github.com/biopython/biopython/blob/af00cf6c79887d80df8673e5cacde0786415ce34/Bio/SeqIO/FastaIO.py
In my impression, FASTA headers are more or less arbitrary. If we want to implement this feature, I would propose to aim for a general solution that is flexible and allows the use case that Loculus aims for.
Maybe the following approach makes sense:
/sample/alignedNucleotideSequences
, /sample/unalignedNucleotideSequences
and /sample/alignedAminoAcidSequences
fields
. That would create a unified API with /details
and /aggregated
. Default: [<primaryKey>]
.delimiter
. Default: /
.fields
parameter to the corresponding SILO actions.fields
(in the user-supplied order) and concatenating them with delimiter
.Open question: What happens if the primary key is not in the fields
?
Probably 2. is more reasonable, since multiple metadata fields might be unique enough to be useful. That would give the user the flexibility to query for other unique fields that are not the primary key.
As I've commented elsewhere, in order to allow semantic parsing of the header into fasta id
and description
, there needs to be a whitespace after the id
(and the id must not contain whitespace).
There are two options here:
I think @fengelniederhammer proposes option b, but we should probably consciously decide between the two, as option a has the advantage of being more spec compliant.
To make it more concrete:
Using the following query: /sample/unalignedNucleotideSequence?fields=displayName,releaseDate&fieldSeparator=|
looks like this
for option a):
>LOC0123.1 |Germany/LOC0123.1/2023-01-04|2024-06-14
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|2024-06-14
for option b):
>LOC0123.1|Germany/LOC0123.1/2023-01-04|2024-06-14
Option b could result in surprising parse results using Bio(Python|Perl) in some cases, e.g. when there's spaces in the fields:
>LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo sapiens
would parse as:
id='LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo'
description=sapiens
Whereas option a:
>LOC0123.1 |Germany/LOC0123.1/2023-01-04|Homo sapiens
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|Homo sapiens
would parse as one would expect:
id='LOC0123.1'
description='|Germany/LOC0123.1/2023-01-04|Homo sapiens'
or
description='Germany/LOC0123.1/2023-01-04|Homo sapiens'
Thank you very much for the suggestion! I fully agree that this would be a useful feature.
Timeline-wise: I think that this is not trivial to implement, at least not if it should be efficient (which is one of the core values for LAPIS). Given that our focus at the moment is to polish SILO and prepare the official public launch, I don't think that we will work on this issue in the upcoming 2-3 months.
Gentle reminder that this would be really good to have
LAPIS is now able to return sequences as JSON and NDJSON (https://github.com/GenSpectrum/LAPIS/issues/971, as of 0.3.6). Maybe this works as a workaround in the meantime? You could fetch the sequences and the /details
as JSON in Loculus and build the FASTA header yourself?
(Assuming that it doesn't) it would be nice if LAPIS could support returning FASTAs with "descriptions", e.g. https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta has both an accession and then a description, separated by a space.
There are various implementations possible, from hardcoding the description field to allowing it to be selected in the query.
https://github.com/loculus-project/loculus/issues/2284#issuecomment-2220819967