GenSpectrum / LAPIS

An API, a query engine, and a database schema for genomic sequences; currently with a focus on SARS-CoV-2
https://lapis-three.vercel.app
GNU Affero General Public License v3.0
22 stars 6 forks source link

FASTA descriptions #857

Open theosanderson opened 4 months ago

theosanderson commented 4 months ago

(Assuming that it doesn't) it would be nice if LAPIS could support returning FASTAs with "descriptions", e.g. https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta has both an accession and then a description, separated by a space.

image

There are various implementations possible, from hardcoding the description field to allowing it to be selected in the query.

https://github.com/loculus-project/loculus/issues/2284#issuecomment-2220819967

corneliusroemer commented 4 months ago

I'll write a primer on Fasta header specs/grammar, so that whoever picks this up doesn't need to do the research themself.

Per the BioPython docs, it appears that id is the beginning of the header string until the first space. The rest is parsed into description. name seems to not be present in fasta files:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                        IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein")

print(record.format("fasta"))
>YP_025292.1 toxic membrane protein
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

From https://biopython.org/wiki/SeqRecord

The most authoritative "reference implementation" seems to be Bio::SeqIO::fasta: https://metacpan.org/pod/Bio::SeqIO::fasta

Relevant implementation: https://github.com/bioperl/bioperl-live/blob/9ce0d304f42ef3a6808e7e94be42cb56bb51068d/lib/Bio/SeqIO/fasta.pm#L281-L307

This is the grammar I (with the help of ChatGPT, but verifying correctness) deduced for fasta headers:

fasta_header ::= ">" identifier ( " " description )? "\n"
identifier    ::= ( printable_char_no_space )+
description   ::= ( printable_char - "\n" )*

printable_char_no_space ::= ? all printable ASCII characters except space ?
printable_char          ::= ? all printable ASCII characters ?

Biopython uses the same grammar, as far as I can tell: https://github.com/biopython/biopython/blob/af00cf6c79887d80df8673e5cacde0786415ce34/Bio/SeqIO/FastaIO.py

fengelniederhammer commented 4 months ago

In my impression, FASTA headers are more or less arbitrary. If we want to implement this feature, I would propose to aim for a general solution that is flexible and allows the use case that Loculus aims for.

Maybe the following approach makes sense:

Open question: What happens if the primary key is not in the fields?

  1. silently add it (to the beginning? to the end?)
  2. ignore it and leave the user with potentially useless fasta files
  3. throw an error

Probably 2. is more reasonable, since multiple metadata fields might be unique enough to be useful. That would give the user the flexibility to query for other unique fields that are not the primary key.

corneliusroemer commented 4 months ago

As I've commented elsewhere, in order to allow semantic parsing of the header into fasta id and description, there needs to be a whitespace after the id (and the id must not contain whitespace).

There are two options here:

I think @fengelniederhammer proposes option b, but we should probably consciously decide between the two, as option a has the advantage of being more spec compliant.

To make it more concrete: Using the following query: /sample/unalignedNucleotideSequence?fields=displayName,releaseDate&fieldSeparator=| looks like this for option a):

>LOC0123.1 |Germany/LOC0123.1/2023-01-04|2024-06-14
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|2024-06-14

for option b):

>LOC0123.1|Germany/LOC0123.1/2023-01-04|2024-06-14

Option b could result in surprising parse results using Bio(Python|Perl) in some cases, e.g. when there's spaces in the fields:

>LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo sapiens

would parse as:

id='LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo'
description=sapiens

Whereas option a:

>LOC0123.1 |Germany/LOC0123.1/2023-01-04|Homo sapiens
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|Homo sapiens

would parse as one would expect:

id='LOC0123.1'
description='|Germany/LOC0123.1/2023-01-04|Homo sapiens'
or
description='Germany/LOC0123.1/2023-01-04|Homo sapiens'
chaoran-chen commented 4 months ago

Thank you very much for the suggestion! I fully agree that this would be a useful feature.

Timeline-wise: I think that this is not trivial to implement, at least not if it should be efficient (which is one of the core values for LAPIS). Given that our focus at the moment is to polish SILO and prepare the official public launch, I don't think that we will work on this issue in the upcoming 2-3 months.

corneliusroemer commented 3 weeks ago

Gentle reminder that this would be really good to have

fengelniederhammer commented 3 weeks ago

LAPIS is now able to return sequences as JSON and NDJSON (https://github.com/GenSpectrum/LAPIS/issues/971, as of 0.3.6). Maybe this works as a workaround in the meantime? You could fetch the sequences and the /details as JSON in Loculus and build the FASTA header yourself?