GenSpectrum / LAPIS-SILO

Sequence Indexing engine for Large Order of genomic data
GNU Affero General Public License v3.0

Generalize wildcard lineage queries: Support more general hierarchies #458

Open corneliusroemer opened 3 months ago

corneliusroemer commented 3 months ago

Motivation

@Taepper, @chaoran-chen and I have discussed generalizing the Pango wildcard search feature, as monophyletic clade search is important (non-wildcard searches are paraphyletic, which is usually not what one wants). In fact, if it were up to me, I would make wildcard/monophyletic search the default behaviour and make paraphyletic searches opt-in, e.g. through a ! or maybe ^ operator, instead of requiring *.

The current hierarchy feature is very restrictive: it requires a grammar of ALIAS[.DECIMALNUMBER]{0,3}
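To make the restriction concrete, here is a minimal sketch of that grammar as a regular expression. It assumes that ALIAS means one or more uppercase letters and DECIMALNUMBER means one or more digits (the grammar above does not spell either out), and `clade-IIb` is just a made-up label standing in for a non-Pango scheme.

```python
import re

# Assumption: ALIAS = one or more uppercase letters,
# DECIMALNUMBER = one or more digits.
PANGO_LIKE = re.compile(r"[A-Z]+(\.[0-9]+){0,3}")


def fits_current_grammar(label: str) -> bool:
    """Return True if the whole label matches ALIAS[.DECIMALNUMBER]{0,3}."""
    return PANGO_LIKE.fullmatch(label) is not None


for label in ["B", "B.1.1.529", "JN.1.11.1", "B.1.1.529.2", "clade-IIb"]:
    print(label, fits_current_grammar(label))
```

Under these assumptions the unaliased name `B.1.1.529.2` is rejected because it has four numeric components, and any label with a different shape fails outright.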

Many viruses have Pango-like lineage systems, but almost none use the exact spec according to the grammar above.

For example (using a semi-pseudo-regex-style-grammar):

Goal

Generalize the existing hierarchical search feature to be compatible with the above grammars.

Proposal spec

The simplest way to encompass general monophyletic search on a tree of lineages is to not bother with the grammars at all, and simply treat lineage names as string labels that are related in a (multi-)tree-like fashion. Initially, we probably want to think about these as a DAG (directed acyclic (multi)graph).

To take away the grammar complexity from SILO/LAPIS and work with all DAGs, all we need is a simple edge list as an input config file. Done!
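As a rough sketch of why an edge list is enough: a wildcard (monophyletic) query becomes a plain graph traversal from the queried label. The `(parent, child)` tuple layout and the function names below are illustrative assumptions, not SILO's actual data structures.

```python
from collections import defaultdict, deque

# Hypothetical edge list of (parent, child) pairs for a few Pango lineages.
EDGES = [
    ("B", "B.1"),
    ("B.1", "B.1.1"),
    ("B.1.1", "B.1.1.529"),
    ("B.1.1.529", "BA.2"),
    ("BA.2", "BA.2.86"),
    ("BA.2.86", "BA.2.86.1"),
    ("BA.2.86.1", "JN.1"),
]


def build_children(edges):
    """Group the edge list into a parent -> [children] map."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    return children


def clade(root, children):
    """Return root plus all of its descendants (the monophyletic clade)."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:  # guards against revisiting in a DAG
                seen.add(child)
                queue.append(child)
    return seen


children = build_children(EDGES)
print(sorted(clade("BA.2.86", children)))
# prints ['BA.2.86', 'BA.2.86.1', 'JN.1']
```

No knowledge of the naming grammar is needed here; the labels are opaque strings.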

These are easy to generate programmatically from the pango_alias.json file and others like it. They'd be a bit more verbose, but that doesn't matter, as a few thousand edges are not a problem at all for computers.

The simplest way to do it is (I'm using yaml for ease of reading/writing, but this could well be JSON):

LINEAGE_LABEL:
  parent: PARENT_LABEL
  aliases: 
  - ALIAS_LABEL1
  - ALIAS_LABEL2
  other_metadata (optional): dict

Example for a subset of Pango:

B:
  parent: null
  aliases: null
  other_metadata:
    designation_date: 2020-04-01
B.1:
  parent: B
  aliases: null
B.1.1:
  parent: B.1
  aliases: null
B.1.1.529:
  parent: B.1.1
  aliases: 
  - BA
BA.2:
  parent: B.1.1.529
  aliases:
  - B.1.1.529.2
BA.2.86:
  parent: BA.2
  aliases:
  - B.1.1.529.2.86
BA.2.86.1:
  parent: BA.2.86
  aliases:
  - JN
  - B.1.1.529.2.86.1
JN.1:
  parent: BA.2.86.1
  aliases:
  - BA.2.86.1.1
  - B.1.1.529.2.86.1.1
JN.1.11:
  parent: JN.1
  aliases:
  - BA.2.86.1.1.11
  - B.1.1.529.2.86.1.1.11
JN.1.11.1:
  parent: JN.1.11
  aliases:
  - KP
  - BA.2.86.1.1.11.1
  - B.1.1.529.2.86.1.1.11.1
KP.3:
  parent: JN.1.11.1
  aliases:
  - JN.1.11.1.3
  - BA.2.86.1.1.11.1.3
  - B.1.1.529.2.86.1.1.11.1.3
...
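A file in this shape could be checked and indexed with something like the sketch below. It assumes a YAML or JSON loader has already produced a plain dict (written inline here to avoid a parser dependency), and it treats each alias as an alternative label for the same node; both the function names and that alias interpretation are assumptions.

```python
# Assumption: this dict is what a YAML/JSON loader would yield for the
# first few entries of the example.
DEFINITIONS = {
    "B": {"parent": None, "aliases": None},
    "B.1": {"parent": "B", "aliases": None},
    "B.1.1": {"parent": "B.1", "aliases": None},
    "B.1.1.529": {"parent": "B.1.1", "aliases": ["BA"]},
}


def build_alias_index(definitions):
    """Map every label and alias to its canonical label; reject duplicates."""
    index = {}
    for label, entry in definitions.items():
        for name in [label, *(entry.get("aliases") or [])]:
            if name in index:
                raise ValueError(f"{name!r} is defined more than once")
            index[name] = label
    return index


def check_parents(definitions):
    """Ensure every non-null parent is itself a defined label."""
    for label, entry in definitions.items():
        parent = entry.get("parent")
        if parent is not None and parent not in definitions:
            raise ValueError(f"{label!r} has unknown parent {parent!r}")


check_parents(DEFINITIONS)
ALIAS_INDEX = build_alias_index(DEFINITIONS)
print(ALIAS_INDEX["BA"])  # prints B.1.1.529
```

Rejecting a name that appears twice is a cheap way to catch copy-paste slips in a hand-edited file.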

I think you get the point. Now, one interesting possible extension is to include recombinants not as their own trees but to give them two parents. One could then offer an alternative wildcard query mode that treats a recombinant as a descendant of both of its parents (but only counting it once if both parents are included in a query).

The recombinant multi-parent mode would be cool, but it is far less important than supporting general schemes.
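For illustration only, the multi-parent mode might look like the sketch below. The labels are made up, and the `parents` list is an assumed extension of the single `parent` key in the proposed format; collecting results in a set is what makes a recombinant count only once when both of its parents are queried.

```python
from collections import defaultdict

# Hypothetical labels; "parents" (a list) is an assumed extension of the
# single "parent" key in the proposal.
DEFINITIONS = {
    "A": {"parents": []},
    "A.1": {"parents": ["A"]},
    "A.2": {"parents": ["A"]},
    "X1": {"parents": ["A.1", "A.2"]},  # recombinant with two parents
    "X1.1": {"parents": ["X1"]},
}


def build_children(definitions):
    """Invert the parent lists into a parent -> children map."""
    children = defaultdict(set)
    for label, entry in definitions.items():
        for parent in entry["parents"]:
            children[parent].add(label)
    return children


def clade(roots, definitions, follow_recombinants=True):
    """Collect the roots and their descendants, each label at most once."""
    children = build_children(definitions)
    result = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in result:
            continue
        result.add(node)
        for child in children.get(node, ()):
            is_recombinant = len(definitions[child]["parents"]) > 1
            if follow_recombinants or not is_recombinant:
                stack.append(child)
    return result


print(sorted(clade(["A.1", "A.2"], DEFINITIONS)))
# prints ['A.1', 'A.2', 'X1', 'X1.1']
print(sorted(clade(["A.1"], DEFINITIONS, follow_recombinants=False)))
# prints ['A.1']
```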

I'm happy to provide further examples, e.g. for mpox; it's quite trivial to automate these.

Taepper commented 2 days ago

@corneliusroemer I currently have a blocker: A current feature of pango_lineage columns is that they are case-insensitive. What do you think of this feature? Is this helpful or can we regress here? This would again place an assumption on lineage schemes, which we want to remove here. If we do not want this in general, is it worth it to put in extra time to maybe specify an alias scheme that would facilitate this, or insert a config flag?

corneliusroemer commented 2 days ago

Great question! So you are wondering whether lineages should be considered case insensitive or not.

In principle, it's good to support case insensitivity when the underlying system is case-insensitive. But in general, some crazy people could come up with a case-sensitive system, so it might make sense to allow case-sensitive search.

So I think it's best to start off making it case-sensitive, and let clients (frontends) do the normalization if they want to support case-insensitive search. Does that make sense? It means less worry, less config, and good generalizability.

Taepper commented 2 days ago

Alright, if @chaoran-chen is also happy with this, I will change this. This means in particular that the casing in the metadata file, the lineage tree definition file, and the queries needs to match.

Also, should queries that ask for a non-existent lineage return an empty result (current behavior), or give the user an error? This would now be possible, as the lineage tree definition file can be expected to be an exhaustive list of all lineages, whereas the alias list before was definitely non-exhaustive.

corneliusroemer commented 2 days ago

I think an error would make sense, so that the API consumer can handle it appropriately, i.e. covspectrum wouldn't show the raw error to the user; it could wrap it nicely and say "you queried a non-existent lineage".

I think it's good to require case consistency - one could at some point add a flag to normalize from lower to upper within silo but that's not necessary to do in silo, can be handled by client and keeps code base lighter.
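A rough sketch of that split of responsibilities, with entirely hypothetical names (this is not SILO's or LAPIS's API): the engine does an exact, case-sensitive lookup and raises on unknown labels, while the client normalizes input and rewords the error. Upper-casing is only a valid normalization for schemes whose labels are all uppercase, as Pango's are.

```python
class UnknownLineageError(KeyError):
    """Raised when a query names a lineage absent from the definition file."""


# Hypothetical set of labels loaded from the lineage tree definition file.
KNOWN_LINEAGES = {"B", "B.1", "B.1.1", "B.1.1.529", "BA.2", "JN.1"}


def resolve(label, known=KNOWN_LINEAGES):
    """Engine side: case-sensitive lookup, error on unknown labels."""
    if label not in known:
        raise UnknownLineageError(label)
    return label


def client_query(user_input):
    """Client side: normalize case, then wrap the engine error nicely.

    Assumption: the scheme is all-uppercase, so .upper() is safe here.
    """
    try:
        return f"querying {resolve(user_input.strip().upper())}"
    except UnknownLineageError:
        return f"you queried a non-existent lineage: {user_input}"


print(client_query("jn.1"))  # prints querying JN.1
print(client_query("ZZ.9"))  # prints you queried a non-existent lineage: ZZ.9
```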

Taepper commented 2 days ago

Thank you very much for the input :)

Exactly, that is what I envisioned with the errors, as this would now be an additional enhancement coming from this issue / generalization :)

corneliusroemer commented 2 days ago

Always happy to provide guidance/feedback, thanks for asking this great question! If you have any similar questions just shoot!