apetkau / genomics-data-index

Indexes genomes using SNVs, MLST, or kmers for rapid querying, clustering, and visualization.
Apache License 2.0

Implement query language in command-line interface #74

Open apetkau opened 3 years ago

apetkau commented 3 years ago

It would be nice to have a way of defining queries on samples using the command-line interface in addition to the Python API.

Option 1: use Unix pipes

Queries in my system operate on sets of sample IDs, which are encoded using pyroaring. These sets can easily be serialized to bytes and deserialized again.

So, this gives me an idea of a command-line interface using Unix pipes and sample sets. For example:

gdi query hasa 'S:D614G' | gdi query hasa 'S:G142D' --summarize

The first command would select the samples with the D614G mutation, encode their sample IDs using pyroaring, and write them to stdout. The second gdi instance would read them from stdin, deserialize the sample set, and keep only those samples that also have the G142D mutation. Finally, a summary table of the results would be printed to the user. The equivalent in the Python API would be:

db.samples_query().hasa('S:D614G').hasa('S:G142D').summary()

This could be combined with building trees/alignments. For example:

gdi query hasa 'S:D614G' | gdi build alignment > d614g.aln

This would build an alignment of all those genomes with a D614G mutation.

This still needs some thought. For example, I need to encode both the present and unknown sample sets, and I need some way of automatically detecting whether the output is being piped to another command or printed to a terminal/file (where I may want to display human-readable results instead of encoded sets of sample identifiers).
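A minimal sketch of the two open points above, using only the standard library. The framing format (a magic marker followed by two length-prefixed payloads) and the names `MAGIC`, `pack_sets`, `unpack_sets`, and `is_pipe` are assumptions made for illustration, not anything gdi defines. The payloads are treated as opaque bytes; in practice they would presumably be the serialized pyroaring bitmaps for the present and unknown sets.

```python
import os
import stat
import struct
from typing import Tuple

MAGIC = b"GDI1"  # hypothetical marker so a reader can reject non-set input


def pack_sets(present: bytes, unknown: bytes) -> bytes:
    """Frame two serialized sample sets as: magic, then (length, payload) twice."""
    out = bytearray(MAGIC)
    for payload in (present, unknown):
        out += struct.pack("<Q", len(payload))  # 8-byte little-endian length
        out += payload
    return bytes(out)


def unpack_sets(data: bytes) -> Tuple[bytes, bytes]:
    """Inverse of pack_sets; raises ValueError on malformed input."""
    if data[:len(MAGIC)] != MAGIC:
        raise ValueError("input is not an encoded sample-set stream")
    offset = len(MAGIC)
    payloads = []
    for _ in range(2):
        if offset + 8 > len(data):
            raise ValueError("truncated sample-set stream")
        (length,) = struct.unpack_from("<Q", data, offset)
        offset += 8
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated sample-set stream")
        payloads.append(payload)
        offset += length
    return payloads[0], payloads[1]


def is_pipe(stream) -> bool:
    """True only when the stream is backed by a pipe (FIFO).

    Terminals, regular files, and objects without a file descriptor all
    return False, so they would get human-readable output.
    """
    try:
        mode = os.fstat(stream.fileno()).st_mode
    except (AttributeError, OSError, ValueError):
        return False
    return stat.S_ISFIFO(mode)
```

A command could then write `pack_sets(...)` to `sys.stdout.buffer` when `is_pipe(sys.stdout)` is true and print a table otherwise. Checking `sys.stdout.isatty()` alone would not be enough here, because it is also false when output is redirected to a file.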

apetkau commented 3 years ago

AND/OR/NOT boolean operations could be specified with a command-line option:

gdi query --not hasa 'S:D614G' | gdi query --or hasa 'S:G142D' --summarize

This would be read as "find all samples that do NOT have a D614G mutation OR have a G142D mutation".
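The semantics of that pipeline can be sketched as plain set algebra. This is only an illustration under assumptions: Python `set` objects stand in for the pyroaring bitmaps, the index contents are made up, and `query_stage` with its `negate` and `combine` parameters is a hypothetical stand-in for one `gdi query` invocation with `--not` or `--or`.

```python
from typing import Dict, Optional, Set

# Toy index: feature -> IDs of samples that have it (made-up data).
INDEX: Dict[str, Set[int]] = {
    "S:D614G": {1, 2, 3},
    "S:G142D": {3, 4},
}
ALL_SAMPLES: Set[int] = {1, 2, 3, 4, 5}


def query_stage(
    feature: str,
    incoming: Optional[Set[int]] = None,
    negate: bool = False,
    combine: str = "and",
) -> Set[int]:
    """Evaluate one pipeline stage.

    `incoming` is the set received on stdin (None for the first stage).
    `negate` complements this stage's own selection, like `--not`.
    `combine` chooses how the selection is merged with `incoming`.
    """
    selected = set(INDEX.get(feature, set()))
    if negate:
        selected = ALL_SAMPLES - selected
    if incoming is None:
        return selected
    if combine == "and":
        return incoming & selected
    if combine == "or":
        return incoming | selected
    raise ValueError(f"unknown combine mode: {combine!r}")


# gdi query --not hasa 'S:D614G' | gdi query --or hasa 'S:G142D'
first = query_stage("S:D614G", negate=True)                    # {4, 5}
result = query_stage("S:G142D", incoming=first, combine="or")  # {3, 4, 5}
```

Defaulting to intersection keeps the plain `hasa ... | hasa ...` pipeline equivalent to chaining `hasa()` calls in the Python API.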

apetkau commented 3 years ago

Option 2: Decode queries from a single string

As an alternative, I could just specify a string query language which can be passed directly to a single instance of gdi:

gdi query 'not hasa:S:D614G or hasa:S:G142D'

The advantage here is that I avoid the overhead of starting a new gdi process for every step of the query. Plus, this query language could be reused elsewhere, e.g., for web searching.
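As a rough sketch of how such a string could be evaluated in a single process, here is a small recursive-descent evaluator. Everything in it is an assumption for illustration: the grammar (whitespace-separated tokens, `not` binding tighter than `and`, which binds tighter than `or`, no parentheses), the `evaluate` function, and the toy index built from Python sets instead of pyroaring bitmaps.

```python
from typing import Dict, Optional, Set

# Toy index: feature -> IDs of samples that have it (made-up data).
INDEX: Dict[str, Set[int]] = {
    "S:D614G": {1, 2, 3},
    "S:G142D": {3, 4},
}
ALL_SAMPLES: Set[int] = {1, 2, 3, 4, 5}


def evaluate(query: str, index: Dict[str, Set[int]], universe: Set[int]) -> Set[int]:
    """Evaluate a query such as 'not hasa:S:D614G or hasa:S:G142D'."""
    tokens = query.split()
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> str:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or() -> Set[int]:
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and() -> Set[int]:
        result = parse_not()
        while peek() == "and":
            advance()
            result = result & parse_not()
        return result

    def parse_not() -> Set[int]:
        if peek() == "not":
            advance()
            return universe - parse_not()
        return parse_atom()

    def parse_atom() -> Set[int]:
        if peek() is None:
            raise ValueError("unexpected end of query")
        token = advance()
        # Split only on the first colon so 'hasa:S:D614G' keeps 'S:D614G' whole.
        kind, sep, value = token.partition(":")
        if kind != "hasa" or not sep or not value:
            raise ValueError(f"unsupported term: {token!r}")
        return set(index.get(value, set()))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result


print(evaluate("not hasa:S:D614G or hasa:S:G142D", INDEX, ALL_SAMPLES))  # {3, 4, 5}
```

Splitting each term on its first colon works for `hasa`, but it leaves no obvious place for extra per-term parameters.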

A disadvantage is that it becomes difficult to encode complicated queries using strings like this. For example, for distance-based queries, how would I specify the units and the type of the distance query? With command-line options I can pass them explicitly as --distance-unit or --distance-type.
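For comparison, the option-based form is easy to express with a standard argument parser. The sketch below uses `argparse` purely for illustration and is not how gdi is implemented; the program name `gdi-sketch`, the positional arguments `kind` and `value`, and the free-form string values are assumptions. Only the `--distance-unit`, `--distance-type`, and `--summarize` names come from the proposals above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser for a `query` subcommand."""
    parser = argparse.ArgumentParser(prog="gdi-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)

    query = subparsers.add_parser("query", help="select samples")
    query.add_argument("kind", help="query kind, e.g. hasa")
    query.add_argument("value", help="query value, e.g. S:D614G")
    # Allowed values are left open because they are not specified above.
    query.add_argument("--distance-unit", default=None)
    query.add_argument("--distance-type", default=None)
    query.add_argument("--summarize", action="store_true")
    return parser


args = build_parser().parse_args(["query", "hasa", "S:D614G", "--summarize"])
print(args.kind, args.value, args.summarize)  # hasa S:D614G True
```

Each parameter gets its own named slot here, which is exactly what the single-string form lacks.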