Open apetkau opened 3 years ago
AND/OR/NOT boolean operations could be specified with a command-line option:
gdi query --not hasa 'S:D614G' | gdi query --or hasa 'S:G142D' --summarize
This would be read as "find all samples that do NOT have a D614G mutation OR have a G142D mutation".
As an alternative, I could just specify a string query language which can be passed directly to a single instance of gdi:
gdi query 'not hasa:S:D614G or hasa:S:G142D'
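A minimal sketch of how a query string like this might be evaluated against sets of sample IDs. Everything in it is an assumption made for illustration: the `evaluate` function, the toy `FEATURES` and `ALL_SAMPLES` data, whitespace-separated tokens, and the precedence `not` > `and` > `or`. Plain Python sets stand in for the sample sets, and this is not gdi's actual parser.

```python
# Hypothetical toy data: feature name -> IDs of the samples that have it.
FEATURES = {
    "S:D614G": {1, 2, 3},
    "S:G142D": {3, 4},
}
ALL_SAMPLES = {1, 2, 3, 4, 5}


def evaluate(query, features=FEATURES, universe=ALL_SAMPLES):
    """Evaluate a query such as 'not hasa:S:D614G or hasa:S:G142D'.

    Assumed grammar (lowest to highest precedence):
        or_expr  := and_expr ('or' and_expr)*
        and_expr := not_expr ('and' not_expr)*
        not_expr := 'not' not_expr | 'hasa:<feature>'
    """
    tokens = query.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_or():
        nonlocal pos
        result = parse_and()
        while peek() == "or":
            pos += 1
            result = result | parse_and()
        return result

    def parse_and():
        nonlocal pos
        result = parse_not()
        while peek() == "and":
            pos += 1
            result = result & parse_not()
        return result

    def parse_not():
        nonlocal pos
        if peek() == "not":
            pos += 1
            # NOT is taken relative to the set of all known samples.
            return universe - parse_not()
        return parse_term()

    def parse_term():
        nonlocal pos
        token = peek()
        if token is None or not token.startswith("hasa:"):
            raise ValueError(f"expected 'hasa:<feature>', got {token!r}")
        pos += 1
        # Unknown features simply match no samples in this sketch.
        return set(features.get(token[len("hasa:"):], set()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


selected = evaluate("not hasa:S:D614G or hasa:S:G142D")
print(sorted(selected))
```

With the toy data above, the query selects samples 3, 4 and 5: samples 4 and 5 lack D614G, and sample 3 has G142D.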
The advantage here is that I don't have the overhead of creating multiple instances of gdi over and over again. Plus, this query language could be re-used elsewhere, e.g., for web searching.
A disadvantage is that it becomes difficult to encode complicated queries using strings like this. For example, for distance-based queries, how would I specify the units and type of distance query? With a command-line interface I can include them as options such as --distance-unit or --distance-type.
It would be nice to have a way of defining queries on samples using the command-line interface in addition to the Python API.
Option 1: use Unix pipes
Queries in my system operate on sets of sample IDs, which are encoded using pyroaring. These can be easily serialized/deserialized into bytes.
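A rough sketch of the round trip a sample set would make between two piped processes. To keep it runnable with the standard library alone, the hypothetical helpers `serialize_ids` and `deserialize_ids` pack IDs as unsigned 32-bit integers with `struct`; they are a stand-in, and in gdi the pyroaring calls `BitMap.serialize()` and `BitMap.deserialize()` would take their place. An `io.BytesIO` buffer plays the role of the pipe, and the sample IDs are made up.

```python
import io
import struct


def serialize_ids(sample_ids):
    """Pack sample IDs as sorted little-endian unsigned 32-bit integers."""
    ordered = sorted(sample_ids)
    return struct.pack(f"<{len(ordered)}I", *ordered)


def deserialize_ids(data):
    """Inverse of serialize_ids: bytes back to a set of sample IDs."""
    count = len(data) // 4
    return set(struct.unpack(f"<{count}I", data))


# Hypothetical query results held by each process.
has_d614g = {1, 2, 3}
has_g142d = {3, 4}

# First process: write its sample set as bytes (to stdout in a real pipe).
pipe = io.BytesIO()
pipe.write(serialize_ids(has_d614g))
pipe.seek(0)

# Second process: read the bytes (from stdin), rebuild the set,
# then narrow it with its own query.
upstream = deserialize_ids(pipe.read())
result = upstream & has_g142d
print(sorted(result))
```

Only the set of IDs crosses the process boundary, so each stage can stay unaware of how the previous stage produced it.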
So, this gives me an idea of a command-line interface using Unix pipes and sample sets. For example:
This would select those samples with the D614G mutation, encode the sample IDs using pyroaring, and pass them to stdin of another gdi instance, which would deserialize the sample sets and then select those with the G142D mutation. Finally, a summary table of the results would be printed to the user. This would be the equivalent in the Python API of:
This could be combined with building trees/alignments. For example:
This would build an alignment of all those genomes with a D614G mutation.
The details of this still need to be worked out. For example, I need to encode both the present and unknown sample sets, and find some way of automatically detecting whether data is being piped or printed to a terminal/file (where I may want to display human-readable results instead of encoded sets of sample identifiers).
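One possible way to approach both open points, sketched with the standard library only. The helpers `is_pipe`, `write_result` and `read_result` are hypothetical names, and JSON is a stand-in for whatever binary encoding of the two sample sets gdi would actually use. The check relies on `os.fstat` and `stat.S_ISFIFO`, which report whether a stream is a pipe; `isatty()` alone would not separate a pipe from a redirected file.

```python
import io
import json
import os
import stat


def is_pipe(stream):
    """Return True if the stream is backed by a pipe (FIFO)."""
    try:
        mode = os.fstat(stream.fileno()).st_mode
    except (OSError, ValueError, AttributeError):
        # In-memory streams and closed files have no usable descriptor.
        return False
    return stat.S_ISFIFO(mode)


def write_result(present, unknown, stream):
    """Emit encoded sets when piped, a readable summary otherwise."""
    if is_pipe(stream):
        # Both sets travel together so the next stage sees present and unknown.
        json.dump({"present": sorted(present), "unknown": sorted(unknown)}, stream)
    else:
        stream.write(f"present: {len(present)} samples\n")
        stream.write(f"unknown: {len(unknown)} samples\n")


def read_result(stream):
    """Decode the payload written by write_result on the piped path."""
    payload = json.load(stream)
    return set(payload["present"]), set(payload["unknown"])


# Demonstration with a real OS pipe standing in for `gdi ... | gdi ...`.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "w") as writer:
    write_result({1, 2}, {3}, writer)
with os.fdopen(read_fd) as reader:
    present, unknown = read_result(reader)
print(sorted(present), sorted(unknown))
```

In a real command the same functions would be called with `sys.stdout` and `sys.stdin`. Output redirected to a file takes the human-readable branch here, which matches the terminal/file case described above.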