Closed matz-e closed 3 years ago
I agree although I would go full regex rather than glob, personally: so, a nodeset could have: `{"foo": {"mtype": "/[lL]5.*"}}`
I've added the '/' as the signal that it's a regex. To me it's the most natural marker of a regex, but that's debatable. I'm not sure what the collision probability is of people using '/' as the start of a string value for attributes, but it's possible for directories...
Another option would be `{"foo": {"mtype": "regex:[lL]5_.*"}}`, or something more verbose, I suppose.
Fine with me, I guess. The added benefit of using full-blown regex is support by the STL for new enough standards.
With a prefix, I would suggest maybe adding `re:` and `val:` to be able to fully disambiguate (the latter being the default). Slightly shorter than `regex:` :)
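To make the prefix idea concrete, here is a minimal sketch of how such values could be parsed. The function name `make_matcher` is hypothetical, and anchoring the regex with `fullmatch` is an assumption; the thread does not say whether a partial match should count.

```python
import re

def make_matcher(spec):
    """Build a predicate from an attribute spec (hypothetical helper).

    Assumed convention: "re:" marks a regex, "val:" (or no prefix) marks
    an exact value.
    """
    if spec.startswith("re:"):
        pattern = re.compile(spec[len("re:"):])
        # Assumption: the whole value must match the pattern.
        return lambda value: pattern.fullmatch(value) is not None
    if spec.startswith("val:"):
        # "val:" lets a literal that itself starts with "re:" be expressed.
        spec = spec[len("val:"):]
    return lambda value: value == spec

mtypes = ["L5_TPC", "l5_UPC", "L6_BPC", "re:odd"]

is_l5 = make_matcher("re:[lL]5_.*")
l5_values = [m for m in mtypes if is_l5(m)]

is_literal = make_matcher("val:re:odd")
literal_values = [m for m in mtypes if is_literal(m)]

print(l5_values)       # ['L5_TPC', 'l5_UPC']
print(literal_values)  # ['re:odd']
```

The explicit `val:` prefix is what removes the collision risk: any string can still be matched literally, whatever it starts with.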
Hi guys,
I already have something like this in bluepysnap and bluepy. I must say, I am not really in favor of adding the "function" itself (here: `regex:`) to the argument (here: `"[lL]5_.*"`). I removed this kind of query entirely from bluepy and bluepysnap.
I prefer to have something more generic (but more verbose) that we can apply to other kinds of complex queries in the future. I am now using an "à la mongo" type of query:
{"mtype": {'$regex': '[lL]5_.*'}}
So, applied to the node sets, the regex case becomes:
`{'foo': {"mtype": {'$regex': '[lL]5.*'}}}`
This kind of syntax allows us to have more complex queries in the future like:
{"mtype": {'$complex_function': [arg1, arg2]}}
where it is easy to catch the function name + arguments (no matter the types and the number of args).
I am also using this kind of syntax to create `or`, `and` and `node_set` queries. These are in different scopes though.
Ex :
from bluepysnap import Circuit
circuit = Circuit("config_file.json")
nodes_pop = circuit.nodes["pop_name"]
nodes_pop.ids({"$node_set": 'Node12_L6_Y'})
nodes_pop.ids({"$and": [{"mtype": 'morphtype-a', "population": "default"}, {"etype": "etype-B"}]})
nodes_pop.ids({"$or": [{"mtype": {'$regex': '[lL]5_.*'}, "population": "default"}, {"etype": "etype-B"}]})
You can of course combine the and/or queries.
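As an illustration of why the nested form is easy to process, here is a small self-contained sketch of a recursive evaluator for `$and`, `$or` and `$regex`. It is not bluepysnap's implementation: the `matches` function, the plain-dict nodes and the use of `re.fullmatch` are assumptions made for the example.

```python
import re

def matches(node, query):
    """Return True if `node` (a dict of attributes) satisfies `query`.

    Hypothetical evaluator: all entries of a query dict must hold.
    """
    for key, expected in query.items():
        if key == "$and":
            if not all(matches(node, sub) for sub in expected):
                return False
        elif key == "$or":
            if not any(matches(node, sub) for sub in expected):
                return False
        elif isinstance(expected, dict):
            # Operator dictionary; only "$regex" is handled in this sketch.
            for op, arg in expected.items():
                if op != "$regex":
                    raise ValueError(f"unsupported operator: {op}")
                if re.fullmatch(arg, node[key]) is None:
                    return False
        elif isinstance(expected, list):
            # A list of values means "any of these", as in SONATA node sets.
            if node[key] not in expected:
                return False
        elif node[key] != expected:
            return False
    return True

nodes = [
    {"mtype": "L5_TPC", "etype": "etype-A", "population": "default"},
    {"mtype": "L6_BPC", "etype": "etype-B", "population": "default"},
    {"mtype": "L2_IPC", "etype": "etype-A", "population": "other"},
]
query = {"$or": [{"mtype": {"$regex": "[lL]5_.*"}, "population": "default"},
                 {"etype": "etype-B"}]}
ids = [i for i, node in enumerate(nodes) if matches(node, query)]
print(ids)  # [0, 1]
```

Because the operator name is a dictionary key and its arguments are the value, adding a new operator only means adding a branch; no string parsing is involved.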
Sounds like a solid proposal to me!
I think we should stay as close to the SONATA standard as possible: nesting another dictionary doesn't seem to help much, and the more 'complex' versions can be created from the current building blocks, no?
The SONATA standard is very limited and only takes into account: exact match for a value, the `or` for a single variable (via brackets: `{"val": [1,2]}`), the `or` for the compound node sets, and the `and` (`{"val1": 1, "val2": 2}`) for different fields.
But for floats, or even comparison operators like greater/greater_than, lower/lower_than, or ranges,
or for strings: startswith (if we need better performance), contains, regex,
or the `and` for node sets or queries, the `not` for queries, etc.: this is complete limbo.
These are things we need to be able to do if we want to provide a real query experience. I think it is up to us to add the extra queries we need, come up with a good way of doing this, and include it in the standard.
So today, the status for queries and node sets (which are related but not exactly the same):
- no `and` of multiple node sets.
- `or` only via the node set names (the compounds), so it is not possible to use it in a normal query like:
  nodes.ids({"$or": [{"mtype": "mtype-1"}, {"etype": "etype-1"}]})
  if the sub-queries are not already defined in the node set file.
- no `and` and `or` for node sets --> you need to extract the ids and then combine them yourself.
- no comparison operators (`>`, `>=`, `!=`, etc ...)
nesting another dictionary doesn't seem to help much
I disagree with this. As I said, I really don't like the "regex:
{"mtype": {"$regex": "[lL]5_.*"}}
{"mtype": {"$startswith": "L5"}}
{"position_x": {"$gt": 600}}
{"position_x": {"$gte": 600}, "position_x": {"$lte": 800}}
{"position_x": {"$in": [600, 800]}}
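One way to read the operator examples above is as a lookup table from operator name to a two-argument predicate. The sketch below assumes that reading; `OPERATORS`, `field_matches` and `select` are hypothetical names, `$in` is treated as membership in a list (as in MongoDB), and the range query puts both bounds in one operator dictionary because a Python dict literal keeps only the last value for a repeated key.

```python
import operator
import re

# Hypothetical operator table: name -> predicate(value, argument).
OPERATORS = {
    "$regex": lambda value, arg: re.fullmatch(arg, value) is not None,
    "$startswith": lambda value, arg: value.startswith(arg),
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, arg: value in arg,
}

def field_matches(value, condition):
    """Check one attribute value against an exact value or an operator dict."""
    if isinstance(condition, dict):
        # Every operator in the dictionary must hold.
        return all(OPERATORS[op](value, arg) for op, arg in condition.items())
    return value == condition

def select(nodes, query):
    """Return the indices of the nodes for which every field condition holds."""
    return [i for i, node in enumerate(nodes)
            if all(field_matches(node[key], cond) for key, cond in query.items())]

nodes = [
    {"mtype": "L5_TPC", "position_x": 550.0},
    {"mtype": "L5_UPC", "position_x": 650.0},
    {"mtype": "L6_BPC", "position_x": 800.0},
    {"mtype": "L6_TPC", "position_x": 900.0},
]

print(select(nodes, {"mtype": {"$startswith": "L5"}}))            # [0, 1]
print(select(nodes, {"position_x": {"$gte": 600, "$lte": 800}}))  # [1, 2]
print(select(nodes, {"position_x": {"$in": [600, 800]}}))         # [2]
```

With this shape, a new operator is one more entry in the table, and its argument can be of any type.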
I disagree with this.
Yeah, that's a good point.
We're basically recreating a full query language; I'm still bitter we didn't save the nodes file in sqlite.
Anyways, I think we need to do a test implementation with the above proposal, and see if it works well.
It occurs to me that if we're doing $gt/$lt, etc., perhaps the querying should also apply to `EdgePopulations`. Before, direct comparison wasn't that useful, since most (-ish?) things are float in `EdgePopulations`, but now it might be useful?
Could be useful for selecting wider brain-regions?
One obvious drawback would be the sheer size of the edge files. Now if only we had an MPI-capable reader, this may be optimized :)
One obvious drawback would be the sheer size of the edge files.
Loading one column at a time should be doable, no? Might have to chunk it, I guess.
MPI-capable reader
I won't bite. If you have MPI, you have enough RAM on a node to load a column at a time...
Loading one column at a time should be doable, no? Might have to chunk it, I guess.
If we offer it, they will use it. I think right now the scale of the issue may make processing whole columns in memory feasible for the near future.
Then there is also the speed of the selection. Of course, one could parallelize differently, e.g., just throw `rayon` at it, since it's so easy…
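The "one column at a time, chunked" idea can be sketched as below. This is only an illustration: `select_chunked` is a hypothetical name, and a plain Python list stands in for an on-disk edge column; the assumption is that the real column supports `len()` and slicing, so only one chunk is held in memory at a time.

```python
def select_chunked(column, predicate, chunk_size=1_000_000):
    """Return the indices where `predicate` holds, scanning chunk by chunk.

    Assumes `column` supports len() and slicing, so each slice can be
    read on demand instead of loading the whole column.
    """
    selected = []
    for start in range(0, len(column), chunk_size):
        chunk = column[start:start + chunk_size]  # one bounded read
        selected.extend(start + offset
                        for offset, value in enumerate(chunk)
                        if predicate(value))
    return selected

# Stand-in for an edge attribute column (hypothetical integer data).
column = [i % 50 for i in range(10_000)]
ids = select_chunked(column, lambda value: value >= 46, chunk_size=1024)
print(len(ids))  # 800
```

Each chunk is independent of the others, so the same loop could later be split across threads or processes if selection speed becomes the bottleneck.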
We're basically recreating a full query language; I'm still bitter we didn't save the nodes file in sqlite.
I agree, or in whatever graph DB with Cypher-like queries:
MATCH (n:neuron_population_name)
WHERE n.mtype = 'SLM' AND n.etype IN ['BC','AC'] AND n.morphology CONTAINS 'L6'
RETURN n
or if we want to use the edges in the queries :
MATCH (n:neurons)-[connection:neuron_astrocyte]->(astro:astrocytes)
WHERE n.mtype = 'SLM'
AND NOT exists((n)-[:neuro_neuro]->(:neurons))
RETURN astro.id
...
FYI: I started on this
Would be nice if the string matching in the NodeSet functionality (and by extension, I guess, attribute matching) could perform a glob match to not have to specify dozens of values.
(spin off from #118)
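For reference, the glob matching requested here can be sketched with Python's standard `fnmatch` module. The `glob_select` helper and the sample values are hypothetical; `fnmatchcase` is used so that matching stays case-sensitive on every platform.

```python
from fnmatch import fnmatchcase

def glob_select(values, pattern):
    """Return the values that match a shell-style glob pattern (case-sensitive)."""
    return [value for value in values if fnmatchcase(value, pattern)]

mtypes = ["L5_TPC", "L5_UPC", "l5_MC", "L6_BPC"]

upper_l5 = glob_select(mtypes, "L5_*")
any_l5 = glob_select(mtypes, "[lL]5_*")

print(upper_l5)  # ['L5_TPC', 'L5_UPC']
print(any_l5)    # ['L5_TPC', 'L5_UPC', 'l5_MC']
```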