BlueBrain / libsonata

A python and C++ interface to the SONATA format
https://libsonata.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Glob matching in NodeSets #119

Closed matz-e closed 3 years ago

matz-e commented 3 years ago

Would be nice if the string matching in the NodeSet functionality (and by extension, I guess, attribute matching) could perform a glob match to not have to specify dozens of values.

(spin off from #118)

mgeplf commented 3 years ago

I agree although I would go full regex rather than glob, personally: so, a nodeset could have: `{"foo": {"mtype": "/[lL]5.*"}}`

I've added the '/' as the signal that it's a regex. To me it's the most natural marker of a regex, but that's debatable. I'm not sure what the collision probability is of people using '/' as the start of a string value for attributes, but it's possible for directories...

Another option would be {"foo": {"mtype": "regex:[lL]5_.*"}}, or something more verbose, I suppose.

matz-e commented 3 years ago

Fine with me, I guess. The added benefit of using full-blown regex is support by the STL for new enough standards.

With a prefix, I would suggest maybe adding re: and val: to be able to fully disambiguate (the latter being the default). Slightly shorter than regex: :)
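
A minimal sketch of how the `re:` / `val:` prefixes could be disambiguated. The function name `compile_matcher` is hypothetical and is not part of libsonata, and the choice to require a full match (rather than a search) is an assumption:

```python
import re


def compile_matcher(spec):
    """Return a predicate for a string spec with an optional 're:' or 'val:' prefix."""
    if spec.startswith("re:"):
        pattern = re.compile(spec[len("re:"):])
        # Assumption: the whole attribute value has to match the pattern.
        return lambda value: pattern.fullmatch(value) is not None
    if spec.startswith("val:"):
        literal = spec[len("val:"):]
    else:
        literal = spec  # no prefix: plain value, i.e. the default
    return lambda value: value == literal


mtypes = ["L5_TPC", "l5_UPC", "L6_BPC", "re:odd"]

is_l5 = compile_matcher("re:[lL]5_.*")
print([m for m in mtypes if is_l5(m)])  # ['L5_TPC', 'l5_UPC']

# 'val:' makes it possible to match a literal value that itself starts with 're:'.
is_literal = compile_matcher("val:re:odd")
print([m for m in mtypes if is_literal(m)])  # ['re:odd']
```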

tomdele commented 3 years ago

Hi guys, I already have something like this in bluepysnap and bluepy. I must say, I am not really in favor of adding the "function" itself (here: regex:) to the argument (here : "[lL]5_.*"). I removed this kind of queries entirely from bluepy and bluepysnap.

I prefer to have something more generic (but more verbose) that we can apply to other kinds of complex queries in the future. I am now using "à la mongo" queries: {"mtype": {'$regex': '[lL]5_.*'}}. Applied to the nodesets, the regex case becomes: `{'foo': {"mtype": {'$regex': '[lL]5.*'}}}`.

This kind of syntax allows us to have more complex queries in the future like: {"mtype": {'$complex_function': [arg1, arg2]}} where it is easy to catch the function name + arguments (no matter the types and the number of args).

tomdele commented 3 years ago

I am also using this kind of syntax to create `$or`, `$and` and `$node_set` queries. These are in different scopes, though.

Ex :

from bluepysnap import Circuit
circuit = Circuit("config_file.json")
nodes_pop = circuit.nodes["pop_name"]
nodes_pop.ids({"$node_set": 'Node12_L6_Y'})
nodes_pop.ids({"$and": [{"mtype": 'morphtype-a', "population": "default"}, {"etype": "etype-B"}]})
nodes_pop.ids({"$or": [{"mtype": {'$regex': '[lL]5_.*'}, "population": "default"}, {"etype": "etype-B"}]})

You can of course combine the and/or queries.
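
To illustrate how such combined queries can be evaluated, here is a small recursive sketch over plain dictionaries. It is not the bluepysnap implementation: the `matches` helper and the sample nodes are made up for the example, `$node_set` is left out, and `$regex` is assumed to require a full match:

```python
import re


def matches(node, query):
    """Return True if a node (dict of attributes) satisfies the query."""
    for key, expected in query.items():
        if key == "$and":
            if not all(matches(node, sub) for sub in expected):
                return False
        elif key == "$or":
            if not any(matches(node, sub) for sub in expected):
                return False
        elif isinstance(expected, dict) and "$regex" in expected:
            if re.fullmatch(expected["$regex"], node[key]) is None:
                return False
        elif node[key] != expected:
            return False
    return True


nodes = [
    {"mtype": "L5_TPC", "etype": "etype-A", "population": "default"},
    {"mtype": "L6_BPC", "etype": "etype-B", "population": "default"},
    {"mtype": "L2_IPC", "etype": "etype-A", "population": "default"},
]

query = {"$or": [{"mtype": {"$regex": "[lL]5_.*"}, "population": "default"},
                 {"etype": "etype-B"}]}
ids = [i for i, node in enumerate(nodes) if matches(node, query)]
print(ids)  # [0, 1]

# Nesting works because "$and" and "$or" call matches() on their sub-queries.
nested = {"$and": [query, {"etype": "etype-A"}]}
nested_ids = [i for i, node in enumerate(nodes) if matches(node, nested)]
print(nested_ids)  # [0]
```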

matz-e commented 3 years ago

Sounds like a solid proposal to me!

mgeplf commented 3 years ago

I think we should stay as close to the SONATA standard as possible: nesting another dictionary doesn't seem to help much, and the more 'complex' versions can be created from the current building blocks, no?

tomdele commented 3 years ago

The sonata standard is very limited and only takes into account : exact match for a value, the or for a single variable (via brackets: {"val": [1,2]}), the or for the compounds node sets, the and ({"val1": 1, "val2":2}) for different fields.
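
The semantics listed above can be sketched in a few lines. This is a simplified illustration and not libsonata code: `node_matches`, `resolve` and the sample data are hypothetical, and special keys such as `population` or `node_id` are ignored:

```python
def node_matches(node, rule):
    """Basic rule: every key must match (and); a list value means any-of (or)."""
    for attr, expected in rule.items():
        allowed = expected if isinstance(expected, list) else [expected]
        if node[attr] not in allowed:
            return False
    return True


def resolve(node_sets, name, nodes):
    """Return the sorted ids selected by the node set called `name`."""
    rule = node_sets[name]
    if isinstance(rule, list):
        # Compound node set: union (or) of the referenced node sets.
        ids = set()
        for sub in rule:
            ids.update(resolve(node_sets, sub, nodes))
        return sorted(ids)
    return [i for i, node in enumerate(nodes) if node_matches(node, rule)]


nodes = [
    {"val1": 1, "val2": 2},
    {"val1": 2, "val2": 2},
    {"val1": 3, "val2": 5},
]
node_sets = {
    "either": {"val1": [1, 2]},          # or for a single variable
    "both": {"val1": 1, "val2": 2},      # and across different fields
    "five": {"val2": 5},                 # exact match
    "compound": ["both", "five"],        # or across node sets
}

print(resolve(node_sets, "either", nodes))    # [0, 1]
print(resolve(node_sets, "both", nodes))      # [0]
print(resolve(node_sets, "compound", nodes))  # [0, 2]
```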

But for floats, or even comparison operators like greater/greater_than, lower/lower_than, or ranges; for strings: startswith (if we need better performance), contains, regex; for the and of node sets or queries, the not for queries, etc., this is complete limbo.

These are things we need to be able to do if we want to provide a real query experience. I think it is up to us to add the extra queries we need, imagine a good way of doing this, and include it in the standard.

So today, the status for queries and node sets (which are related but not exactly similar) :

nesting another dictionary doesn't seem to help much

I disagree with this. As I said, I really don't like the "regex:" or "@". This is too string-specific, because it is the only use case where you can add a function inside the argument. I prefer to have a common way of addressing all the possible operators and their arguments. It makes everything clearer and more homogeneous. Like:

{"mtype": {"$regex": "[lL]5_.*"}}
{"mtype": {"$startswith": "L5"}}
{"position_x": {"$gt": 600}}
{"position_x":  {"$gte": 600}, "position_x":  {"$lte": 800}}
{"position_x": {"$in": [600, 800]}}
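
One way such operators could be dispatched is a lookup table from operator name to a two-argument function. This is only a sketch under stated assumptions: `OPERATORS`, `attribute_matches` and `select` are hypothetical names, `$lt` is added for symmetry, `$in` is treated as membership (as in MongoDB), and `$regex` as a full match. Because a Python dict (like most JSON parsers) keeps only the last of two identical keys, the range below puts both bounds under a single `position_x` key:

```python
import operator
import re

# Each operator takes (attribute value, operator argument) and returns a bool.
OPERATORS = {
    "$regex": lambda value, arg: re.fullmatch(arg, value) is not None,
    "$startswith": lambda value, arg: value.startswith(arg),
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, arg: value in arg,
}


def attribute_matches(value, condition):
    """A plain condition is an exact match; a dict is a set of operators, all required."""
    if not isinstance(condition, dict):
        return value == condition
    return all(OPERATORS[name](value, arg) for name, arg in condition.items())


def select(nodes, query):
    """Return the ids of the nodes for which every attribute condition holds."""
    return [i for i, node in enumerate(nodes)
            if all(attribute_matches(node[attr], cond) for attr, cond in query.items())]


nodes = [
    {"mtype": "L5_TPC", "position_x": 550.0},
    {"mtype": "L5_UPC", "position_x": 700.0},
    {"mtype": "L6_BPC", "position_x": 800.0},
]

print(select(nodes, {"position_x": {"$gte": 600, "$lte": 800}}))                   # [1, 2]
print(select(nodes, {"mtype": {"$startswith": "L5"}, "position_x": {"$gt": 600}}))  # [1]
print(select(nodes, {"position_x": {"$in": [600, 800]}}))                          # [2]
```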

mgeplf commented 3 years ago

I disagree with this.

Yeah, that's a good point.

We're basically recreating a full query language; I'm still bitter we didn't save the nodes file in sqlite.

Anyways, I think we need to do a test implementation with the above proposal, and see if it works well.

mgeplf commented 3 years ago

It occurs to me that if we're doing $gt/$lt, etc, perhaps the querying should also apply to EdgePopulations. Before, direct comparison wasn't that useful, since most (-ish?) things are float in EdgePopulations, but now it might be useful?

matz-e commented 3 years ago

Could be useful for selecting wider brain-regions?

One obvious drawback would be the sheer size of the edge files. Now if only we had an MPI-capable reader, this may be optimized :)

mgeplf commented 3 years ago

One obvious drawback would be the sheer size of the edge files.

Loading one column at a time should be doable, no? Might have to chunk it, I guess.
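
A rough sketch of the chunked, one-column-at-a-time idea. Everything here is illustrative: `select_in_chunks` and `read_chunk` are hypothetical names, and the column is a plain list standing in for an on-disk dataset (with a real edge file, `read_chunk` would slice the HDF5 dataset instead):

```python
def select_in_chunks(read_chunk, total, predicate, chunk_size):
    """Collect the indices whose value satisfies `predicate`, reading `chunk_size` values at a time."""
    selected = []
    for start in range(0, total, chunk_size):
        stop = min(start + chunk_size, total)
        chunk = read_chunk(start, stop)  # only this slice is held in memory
        selected.extend(start + offset
                        for offset, value in enumerate(chunk)
                        if predicate(value))
    return selected


# Stand-in for one attribute column of an edge population.
column = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.95]


def read_chunk(start, stop):
    return column[start:stop]


hits = select_in_chunks(read_chunk, len(column), lambda w: w > 0.6, chunk_size=3)
print(hits)  # [1, 3, 5, 6]
```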

MPI-capable reader

I won't bite. If you have MPI, you have enough ram on a node to load a column at a time...

matz-e commented 3 years ago

Loading one column at a time should be doable, no? Might have to chunk it, I guess.

If we offer it, they will use it. I think right now the scale of the issue may make processing whole columns in memory feasible for the near future.

Then there is also the speed of the selection. Of course, one could parallelize differently, e.g., just throw rayon at it, since it's so easy…

tomdele commented 3 years ago

We're basically recreating a full query language; I'm still bitter we didn't save the nodes file in sqlite.

I agree, or in whatever graph DB with Cypher-like queries:

MATCH (n:neuron_population_name)
WHERE n.mtype = 'SLM' AND n.etype IN ['BC','AC'] AND n.morphology CONTAINS 'L6'
RETURN n

or if we want to use the edges in the queries :

MATCH (n:neurons)-[connection:neuron_astrocyte]->(astro:astrocytes)
WHERE  n.mtype = 'SLM'
AND NOT exists((n)-[:neuro_neuro]->(:neurons))
RETURN astro.id

...

mgeplf commented 3 years ago

FYI: I started on this