Closed: CJ-Wright closed this issue 5 years ago
Attn: @chiahaoliu @sbillinge
Definitely not on the broker (as that gets into all sorts of 'how long until we invalidate this' questions). The broker currently has a `prepare_hook` function (see https://nsls-ii.github.io/databroker/api.html?highlight=prepare_hook#advanced-controlling-the-return-type) which you could set from the client to stash the query on the returned object.
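Roughly like this, as a sketch: `prepare_hook(name, doc) -> doc` is the documented signature, but the `search_and_tag` helper and the `_query` key are made up here just to illustrate stashing the query on the returned documents.

```python
import copy

def search_and_tag(db, **query):
    """Run a search and tag the returned documents with the query that produced them."""
    def hook(name, doc):
        doc = copy.deepcopy(dict(doc))   # plain dict copy, as in the prepare_hook docs
        doc['_query'] = dict(query)      # speculative: stash the originating query
        return doc
    db.prepare_hook = hook               # applied when documents are materialized
    return db(**query)
```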
What is the motivation for this? Data should be equivalent, independent of how you asked for it.
This is for creating Bundles. In the bundle creation it would be nice to be able to store how the multiple headers were obtained (the query sent to the databroker). Although this could be more generally applicable, since sometimes you'd like to re-run a query, e.g. after more data came in, and it would be handy to say "I liked this query, how did I get it / how can I get more like it".
Here is my UC:
1) experimenter has somewhat complicated data. For whatever reason she only wants to propagate every fourth image, and eliminates all images tagged with "bad"
2) experimenter spends a lot of time creating a really precise filter that gets just exactly what she wants
3) experimenter saves the filter and goes home to bed, exhausted.
4) experimenter comes in the next day and wants to pick up where she left off. She pulls the filter from a database and works more on it.
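For concreteness, the kind of filter meant here might look something like this (purely a sketch; the `tags` key and the function name are made up, though bluesky events do carry a `seq_num`):

```python
def my_precise_filter(events):
    """Keep every fourth image and drop anything tagged "bad"."""
    for ev in events:
        if 'bad' in ev.get('tags', []):
            continue                     # eliminate images tagged "bad"
        if ev['seq_num'] % 4 != 0:
            continue                     # propagate only every fourth image
        yield ev
```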
Good use case. I have been that "experimenter" myself! Before we jump to a solution, I'd be interested to understand the problem a little better.
Is the filter generated by some GUI, or programmatically?
What problem is stashing the filter in a database intended to solve? For the sake of argument, could it be stashed locally, either by the GUI (`~/.config/my_awesome_gui/filters/my_awesome_filter.json`) or directly by the user as a Python script (`~/my_awesome_filter.py`)?
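For instance, the local stash could be as simple as this (a sketch; the filter keys are illustrative and the path mirrors the hypothetical one above):

```python
import json
import pathlib

filter_spec = {"keep_every": 4, "exclude_tags": ["bad"]}
path = pathlib.Path.home() / ".config" / "my_awesome_gui" / "filters" / "my_awesome_filter.json"
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(filter_spec))

# the next morning: reload and keep working
restored = json.loads(path.read_text())
```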
I gravely misunderstood what @CJ-Wright was asking initially :sheep: .
@danielballan 's suggestion of starting with a json or python representation seems like a good place to start. This also suggests that alongside the broker we need a first-class `Query` object.
Simon's UC also greatly blurs the lines of how we think about the things coming out of the data broker (the difference between selecting which Start documents are of interest and slicing into the data of a single run).
:+1: on the first-class `Query` object
Just to be sure we're on the same page -- would a `Query` encapsulate a query over Headers -- i.e. a dict -- and a query for (a subset of) Events in those Header(s)?
Once you start to get to the second it gets really messy really fast as you will want to put in conditionals on the slicing based on what is in the header or the data.
I think to start with we should have a `Query` object that is just a heavily managed dict that provides `&` and `|` operations and knows how to turn itself into some nice serialization (json?) and into the input that the Broker needs to search the `HeaderSource` objects. This may be a place where we can provide a shim to make mongo and sql look the same (to the user) by giving it `_mongo_query_`, `_sql_query_`, `_elastic_query_`, `_graphql_query_`, ... methods.
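Something like the following, as a very rough sketch of such a `Query` object (none of this exists yet; the class, the method names, and the mongo-style `$and`/`$or` terms are all illustrative):

```python
import json


class Query:
    """A heavily managed dict describing a search over Start documents."""

    def __init__(self, terms=None, **kwargs):
        self._terms = dict(terms or {}, **kwargs)

    def __and__(self, other):
        # AND: a run must match both sets of terms.
        return Query({'$and': [self._terms, other._terms]})

    def __or__(self, other):
        # OR: a run may match either set of terms.
        return Query({'$or': [self._terms, other._terms]})

    def to_json(self):
        # serialize so the query can be stashed and re-run later
        return json.dumps(self._terms)

    def _mongo_query_(self):
        # the mongo backend could consume the dict as-is; _sql_query_,
        # _elastic_query_, _graphql_query_, ... would translate for other backends
        return dict(self._terms)


# build a query, save it, restore it the next day
q = Query(sample_name='Ni') & Query(tags='good')
restored = Query(json.loads(q.to_json()))
```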
We should also see if we can just use graphql to solve this problem (I suspect it will only solve most of it; it works because you can specify all of your objects' schemas well, which we cannot, in general, do for Start documents (yet)).
I think that part of this line blurring can be helped by combining `Query` objects and pipelines. If the `Query` object is handed to the pipeline in some way that it could understand, then we could write down somewhere exactly how you asked for the multi-header results. At this point the data could be bundled together into one "header" which contains metadata from all the incoming headers and interleaves the events (maybe by event timestamp?). This single header would pass into the pipeline to be filtered as needed (max image intensity, sequence number mod 4?). By storing the provenance of the pipeline we can reproduce the results of this header + event level filter (we can get the headers from the query, combine them in the same way, and run them through the exact same pipeline).
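The interleaving step might be as simple as this (a sketch with a hypothetical helper; it assumes each header's `events()` iterator is already ordered by the `time` key):

```python
import heapq

def interleave_events(headers):
    """Yield events from several headers in global timestamp order."""
    streams = [h.events() for h in headers]
    yield from heapq.merge(*streams, key=lambda ev: ev['time'])
```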
So a few comments here, in somewhat random order.
1) On the local vs database stashing of the info: I am interested in extending the event model and pipelining to data analysis for provenance purposes. So I would like to create some kind of analysis tree, get it to some point where I do a deliberate export (🥇), then have someone else come along and get the DE from its doi, and then rerun it and work on it. That is a big lift obviously, but let's at least move in that direction.
2) I think the bundling of data (or data-streams) contains a lot of the complexity of any data analysis. It is the kind of munging step and often takes the majority of the time. To put that time in and then not have it as part of the provenance is counter-productive.
3) We are just starting to "play" with these concepts, and don't know the best way either to do them or to capture them, so this will evolve over time. From that point of view maybe it isn't the best to PR it into databroker, but CJ has his reasons. However, conceptually, the munging and filtering could be thought of as legit data analysis steps on the graph, not so different from "subtracting background" or something.
Not sure if these comments help but....
btw, I am spending a lot of time these days working with records databases for my group activities: grants, people, appointments, etc.
Something that I am enjoying using is making simple filter functions and then chaining them. For example, I want all my grant applications that are (current AND funded) so I do this by:
```python
grdb = load_up("grdb")                 # load the grants database
use1 = is_current(grdb)                # keep only current grants
use = is_key_has_value(use1, "application_status", "approved")  # ...that are also approved
```
This sequence gives AND. For OR I would give grdb as the argument in each case and then chain the results.
For me this is proving to be a powerful and intuitive way to do the munging. I am not sure it scales performance-wise and so on, so building complicated mongo queries and then applying them in one go may end up being better, but for now this is working for me.
@sbillinge you should have a look at https://toolz.readthedocs.io/en/latest/api.html which provides functional (streaming when possible) primitives for many of those operations, along with https://docs.python.org/3.6/library/itertools.html and the built-in filter (https://docs.python.org/3.6/library/functions.html#filter).
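For example, the same AND/OR chaining with toolz and the built-in filter might look like this (a sketch; `grdb` and the predicates are illustrative stand-ins for the ones above):

```python
import itertools
from toolz import pipe
from toolz.curried import filter as cfilter

grdb = [{"status": "current", "application_status": "approved"},
        {"status": "past", "application_status": "approved"}]

is_current = lambda rec: rec.get("status") == "current"
is_approved = lambda rec: rec.get("application_status") == "approved"

# AND: pipe the records through successive (lazy) filters
current_and_approved = list(pipe(grdb, cfilter(is_current), cfilter(is_approved)))

# OR: filter the original collection with each predicate, then chain the results
current_or_approved = list(itertools.chain(filter(is_current, grdb),
                                           filter(is_approved, grdb)))
```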
I am not sure we want to hand 'query'-like things into pipelines, as then the pipeline code would have to know what to hand that query to!
From the PoV of provenance, why does the query used to find the headers matter?
> I think the bundling of data (or data-streams) contains a lot of the complexity of any data analysis. It is the kind of munging step and often takes the majority of the time. To put that time in and then not have it as part of the provenance is counter-productive.
One of the goals of the Document model is to do as much of the 'correct' bundling as possible at collection time and to avoid having to do large amounts of multi-start merging operations at the top of analysis pipelines, or single-start 'forking' (particularly based on sequence number!). Can you push any of this information back into xpdacq and how it manages creating start documents / streams?
I think everyone wants 1, and I definitely agree on 3.
The Catalog of search results returned by intake knows its own query, which I think addresses the original prompt here.
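For example (a sketch; the catalog name is made up, and exactly where the originating query is exposed depends on the databroker/intake version):

```python
from databroker import catalog          # intake-based API

db = catalog['xpd']                      # 'xpd' is an illustrative catalog name
results = db.search({'sample_name': 'Ni'})
subset = results.search({'operator': 'simon'})   # searches compose on the result
```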
Would it be possible to stuff the query information (the actual kwargs and the filters) into the `Broker` or the `Results`? This would enable users to get back the query that they ran to get a given result.