bluesky / databroker

Unified API pulling data from multiple sources
https://blueskyproject.io/databroker
BSD 3-Clause "New" or "Revised" License

Stash query information into Broker or Results #345

Closed CJ-Wright closed 5 years ago

CJ-Wright commented 6 years ago

Would it be possible to stuff the query information (the actual kwargs and the filters) into the Broker or the Results? This would enable users to get back the query that they ran to get a given result.

CJ-Wright commented 6 years ago

Attn: @chiahaoliu @sbillinge

tacaswell commented 6 years ago

Definitely not on the broker (as that gets into all sorts of 'how long until we invalidate this' questions). The broker currently has a prepare_hook function (see https://nsls-ii.github.io/databroker/api.html?highlight=prepare_hook#advanced-controlling-the-return-type) which you could set from the client to stash the query on the returned object.

What is the motivation for this? Data should be equivalent, independent of how you asked for it.
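For illustration, a minimal client-side sketch of the stash-it-on-the-client idea: rather than storing the query on the Broker itself, a thin wrapper (the helper name and kwargs here are hypothetical) returns the query alongside the headers it produced.

```python
# Hypothetical helper (not an existing databroker API): keep the query
# together with the headers it returned, entirely on the client side.
def search_and_stash(db, **query):
    """Run a Broker search and return (headers, query) together."""
    headers = list(db(**query))   # Broker instances are callable with search kwargs
    return headers, dict(query)

# headers, query = search_and_stash(db, plan_name='count', sample='kapton')
```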

CJ-Wright commented 6 years ago

This is for creating Bundles. In the bundle creation it would be nice to be able to store how the multiple headers were obtained (the query sent to the databroker). This could also be more generally applicable: sometimes you'd like to re-run a query, e.g. after more data came in, and it would be handy to say "I liked this query; how did I get it, and how can I get more like it?"

sbillinge commented 6 years ago

Here is my UC:

1. Experimenter has somewhat complicated data. For whatever reason she only wants to propagate every fourth image, and eliminates all images tagged with "bad".
2. Experimenter spends a lot of time creating a really precise filter that gets just exactly what she wants.
3. Experimenter saves the filter and goes home to bed, exhausted.
4. Experimenter comes in the next day and wants to pick up where she left off. She pulls the filter from a database and works more on it.

danielballan commented 6 years ago

Good use case. I have been that "experimenter" myself! Before we jump to a solution, I'd be interested to understand the problem a little better.

Is the filter generated by some GUI, or programmatically?

What problem is stashing the filter in a database intended to solve? For the sake of argument, could it be stashed locally either by the GUI (~/.config/my_awesome_gui/filters/my_awesome_filter.json) or directly by the user as a Python script (~/my_awesome_filter.py)?
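A minimal sketch of that local-stash option: persist the filter/query as JSON so it survives between sessions. The path mirrors the example above and is purely illustrative.

```python
# Sketch only: save/load a query dict as JSON; the path is illustrative.
import json
from pathlib import Path

FILTER_PATH = Path.home() / '.config' / 'my_awesome_gui' / 'filters' / 'my_awesome_filter.json'

def save_filter(query, path=FILTER_PATH):
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(query, indent=2))

def load_filter(path=FILTER_PATH):
    return json.loads(path.read_text())

# save_filter({'plan_name': 'count', 'tags': {'$ne': 'bad'}})
```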

tacaswell commented 6 years ago

I gravely misunderstood what @CJ-Wright was asking initially :sheep: .

@danielballan's suggestion of starting with a JSON or Python representation seems like a good place to start. This also suggests that alongside the broker we need a first-class Query object.

Simon's UC also greatly blurs the lines of how we think about the things coming out of the data broker (the difference between selecting which Start documents are of interest and slicing into the data of a single run).

CJ-Wright commented 6 years ago

:+1: on the first-class Query object

danielballan commented 6 years ago

Just to be sure we're on the same page -- would a Query encapsulate a query over Headers -- i.e. a dict -- and a query for (a subset of) Events in those Header(s)?

tacaswell commented 6 years ago

Once you get to the second, it gets really messy really fast, as you will want to put conditionals on the slicing based on what is in the header or the data.

I think to start with we should have a Query object that is just a heavily managed dict that provides & and | operations and knows how to turn itself into some nice serialization (JSON?) and into the input that the Broker needs to search the HeaderSource objects. This may be a place where we can provide a shim to make mongo and sql look the same (to the user) by giving it _mongo_query_, _sql_query_, _elastic_query_, _graphql_query_, ... methods.
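A rough sketch of what that could look like (no such class exists in databroker today): a managed dict with & and |, JSON serialization, and a hook for backend-specific translations.

```python
# Hypothetical Query object as described above.
import json

class Query:
    def __init__(self, spec):
        self.spec = dict(spec)

    def __and__(self, other):
        return Query({'$and': [self.spec, other.spec]})

    def __or__(self, other):
        return Query({'$or': [self.spec, other.spec]})

    def to_json(self):
        return json.dumps(self.spec)

    def _mongo_query_(self):
        # Mongo accepts this form directly; _sql_query_, _elastic_query_, etc.
        # would translate it for other HeaderSource backends.
        return self.spec

# q = Query({'plan_name': 'count'}) & Query({'sample': 'kapton'})
```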

We should also see if we can just use GraphQL to solve this problem. I suspect it will only solve most of it: GraphQL works because you can specify all of your objects' schemas well, which we cannot, in general, do for Start documents (yet).

CJ-Wright commented 6 years ago

I think that part of this line blurring can be helped by combining Query objects and pipelines. If the Query object is handed to the pipeline in some way that the pipeline could understand, then we could write down somewhere exactly how you asked for the multi-header results. At this point the data could be bundled together into one "header" which contains metadata from all the incoming headers and interleaves the events (maybe by event timestamp?). This single header would pass into the pipeline to be filtered as needed (max image intensity, sequence number mod 4?). By storing the provenance of the pipeline we can reproduce the results of this header- and event-level filter (we can get the headers from the query, combine them in the same way, and run them through the exact same pipeline).
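For concreteness, one way the timestamp interleaving could look (a sketch only; `db.get_events` is the databroker v0.x API, and events within each header are assumed to be in chronological order):

```python
# Sketch: interleave events from several headers in global time order.
import heapq

def interleaved_events(db, headers):
    streams = (db.get_events(h, fill=False) for h in headers)
    yield from heapq.merge(*streams, key=lambda ev: ev['time'])
```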

sbillinge commented 6 years ago

So a few comments here, in somewhat random order.

1. On the local vs. database stashing of the info: I am interested in extending the event model and pipelining to data analysis for provenance purposes. So I would like to create some kind of analysis tree, get it to some point where I do a deliberate export (🥇), then have someone else come along and get the DE from its DOI, and then rerun it and work on it. That is a big lift obviously, but let's at least move in that direction.
2. I think the bundling of data (or data streams) contains a lot of the complexity of any data analysis. It is the kind of munging step that often takes the majority of the time. To put that time in and then not have it as part of the provenance is counter-productive.
3. We are just starting to "play" with these concepts, and don't know the best way either to do them or to capture them, so this will evolve over time. From that point of view maybe it isn't the best to PR it into databroker, but CJ has his reasons. However, conceptually, the munging and filtering could be thought of as legitimate data analysis steps on the graph, not so different from "subtracting background" or something.

Not sure if these comments help but....

sbillinge commented 6 years ago

btw, I am spending a lot of time these days working with records databases for my group activities: grants, people, appointments, etc.

Something that I am enjoying is making simple filter functions and then chaining them. For example, I want all my grant applications that are (current AND funded), so I do this by:

    grdb = load_up("grdb")
    use1 = is_current(grdb)
    use = is_key_has_value(use1, "application_status", "approved")

This sequence would give AND. For OR I would give grdb as the argument in each case, and then I would chain the results.

For me this is proving to be a powerful and intuitive way to do the munging. Not sure if it scales performance-wise and so on, so building complicated mongo queries and then applying them in one go may end up being better, but for now, this is working for me.

tacaswell commented 6 years ago

@sbillinge you should have a look at https://toolz.readthedocs.io/en/latest/api.html which provides functional (streaming when possible) primitives for many of those operations along with https://docs.python.org/3.6/library/itertools.html and the built in filter https://docs.python.org/3.6/library/functions.html#filter
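Roughly what that chained-filter pattern can look like with the built-in filter and toolz. The records, field names, and helpers below are hypothetical, recast as per-record predicates rather than whole-collection functions.

```python
# Hypothetical records and helpers, following the grant-application example.
from toolz import curry

grdb = [
    {'application_status': 'approved', 'period': 'current'},
    {'application_status': 'pending', 'period': 'expired'},
]

def is_current(record):
    return record.get('period') == 'current'

@curry
def is_key_has_value(key, value, record):
    return record.get(key) == value

# AND: nest the filters.
use = list(filter(is_key_has_value('application_status', 'approved'),
                  filter(is_current, grdb)))

# OR: filter the original collection with each predicate and chain the results.
# from itertools import chain
# use = list(chain(filter(is_current, grdb),
#                  filter(is_key_has_value('application_status', 'approved'), grdb)))
```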

tacaswell commented 6 years ago

I am not sure we want to hand 'query'-like things into pipelines, as then the pipeline code would have to know what to hand that query to!

From the PoV of provenance, why does the query used to find the headers matter?

> I think the bundling of data (or data-streams) contains a lot of the complexity of any data analysis. It is the kind of munging step and often takes the majority of the time. To put that time in and then not have it as part of the provenance is counter-productive.

One of the goals of the Document model is to do as much of the 'correct' bundling at collection time and avoid having to do large amounts of multi-start merging operations at the top of analysis pipelines or single start 'forking' (particularly based on sequence number!). Can you push any of this information back into xpdacq and how it manages creating start documents / streams?

I think everyone wants (1), and I definitely agree on (3).

danielballan commented 5 years ago

The Catalog of search results returned by intake knows its own query, which I think addresses the original prompt here.
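For reference, a hedged sketch of the intake-based interface being referred to (databroker 1.x-style; the catalog name is a placeholder, and how the stored query is exposed may differ by version):

```python
# Sketch of intake-based searching in databroker 1.x; 'my_catalog' is a
# placeholder, and attribute names for the stored query may vary by version.
from databroker import catalog

db = catalog['my_catalog']
results = db.search({'plan_name': 'count'})      # a Catalog of matching runs
refined = results.search({'sample': 'kapton'})   # searches can be composed
```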