bluesky / databroker

Unified API pulling data from multiple sources
https://blueskyproject.io/databroker
BSD 3-Clause "New" or "Revised" License

Support dereferencing in start documents #371

Open CJ-Wright opened 6 years ago

CJ-Wright commented 6 years ago

Would it be possible to support dereferencing external data in start documents?

The idea here would be to have external databases (e.g. a sample database) which would hand an ID and some reference to themselves to the databroker, and upon request the databroker could replace the ID with the actual data from the database.

This would be very helpful for at least two use cases:

  1. Adding information from a sample database
  2. Showing raw start document data in analyzed data headers (for ease of reading and interpretation by users)
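The request above can be sketched roughly as follows. This is only an illustration under stated assumptions: the reference shape (a dict with exactly the keys `source` and `id`), the `SOURCES` registry, and the `dereference` function are all hypothetical and not part of databroker.

```python
# Hypothetical sketch: none of these names are part of databroker.
# A stand-in for an external sample database.
SAMPLE_DB = {"abc123": {"name": "Ni powder", "composition": "Ni"}}

# Registry mapping a source name to a lookup callable (assumed design).
SOURCES = {"sample_db": SAMPLE_DB.__getitem__}


def dereference(doc, sources):
    """Return a copy of ``doc`` with reference dicts replaced by fetched data."""
    out = {}
    for key, value in doc.items():
        # A "reference" is assumed to be a dict with exactly these two keys.
        if isinstance(value, dict) and set(value) == {"source", "id"}:
            out[key] = sources[value["source"]](value["id"])
        else:
            out[key] = value
    return out


# The start document carries the ID plus a hint about where it came from.
start = {"uid": "run-1", "sample": {"source": "sample_db", "id": "abc123"}}
resolved = dereference(start, SOURCES)
print(resolved["sample"]["name"])
```

Because the reference names its source, the stored document stays interpretable even when the external database is not reachable.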

stuartcampbell commented 6 years ago

Does ext do what you want? http://nsls-ii.github.io/databroker/whats_new.html#enhancements

danielballan commented 6 years ago

Fortunately, databroker already has all the pieces you need to give this a try. Use external_fetchers (see the release notes -- we have intentionally not documented it further in case we need to rip it out). Then write a function that takes in a Header and does the dereferencing.

danielballan commented 6 years ago

Due to a stale browser tab, I didn't see @stuartcampbell's comment until mine posted. We are on the same page. :-)

CJ-Wright commented 6 years ago

How do I get that dereferencing when data is going into a callback from the run engine?

Edit1: Especially since we might not have a run stop document for a while.

Edit2: Having looked at the source code I'm not certain this does what I need.

  1. The data is reported external to the document model. Everything I (we?) have built so far relies on the document model, so going outside the document model for this seems counterintuitive.
  2. Taking in both a start and a stop requires the run to be finished before any external references can be fetched, which causes problems with live running analysis.
  3. There are no links back to the source of the data in the documents themselves (or even hints of where that data might come from) which seems to allow for the writing of externally managed keys and then forgetting where they came from or how to talk to them. Essentially parts of the documents become unintelligible to anyone else (including me from the future :man_astronaut: ).
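For the live-analysis concern, one possible approach is to resolve references as soon as the start document arrives, inside the document stream itself. The sketch below assumes callbacks with the usual `(name, doc)` signature; the wrapper `dereferencing_callback`, the `sample_id` key, and the `_resolved` suffix are hypothetical, and this is not how `ext` works.

```python
# Hypothetical sketch: a wrapper for callbacks with the (name, doc) signature.
def dereferencing_callback(callback, lookup, key="sample_id"):
    """Wrap ``callback`` so 'start' documents arrive with ``key`` resolved.

    ``lookup`` is any callable mapping an ID to external data (assumed).
    The original ID is kept under ``key`` so the source stays traceable.
    """
    def wrapped(name, doc):
        if name == "start" and key in doc:
            doc = dict(doc)  # shallow copy; do not mutate the emitted document
            doc[key + "_resolved"] = lookup(doc[key])
        return callback(name, doc)
    return wrapped


received = []
samples = {"abc123": {"name": "Ni powder"}}  # stand-in sample database
cb = dereferencing_callback(
    lambda name, doc: received.append((name, doc)), samples.get
)

# No stop document is needed before the reference is resolved.
cb("start", {"uid": "run-1", "sample_id": "abc123"})
cb("event", {"seq_num": 1, "data": {"det": 5}})
```

Downstream consumers still receive ordinary documents, so nothing leaves the document model; only the start document gains an extra key.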

danielballan commented 6 years ago

The idea here would be to have external databases (e.g. a sample database) which would hand an ID and some reference to themselves to the databroker, and upon request the databroker could replace the ID with the actual data from the database.

Our code solves the second part. You have to solve the first part -- shoving the relevant ID into the start document.

CJ-Wright commented 6 years ago

I think I can provide the ID to the start document without issues.

tacaswell commented 6 years ago

This sounds more like wanting an extra EventSource?

The stop may be None.

One fun use case would be a function that, when you access the raw databroker, goes and gets a list of all the derived datasets!

We could put ext into the start document, but that would require squatting on one more key.

CJ-Wright commented 6 years ago

I don't know if it should be an event source or not, but the rest sounds good.

CJ-Wright commented 6 years ago

Would it be possible to attach an ext factory to the databroker so that every header that came out of that databroker had ext attached? That way we don't need to take every header and attach ext externally.

danielballan commented 6 years ago

I think what you have described is exactly how ext already works. Is it? If not, please clarify.

CJ-Wright commented 6 years ago

I thought ext was defined on a per-header rather than a per-broker basis?

danielballan commented 6 years ago

The "ext factory" is attached to the Broker, and thereafter all Headers returned by that Broker have contents in header.ext. (Try it!)
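The per-broker behavior described in this thread can be illustrated with stand-in classes. This is a sketch of the pattern only, assuming fetchers are callables of `(start, stop)` registered by name; `MiniBroker`, `MiniHeader`, `fetch_sample`, and the `sample_id` key are hypothetical and are not databroker's actual implementation.

```python
from types import SimpleNamespace


# Stand-in classes that only mimic the pattern discussed in this thread;
# they are NOT databroker's real Broker or Header.
class MiniHeader:
    def __init__(self, start, stop, fetchers):
        self.start = start
        self.stop = stop  # may be None while a run is still in progress
        # Each fetcher is called with (start, stop); results land on .ext
        self.ext = SimpleNamespace(
            **{key: fetch(start, stop) for key, fetch in fetchers.items()}
        )


class MiniBroker:
    def __init__(self):
        self.external_fetchers = {}  # name -> callable(start, stop)

    def header(self, start, stop=None):
        return MiniHeader(start, stop, self.external_fetchers)


SAMPLES = {"abc123": {"name": "Ni powder"}}  # hypothetical sample database


def fetch_sample(start, stop):
    # Uses only the start document, so a missing stop is harmless.
    return SAMPLES.get(start.get("sample_id"))


db = MiniBroker()
db.external_fetchers["sample"] = fetch_sample  # registered once, per broker

# Every header produced afterwards carries the fetched data on .ext
h1 = db.header({"uid": "run-1", "sample_id": "abc123"})
h2 = db.header({"uid": "run-2"}, stop={"exit_status": "success"})
print(h1.ext.sample, h2.ext.sample)
```

Registering the fetcher on the broker means no header needs to be decorated by hand, which is the per-broker behavior asked about above.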