bluesky / event-model

data model for event-based data collection and analysis
https://blueskyproject.io/event-model
BSD 3-Clause "New" or "Revised" License
13 stars 30 forks source link

Added InvalidData and MismatchedDataKeys exceptions #152

Closed gwbischof closed 4 years ago

gwbischof commented 4 years ago

Fixes #151

danielballan commented 4 years ago

That's a good question. I think that consumers and readers should not perform any proactive validation. If they hit an unrecoverable error, it's worth doing some extra investigation to provide a specific error message, as this PR does, but I think it is fair to assume that the documents they read are valid unless proven otherwise.

As I mentioned on Slack but will rehash here for the sake of a public record, I think we should validate when the documents are composed (in event_model.compose_run) and maybe when they are persisted (in the suitcases). But any validation work that we do when the documents are accessed is needlessly repeated every time they are accessed, and the notice also comes too late. It's better to fail when the documents are first created and/or persisted because the user is immediately notified about the problem, and you only pay for the performance overhead once.

So, as far as the Filler and databroker are concerned, I think it's OK to have a situation where malformed documents lead to keys silently not being filled. As a separate matter, we should review our validation functions to ensure that they would flag malformed documents at compose/write time, and then we should comb the existing databases for irregularities and decide what to do about the invalid documents that we find.

[Edit: cc @tacaswell for a gut-check here.]

gwbischof commented 4 years ago

The old databroker iterates the event['data'].keys() instead of descriptor['data_keys'].keys() (the way it is done here in event model) https://github.com/bluesky/databroker/blob/master/databroker/headersource/mongo_core.py#L56 Doing it this way could prevent it from failing silently.

tacaswell commented 4 years ago

The gut checks out. Don't fully follow the last paragraph about the Filler though.

tacaswell commented 4 years ago

I think the code would be simpler if we included the keys of the descriptor, the data, filled, (and the timestamps) with the same text rather than doing the book keeping to raise just the rightly worded exception. All we really know is that something that is consistent is not, but we can not really be sure which of the inconsistent things in right (and going with majority voting is not always correct).