European-XFEL / EXtra-data

Access saved EuXFEL data
https://extra-data.rtfd.io
BSD 3-Clause "New" or "Revised" License
7 stars 14 forks source link

Add warning when select with require_all discards most trains #497

Closed takluyver closed 8 months ago

takluyver commented 9 months ago

We sometimes see an instrument source which is in the run but has no data for any train. If you run.select(..., require_all=True) including such a source, you get a valid selection with 0 trains, which is surprising, and you then have to work out which source cause it.

With this change, the selection will come with a warning like:

480/480 trains lost when filtering by FXE_XAD_GEC/CAM/CAMERA_NODATA:daqOutput

I've currently made this appear when filtering by a single source drops more than half the trains we had just before. IDK if that's right, e.g. with HED-style experiments where only a very few pulses are important. Another option is to show the warning only if we're deselecting all trains.

Naturally we could also make it configurable, but I hope we can have a default that works for 99% of use cases without config.

tmichela commented 8 months ago

with HED-style experiments where only a very few pulses are important.

I think that should be fine with HED data, as currently we get the train ID of interest and filter the run based on that: run.select_trains(by_id[[tid]]), eventually in future something like: run.select_trains(PPU).

That said, I think I'd favor warning only if all trains a dropped by default and make it configurable if we want a higher threshold.

takluyver commented 8 months ago

OK, the warning will only show by default if all trains are dropped. Configuring it looks like: .select(..., require_all=True, warn_drop_trains_frac=0.9).

The fraction is out of the trains left just before selecting each source, not from the original number. So if you select a lot of sources with different trains missing, many trains could be dropped without any one source triggering the warning. There's potential for confusion there, but I don't think it's too likely to come up, and I think there's potential for confusion with any option. .plot_missing_data() is a better way to get an overview of what data is missing.

takluyver commented 8 months ago

(Got LGTM in chat)