Closed takluyver closed 8 months ago
with HED-style experiments where only a very few pulses are important.
I think that should be fine with HED data, as currently we get the train ID of interest and filter the run based on that: run.select_trains(by_id[[tid]])
, eventually in future something like: run.select_trains(PPU)
.
That said, I think I'd favor warning only if all trains a dropped by default and make it configurable if we want a higher threshold.
OK, the warning will only show by default if all trains are dropped. Configuring it looks like: .select(..., require_all=True, warn_drop_trains_frac=0.9)
.
The fraction is out of the trains left just before selecting each source, not from the original number. So if you select a lot of sources with different trains missing, many trains could be dropped without any one source triggering the warning. There's potential for confusion there, but I don't think it's too likely to come up, and I think there's potential for confusion with any option. .plot_missing_data()
is a better way to get an overview of what data is missing.
(Got LGTM in chat)
We sometimes see an instrument source which is in the run but has no data for any train. If you
run.select(..., require_all=True)
including such a source, you get a valid selection with 0 trains, which is surprising, and you then have to work out which source cause it.With this change, the selection will come with a warning like:
I've currently made this appear when filtering by a single source drops more than half the trains we had just before. IDK if that's right, e.g. with HED-style experiments where only a very few pulses are important. Another option is to show the warning only if we're deselecting all trains.
Naturally we could also make it configurable, but I hope we can have a default that works for 99% of use cases without config.