MetaCell / nwb-explorer

NWB Explorer is a web application to visualise and analyse the content of NWB:N 2 files.

Add computed metrics to complement visual inspection #169

Open · brownritt opened this issue 4 years ago

brownritt commented 4 years ago

Quick follow-up to make official the feature request chat we just had at the NWB Hackathon (it got pretty long once I started typing it out, sorry...).

A possibly useful feature would be to automatically compute some set of validation metrics to help with data cleaning (probably controlled by a checkbox, to avoid heavy computation when users don't want it). Especially for large channel-count or long-duration projects, an enormous amount of effort goes into manual validation/cleaning, for which visual inspection of raw data is essential but also limited. Assistance from automated summaries could make the job much less tedious.

The example I gave in the chat was detecting bad channels in multielectrode recordings (e.g. checking for a large number of samples saturated at the data range min/max, or a large number of two-sample jumps beyond some threshold); a rough sketch follows below. Some acquisition systems flag bad channels in the data file, so those flags could be displayed in a table when available. Another useful number to summarize across many channels is simply the variance, to look for outliers. The idea would be a dashboard that compactly summarizes such numbers/boolean flags across all channels, from which users can drill down to raw displays to investigate further.
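As a rough illustration of the kind of checks meant here (purely a sketch: `data` is assumed to be a channels × samples NumPy array, and the thresholds are arbitrary):

```python
import numpy as np

def channel_quality_metrics(data, data_max, sat_frac=0.01, jump_frac=1e-3):
    """Simple per-channel quality metrics for a (n_channels, n_samples) array."""
    # Fraction of samples pinned near the data range extreme
    saturated = np.mean(np.isclose(np.abs(data), data_max), axis=1) > sat_frac
    # Fraction of two-sample jumps beyond 1% of the data range
    jumps = np.mean(np.abs(np.diff(data, axis=1)) > 0.01 * data_max, axis=1) > jump_frac
    # Per-channel spread, for spotting outlier channels
    spread = np.std(data, axis=1)
    return saturated, jumps, spread
```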

I'll have to think a bit more about a good list of desired checks for a trial implementation, but hopefully there's some value in just starting the conversation. My first thought would be a simple scripting text box, where users can type/paste what they want checked and displayed, since it would be very difficult to devise and maintain a top-down list of what different labs with different data streams would actually want and use. Something like (in totally made-up pseudocode):

Saturated(bool): prob( isclose( abs(data), DATA_MAX ) ) > 0.01
Jumps(bool): prob( absdiff(data) > 0.01*DATA_MAX ) > 1e-3
Spread(float): std(data)

where each line is a separate test; NWBExplorer would know to loop through the checks for each channel and display them in some sensible compact fashion (a table of colored blocks for True/False across all channels; a table of numbers plus a colormap for continuous quantities, maybe with a graphical flag for outliers). The first symbol is an arbitrary user-specified name, with the data type given in parentheses, and the RHS is the quantity to be computed. Everything is implicitly an array over channels.

The scripting would recognize typical math operations and functions like absolute value. Additionally, operations like `prob` would compute the fraction of samples (in a particular channel) for which their argument is true. An extra flourish would be creating some named constants, either read from NWB metadata when available or automatically computed from the channels; for example, `DATA_MAX` as the largest possible or observed sample value.
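One way the text box could work (a minimal sketch, not a proposed design: `prob` and `absdiff` follow the pseudocode above, and the `eval` here is only a stand-in for a real parser) is to evaluate each RHS in a restricted namespace whose functions vectorise over channels:

```python
import numpy as np

def run_checks(script, data, data_max):
    """Evaluate 'Name(type): expression' lines over a (n_channels, n_samples) array."""
    env = {
        "data": data,
        "DATA_MAX": data_max,
        "abs": np.abs,
        "isclose": np.isclose,
        "std": lambda x: np.std(x, axis=-1),
        "absdiff": lambda x: np.abs(np.diff(x, axis=-1)),
        # prob: fraction of samples (per channel) for which the condition holds
        "prob": lambda cond: np.mean(cond, axis=-1),
    }
    results = {}
    for line in script.strip().splitlines():
        if not line.strip():
            continue
        header, expr = line.split(":", 1)
        name = header.split("(")[0].strip()
        # NOTE: eval of user input is unsafe; a real implementation would use
        # a proper expression parser (e.g. ast-based) with a whitelist
        results[name] = eval(expr, {"__builtins__": {}}, env)
    return results
```

With the three example checks above, `results["Saturated"]` and `results["Jumps"]` come out as boolean arrays over channels (ready to render as colored blocks), and `results["Spread"]` as a float array.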

Again, I think the value added would be computed metrics to complement visual inspection, in order to lower the coding burden and tedium of initial validation of data sets. It might be most useful when working through many files (across animals, sessions, etc.), each of which has large channel-count data.

A lower-priority todo could be wrapping the scripting in GUI elements for people who want to avoid all coding, and/or enabling some default scripts (which people could view/edit only if desired). As a simple starting point, I suggest just a text box parsed as above.

Another secondary todo could be enabling download of a machine-readable report to be used in later data analysis (e.g. an array or pandas DataFrame of the channel-wise computed checks). This would allow someone to switch to cleaning in their pipeline without having to retype or manually copy lots of information from the visualization.
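Assuming the hypothetical `run_checks` sketch above, that could be as simple as:

```python
import pandas as pd

results = run_checks(script, data, data_max)  # dict of per-channel arrays
report = pd.DataFrame(results)                # one row per channel, one column per check
report.index.name = "channel"
report.to_csv("validation_report.csv")        # machine-readable download
```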

A limitation is that the above is channel-wise and wouldn't help to detect, say, a minute of bad motion artifact at some time within a half-hour session. Off the top of my head, I'm not sure how complicated it would be to add time as well as channel semantics.
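One cheap approximation (again only a sketch, with an arbitrary window size) would be to run the same per-channel statistics in non-overlapping time windows, yielding a channels × windows grid rather than one value per channel:

```python
import numpy as np

def windowed_std(data, win=30_000):  # e.g. 30 s of samples at 1 kHz; win is arbitrary
    """Per-channel std in non-overlapping windows -> (n_channels, n_windows)."""
    n_ch, n_samp = data.shape
    n_win = n_samp // win
    trimmed = data[:, : n_win * win].reshape(n_ch, n_win, win)
    return trimmed.std(axis=-1)  # an outlier column suggests a bad epoch
```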

Similarly, it could be useful, but perhaps too complicated, to implement checks across channels (e.g. to detect common-mode artifacts in EEG).
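For what it's worth, a crude cross-channel check (once more just a sketch) could correlate each channel with the across-channel mean:

```python
import numpy as np

def common_mode_correlation(data):
    """Pearson correlation of each channel with the across-channel mean signal."""
    common = data.mean(axis=0)                   # common-mode signal
    d = data - data.mean(axis=1, keepdims=True)  # demean each channel
    c = common - common.mean()
    return (d @ c) / (np.linalg.norm(d, axis=1) * np.linalg.norm(c))
```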

tarelli commented 4 years ago

Thank you @brownritt!

pgleeson commented 4 years ago

Thanks for the suggestion @brownritt. The first practical issue is that NWBExplorer is currently effectively read-only. A first step is the enhancement in #170, which would allow editing the NWB file and downloading a new file with the updates. Some scripting would probably still be required, but it would provide one way for the interface to be used for data checking/validation.

brownritt commented 4 years ago

Hmm. Your comment raises two issues, I think: adding editing capabilities to OSB (I like the cleanup-before-release use case from the other issue), and possibly implementing the scripting above as an NWB extension rather than something specific to OSB.

That is, make a new "validation" data type, building on the scripting description above, that can be added to any NWB file. Then all OSB would need to do is parse/render it. If other applications also want to add rendering or analysis capabilities (in widgets, or in a pipeline to do cleaning), they can read from the same script data. This removes the sequential dependence between the validation-rendering and editing todos for OSB enhancements (both are still good, but could be done on any timeline), and could also contribute a more general benefit to NWB.
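To sketch what storing such results in the file itself might look like (using pynwb's scratch space and an hdmf DynamicTable as a stand-in for a proper extension type; `results` is the hypothetical dict of per-channel arrays from the earlier sketches):

```python
from hdmf.common import DynamicTable, VectorData
from pynwb import NWBHDF5IO

# One column per named check, one row per channel (assumed shapes)
columns = [VectorData(name=k, description="user-defined check", data=list(v))
           for k, v in results.items()]
table = DynamicTable(name="validation_checks",
                     description="channel-wise data-quality checks",
                     columns=columns)

with NWBHDF5IO("session.nwb", mode="a") as io:
    nwbfile = io.read()
    nwbfile.add_scratch(table)  # scratch space stands in for a real extension
    io.write(nwbfile)
```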