Variable Selection for Spectral Analysis

realrk95 commented 4 years ago

Widget Info

It detects features (attributes) that are relevant to the target (class variables) by running PLS regression, iterating over the number of components up to a maximum defined by the user. It recreates the domain with the optimized data and uses Table.from_table to create a new table.
It automatically detects the primary molecular functional groups occurring at each attribute (if the attributes are ascending or descending ints/floats of wavelengths/wavenumbers; if the attributes are strings, functional group detection is disabled). I have a list of functional groups in NIR, commonly referred to as the NIR overtone chart, which I'm digitizing, so if the attributes 'selected' by the widget fall in a particular range, say 1400-1500 nm, it will show the primary functional groups present there, i.e. water peaks (H2O).
The 'selected' features are then scanned in ranges and the functional groups should be shown to the users (pending implementation). It can be shown as text or on top of the graph where the data is highlighted.
This removes attributes which make no sense to the model (noise, other peaks), so the user or the model can focus on the most relevant parts which contribute towards the target variable values.
Further implementations for Raman and IR libraries can be made if their overtone or peak charts are made available. I have some information, but need more.
Outputs the optimum number of components or parameters to achieve the lowest MSE for the relevant model.
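
The functional group lookup described above could look roughly like this. It is only a sketch: the band table holds just the two ranges mentioned in this thread (a digitized overtone chart would have many more), and the names `NIR_BANDS` and `functional_groups` are made up for illustration.

```python
# Illustrative band table: (start_nm, end_nm, label).
# Only the two ranges mentioned in this thread are listed; this is
# NOT a complete NIR overtone chart.
NIR_BANDS = [
    (1000.0, 1050.0, "NH2 (third overtone)"),
    (1400.0, 1500.0, "H2O (water)"),
]


def functional_groups(selected_attributes, bands=NIR_BANDS):
    """Return sorted labels of the bands containing any selected wavelength.

    Returns None when the attribute names are not numeric, mirroring the
    "detection disabled for string attributes" behaviour described above.
    """
    try:
        wavelengths = [float(name) for name in selected_attributes]
    except (TypeError, ValueError):
        return None
    found = set()
    for w in wavelengths:
        for start, end, label in bands:
            if start <= w <= end:
                found.add(label)
    return sorted(found)
```

For example, `functional_groups(["1410", "1450.5", "1700"])` reports only the water band, while non-numeric attribute names return `None`.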

Questions I have:

Plot problem: I have two Orange tables, self.data and self.new_data. In the widget, I want to display self.data and highlight/mark the attributes that are also in self.new_data (which contains some or all of the attributes in self.data). Is there a widget which does this?

I was thinking of repurposing the OWSpectra widget for this: initialize the owspectra.py widget inside the mainArea of this widget and pass self.data to it. Is this a good idea? I don't want to write the entire code for plotting the lines, making them interactive, etc. Do you think we can add a function in owspectra.py so that it takes a modified data table derived from the original and highlights the regions from the modified table?

Do you think this widget would be valuable to this community? I personally found myself matching overtones with certain regions and trying to cut areas with noise to improve MSE. In my wheat model, R2 went from 0.91 to 0.96 after running the optimization and feeding only the selected wavelengths to the regression.
Changes can be made so that classification as well as regression are supported for this widget. Instead of using PLS in the optimizer we can use neural networks to construct another optimizer for classification. Is this of value?

You can find an image attached of what I'm working on. This is what I am expecting to do in the end: highlight bars of data to show the relevant features and the corresponding functional groups in the plot itself. Sadly, pyqtgraph does not have an inbuilt function for this and I will have to build one from their Graphics library.

PS: I really wish plotly was supported in Orange. A lot of my previous work was done on plotly and matplotlib.

Screenshot (159)

realrk95 commented 4 years ago

@markotoplak @borondics What do you think?

borondics commented 4 years ago

Hi @realrk69,

Nice stuff, I think it is really shaping up.

For display, I think we should either extend the Spectra widget or add a plot area to PLSR based on Spectra extended with the labels functionality. From your image I see that you are probably doing this and I think we do exactly this in Hyperspectra. Then, the only thing remaining to implement is the highlighted area plotting.

@markotoplak which one do you think is the best? I think using Spectra might be more modular as the labeling feature could be used for other purposes, but it will become quite a monster widget...

A general question (@markotoplak, @stuart-cls): should we try to add the functionality of plotting lines in Spectra using individual markers instead of continuous lines? It would be useful in cases like the plotting of components for PLS and PCA too.

As for plotting, yeah, higher-level libraries are a lot easier to work with, but as far as I understand pyqtgraph is a lot faster than any of those.

The widget is very useful for the community and I think having Classification and Regression capability would be great too!

Some other feature requests:

Documentation :) We are pretty bad at this point, so I try to encourage everyone to write some for their widgets.
It would be great to be able to visualize the loadings.

realrk95 commented 4 years ago

@borondics Thanks. I've extended spectra into this widget. The structure of the hyperspectra is informative. I read the pyqtgraph documentation and understood how to implement the boxes with labels on top. I'll try to finish the PR before the end of this week.

I tried embedding a neural network into the widget and it's working well so far for classification. I'll send screenshots. Currently, more than one target variable is not supported.

Other features:

Documentation: I'll give it my best shot. I saw the documentation format of other widgets and will add similar files to this widget.
Visualize loadings: Do you mean the optimized data? In this case, the widget outputs the optimized data table (basically removes the redundant attributes). If this is not what you meant, could you explain with an example?

markotoplak commented 4 years ago

@realrk69, are you aware that seeing the variable contributions can already be done for PLS (and also some other models)?

Screenshot 2020-04-27 at 11 11 42

As the cutoffs are often arbitrary, showing variable weights is more useful. So, from my point of view, the suggested widget is too specific and would make a visualization with less information than the workflow shown.

I do agree that the current workflow is convoluted and that there should be a simpler option.

So, to make that widget useful, it would need to:

Allow inputs of custom models, same as Rank does.
Display variable importances with a curve, not just in/out. So your display should show the data + some other curves from the classifier.

About graphing. This is done by using CurvePlot and adding markings to it. A good example with complex markings (additional curves) is included in EMSC preprocessor.

realrk95 commented 4 years ago

@markotoplak I understand, and honestly had no idea this visualization could be done through Rank. This is good. But if variable importances are already being plotted through the Rank widget, is it needed in this one?

I wanted to focus the widget on the variable selection of data since the objective of it is outputting in/out data and the optimum settings for the selected model to achieve the lowest MSE for the given data. I was thinking the visualization of features could be of additional value and would help in functional group detection and labeling.

Example: Brix (sugar) in fruits has certain regions where it shows the most relevance since fruit sugars are primarily composed of glucose, fructose and sucrose (all containing C-H and O-H bonds). Selecting only these wavelengths/wavenumbers, after the relevant pre-processing was done, improved the MSE of the PLS model from 1.5 to 0.7. It also recommended 7 as the optimum number of components.

I can modify the selection to work with some other scorers (linear regression, logistic regression, SGD, random forest and PLS all support score output). Come to think of it, is there a widget to display R2 and MSE with an increasing number of components or different settings? Kind of an automated iteration over all combinations of settings.
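
To make the "automated iteration" idea concrete, here is a minimal sketch of such a sweep. The model is hidden behind a hypothetical `fit_predict(**settings)` callable that returns predictions for the same samples as `y_true` (ideally cross-validated ones); `sweep`, `mse` and `r2` are illustrative names, not existing Orange functions.

```python
from itertools import product


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination (assumes y_true is not constant)."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def sweep(settings_grid, fit_predict, y_true):
    """Evaluate every combination of settings.

    settings_grid: dict mapping a setting name to a list of values.
    fit_predict:   hypothetical callable(**settings) -> predictions.
    Returns (settings, mse, r2) tuples sorted by ascending MSE.
    """
    names = list(settings_grid)
    rows = []
    for values in product(*(settings_grid[n] for n in names)):
        settings = dict(zip(names, values))
        y_pred = fit_predict(**settings)
        rows.append((settings, mse(y_true, y_pred), r2(y_true, y_pred)))
    return sorted(rows, key=lambda row: row[1])
```

The first row of the result is the setting with the lowest MSE; the full list could feed a plot of MSE and R2 against the number of components.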

Let me know if it is a good idea to continue working on this widget or if this should be extended inside PLSR as @borondics recommended. I don't want to develop something which no one will use.

realrk95 commented 4 years ago

I'm done with the first iteration. This works perfectly for PLSR. Taking your suggestions into account:

@markotoplak

Added visualization for variable importances
Will start working to add compatibility of other models (like in Rank)

@borondics

Building documentation side by side
I guess these were the visualizations you were talking about.

I used LinearRegionItem for this widget. If you compare the green area (selected) vs the red and refer to the overtone chart, you'll get mostly NH2 bonds at those wavelengths (1000-1050 nm shows the strongest regression, and is the main band for NH2 in the third overtone, occurring with minimal overlaps)
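
For the highlighting itself, the selected wavelengths first have to be turned into contiguous ranges. A rough sketch of that step, assuming the wavelengths are sorted and `selected` is a boolean mask of the same length (the function name is made up for illustration):

```python
def selected_regions(wavelengths, selected):
    """Group runs of consecutive selected wavelengths into (start, end) pairs."""
    regions = []
    start = prev = None
    for w, keep in zip(wavelengths, selected):
        if keep:
            if start is None:
                start = w  # a new run begins here
            prev = w
        elif start is not None:
            regions.append((start, prev))  # the run ended at the previous point
            start = None
    if start is not None:
        regions.append((start, prev))  # close a run that reaches the last point
    return regions
```

Each pair could then be drawn with something like `pg.LinearRegionItem(values=(start, end), movable=False)`, one item per region.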

I'll create an initial PR or add it to my Github for your review, whichever's comfy with you guys. Pyqtgraph has a steep learning curve, but it is much faster than others I've used.

Screenshot: Screenshot (160)

realrk95 commented 4 years ago

@markotoplak @borondics Any comments?

realrk95 commented 4 years ago

Guys, please let me know if I should continue working on this. As is written in the contributing guidelines, I don't want to contribute something which won't be merged/accepted because of its pointlessness. Awaiting your reply

markotoplak commented 4 years ago

@realrk69, this makes sense and looks very informative. Much better for spectral data than OWRank.

How do you set the number of features/threshold? The graph below could also have a horizontal line showing the cut (which perhaps the user could move).

realrk95 commented 4 years ago

Okay, sounds good. Will create PR soon.

The optimization process currently:

Loop over the number of components, up to the maximum given by the user
Sort the indices in the spectra in ascending order of PLS coefficient
Remove wavelengths one by one, calculating the MSE at each step
Output the number of components that achieves the minimum MSE
Use the point of lowest achieved MSE to draw the cutoff line; all wavelengths whose absolute PLS coefficient falls below it are removed.
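
The steps above can be sketched as follows. This is not the widget's actual code: the PLS fit and the scoring are hidden behind two hypothetical callables, `fit_coefs(n_components)` returning one coefficient per wavelength and `score_mse(n_components, kept)` returning the MSE when only the feature indices in `kept` are used. The sketch also assumes that sorting is done on absolute coefficient values and that at least `n_components` features are always kept.

```python
def optimize_selection(n_features, max_components, fit_coefs, score_mse):
    """Backward elimination over features for each number of components.

    Returns (best_mse, n_components, kept_indices, cutoff), where cutoff is
    the smallest absolute coefficient among the kept features, or None if
    no combination could be evaluated.
    """
    best = None
    for n_components in range(1, max_components + 1):
        coefs = fit_coefs(n_components)
        # Feature indices ordered from least to most important.
        order = sorted(range(n_features), key=lambda i: abs(coefs[i]))
        # Drop the n_removed least important features; keep at least
        # n_components of them (assumption, see the text above).
        for n_removed in range(0, n_features - n_components + 1):
            kept = sorted(order[n_removed:])
            mse = score_mse(n_components, kept)
            if best is None or mse < best[0]:
                cutoff = abs(coefs[order[n_removed]])
                best = (mse, n_components, kept, cutoff)
    return best
```

The returned cutoff is the height at which the horizontal line would be drawn on the coefficient plot.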

Addition: You're right, with a horizontal line movable by the user, they could set the coefficient limit wherever they want. Not sure how to implement this though; I will have to do some reading. Also, the optimum number of components would then change too.
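
Whatever the line ends up being in the GUI, the selection behind it is small. A sketch of that part only, with an illustrative function name and assuming the cutoff applies to absolute coefficient values:

```python
def select_by_cutoff(wavelengths, coefs, cutoff):
    """Keep the wavelengths whose absolute coefficient is at or above cutoff."""
    return [w for w, c in zip(wavelengths, coefs) if abs(c) >= cutoff]
```

In pyqtgraph the line could probably be an `InfiniteLine(angle=0, movable=True)` whose `sigPositionChanged` signal triggers a call like the one above with `line.value()` as the cutoff, but I have not tried that yet.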

markotoplak commented 4 years ago

To see how to do movable lines, try moving the integral limit lines (the red lines at the edge of the spectra below) in OWHyper.

I would actually keep optimization/algorithm and this display separate. This widget, I think, should only do display.

For example, PLS should have an option to optimize the number of components. Your widget should input a Learner. Then you will be able to connect PLS's Learner output to this widget. If this widget takes a Learner in the same way as OWRank does, then it will be general and also usable with linear methods and Random Forests.

Please see what OWRank does. Try to repeat the workflow I once shared. I think your widget should only do a better display of importances for spectral data.

Tell me if this makes sense to you.

realrk95 commented 4 years ago

Just to be clear, you're recommending:

PLS widget should have option to optimize the number of components
Variable Selection should input a model and output the plot and optimized data

This separation can be done. It's now working with Linear, Logistic, and PLSR. But not with others. I'm trying to add compatibility, but it will take some time.

markotoplak commented 4 years ago

@realrk69, yes. Even more precisely, I am recommending that Variable Selection should input an Orange Learner (something that fits models, not a fitted model).

Well, you cannot make it work with everything, because this functionality is method-dependent. I would go for making it compatible with descendants of LearnerScorer and only using its interface to obtain scores. Apart from the ones you listed, at least Random Forests is also a LearnerScorer. If done this way, your widget will have no model-specific code, which will make it very flexible.
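
A rough illustration of what "no model-specific code" means, as a sketch only. `FeatureScorer` below is a hypothetical stand-in, not Orange's actual class: it assumes the learner exposes a `score_data(data)` method returning one score per feature, while the real interface may return a 2-D array (for example, one row per class).

```python
from typing import Protocol, Sequence


class FeatureScorer(Protocol):
    """Hypothetical stand-in for the scoring interface of a learner."""

    def score_data(self, data) -> Sequence[float]:
        ...


def feature_importances(learner: FeatureScorer, data, feature_names):
    """Map each feature name to its absolute score.

    Only the scoring interface is used, so any learner that provides it
    works here without model-specific branches.
    """
    scores = learner.score_data(data)
    if len(scores) != len(feature_names):
        raise ValueError("expected one score per feature")
    return dict(zip(feature_names, (abs(s) for s in scores)))
```

The widget would then plot these importances against the wavelengths, regardless of which learner is connected.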

Quasars / orange-spectroscopy

Variable Selection for Spectral Analysis #420