Quasars / orange-spectroscopy


Variable Selection for Spectral Analysis #420

Open realrk95 opened 4 years ago

realrk95 commented 4 years ago

Widget Info

Questions I have:

  1. Plot problem: I have two Orange tables, self.data and self.new_data. In the widget, I want to show self.data and highlight/mark the attributes in self.new_data (which contains some or all of the attributes in self.data). Is there a widget that does this?

I was thinking of repurposing the OWSpectra widget for this: initialize the owspectra.py widget inside the mainArea of this widget and pass self.data to it. Is this a good idea? I don't want to rewrite all the code for plotting the lines, making them interactive, etc. Do you think we can add a function to owspectra.py so that it takes a modified data table derived from the original and highlights the regions from the modified table?

  2. Do you think this widget would be valuable to this community? I personally found myself matching overtones with certain regions and trying to cut areas with noise to improve mse. In my wheat model, R2 went from 0.91 to 0.96 after running the optimization and feeding only the selected wavelengths for the regression.

  3. Changes can be made so that classification as well as regression are supported for this widget. Instead of using PLS in the optimizer we can use neural networks to construct another optimizer for classification. Is this of value?

You can find an image attached of what I'm working on. This is what I am expecting to do in the end: highlight bars of data to show the relevant features and the corresponding functional groups in the plot itself. Sadly, pyqtgraph does not have an inbuilt function for this and I will have to build one from their Graphics library.

PS: I really wish plotly was supported in Orange. A lot of my previous work was done on plotly and matplotlib.

Screenshot (159)

realrk95 commented 4 years ago

@markotoplak @borondics What do you think?

borondics commented 4 years ago

Hi @realrk69,

Nice stuff, I think it is really shaping up.

For display, I think we should either extend the Spectra widget or add a plot area to PLSR based on Spectra extended with the labels functionality. From your image I see that you are probably doing this and I think we do exactly this in Hyperspectra. Then, the only thing remaining to implement is the highlighted area plotting.

@markotoplak which one do you think is the best? I think using Spectra might be more modular as the labeling feature could be used for other purposes, but it will become quite a monster widget...

A general question (@markotoplak, @stuart-cls): should we try to add the functionality of plotting lines in Spectra using individual markers instead of continuous lines? It would be useful in cases like the plotting of components for PLS and PCA too.

As for plotting, yeah, higher-level libraries are a lot easier to work with, but as far as I understand pyqtgraph is a lot faster than any of those.

The widget is very useful for the community and I think having Classification and Regression capability would be great too!

Some other feature requests:

realrk95 commented 4 years ago

@borondics Thanks. I've extended Spectra into this widget. The structure of the Hyperspectra widget is informative. I read the pyqtgraph documentation and understood how to implement the boxes with labels on top. I'll try to finish the PR before the end of this week.

I tried embedding a neural network into the widget and it's working well so far for classification. I'll send screenshots. Currently, more than one target variable is not supported.

Other features:

markotoplak commented 4 years ago

@realrk69, are you aware that seeing the variable contributions can already be done for PLS (and also some other models)?

Screenshot 2020-04-27 at 11 11 42

As the cutoffs are often arbitrary, showing variable weights is more useful. So, from my point of view, the suggested widget is too specific and would produce a visualization with less information than the workflow shown.

I do agree that the current workflow is convoluted and that there should be a simpler option.

So, to make that widget useful, it would need to:

About graphing. This is done by using CurvePlot and adding markings to it. A good example with complex markings (additional curves) is included in EMSC preprocessor.

realrk95 commented 4 years ago

@markotoplak I understand, and honestly had no idea this visualization could be done through rank. This is good. But if variable importances are already being plotted through the rank widget, is it needed in this?

I wanted to focus the widget on the variable selection of data since the objective of it is outputting in/out data and the optimum settings for the selected model to achieve the lowest MSE for the given data. I was thinking the visualization of features could be of additional value and would help in functional group detection and labeling.

Example: Brix (sugar) in fruits has certain regions where it shows the most relevance, since fruit sugars are primarily composed of glucose, fructose and sucrose (all containing C-H and O-H bonds). Selecting only these wavelengths/wavenumbers, after the relevant pre-processing was done, improved the MSE of the PLS model from 1.5 to 0.7. It also recommended 7 as the optimum number of components.

I can modify the selection to work with some other scorers (linear regression, logistic regression, SGD, random forest and PLS all support score output). Come to think of it, is there a widget to display R2 and MSE with an increasing number of components or different settings? Kind of an automated iteration over all combinations of settings.

Let me know if it is a good idea to continue working on this widget or if this should be extended inside PLSR as @borondics recommended. I don't want to develop something which no one will use.

realrk95 commented 4 years ago

I'm done with the first iteration. This works perfectly for PLSR. Taking your suggestions into account:

@markotoplak

  1. Added visualization for variable importances
  2. Will start working on compatibility with other models (like in Rank)

@borondics

  1. Building documentation side by side
  2. I guess these were the visualizations you were talking about.

I used LinearRegionItem for this widget. If you compare the green area (selected) vs the red and refer to the overtone chart, you'll get mostly NH2 bonds at those wavelengths (1000-1050 nm shows the strongest regression, and is the main band for NH2 in the third overtone, occurring with minimal overlaps)
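
A minimal sketch of the bookkeeping behind such a highlight, assuming the selection is available as one boolean per wavelength. The helper name `contiguous_regions` and the sample values are hypothetical; each resulting (start, end) pair is the kind of bounds one could hand to pyqtgraph's `LinearRegionItem` as its `values` argument. The pyqtgraph call itself is left out so the example runs without a GUI.

```python
from typing import List, Sequence, Tuple


def contiguous_regions(
    wavelengths: Sequence[float], selected: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Collapse a per-wavelength selection mask into (start, end) bands."""
    regions: List[Tuple[float, float]] = []
    start = None      # first wavelength of the band currently open, if any
    previous = None   # wavelength seen in the previous iteration
    for wavelength, keep in zip(wavelengths, selected):
        if keep and start is None:
            start = wavelength                   # a new band opens here
        elif not keep and start is not None:
            regions.append((start, previous))    # band ended at the last kept point
            start = None
        previous = wavelength
    if start is not None:                        # band runs to the end of the axis
        regions.append((start, previous))
    return regions


# Hypothetical data: seven wavelengths in nm, five of them selected.
wavelengths = [1000, 1010, 1020, 1030, 1040, 1050, 1060]
selected = [True, True, True, False, False, True, True]
bands = contiguous_regions(wavelengths, selected)
print(bands)
```

Drawing one region per band, rather than one per wavelength, keeps the number of graphics items small when the selection covers long stretches of the spectrum.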

I'll create an initial PR or add it to my Github for your review, whichever's comfy with you guys. Pyqtgraph has a steep learning curve, but it is much faster than others I've used.

Screenshot: Screenshot (160)

realrk95 commented 4 years ago

@markotoplak @borondics Any comments?

realrk95 commented 4 years ago

Guys, please let me know if I should continue working on this. As is written in the contributing guidelines, I don't want to contribute something which won't be merged/accepted because of its pointlessness. Awaiting your reply

markotoplak commented 4 years ago

@realrk69, this makes sense and looks very informative. Much better for spectral data than OWRank.

How do you set the number of features/threshold? The graph below could also have a horizontal line showing the cut (which perhaps the user could move).

realrk95 commented 4 years ago

Okay, sounds good. Will create PR soon.

The optimization process currently:

  1. Loop over the max components mentioned by user
  2. Sort indices in spectra in ascending order of pls coef
  3. Remove wavelengths one by one and calculate mse simultaneously
  4. Output number of components to achieve minimum mse
  5. Use the point of lowest MSE to draw the cut line: wavelengths whose absolute PLS coefficient falls below it are removed.
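
The elimination part of these steps (sorting, removing one by one, tracking the minimum) can be sketched as below for a single component count. This is an illustration under assumptions, not the widget's code: `backward_eliminate` and `toy_mse` are hypothetical names, the coefficients are sorted once rather than refitted after each removal, and `mse_for` stands in for refitting PLS on the kept wavelengths and measuring its error.

```python
from typing import Callable, List, Sequence, Tuple


def backward_eliminate(
    coefs: Sequence[float],
    mse_for: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Drop features in ascending |coef| order; return the lowest-MSE subset."""
    order = sorted(range(len(coefs)), key=lambda i: abs(coefs[i]))
    kept = list(range(len(coefs)))
    best_subset, best_mse = list(kept), mse_for(kept)
    for index in order[:-1]:          # always keep at least one feature
        kept.remove(index)
        mse = mse_for(kept)           # assumption: refit and score on the kept set
        if mse < best_mse:
            best_subset, best_mse = list(kept), mse
    return best_subset, best_mse


def toy_mse(kept: List[int]) -> float:
    """Toy error: features 1 and 3 add noise, features 0 and 2 carry signal."""
    noise = sum(1 for i in kept if i in (1, 3))
    missing = sum(1 for i in (0, 2) if i not in kept)
    return 0.5 + 0.2 * noise + 1.0 * missing


coefs = [0.9, 0.05, -0.6, 0.01]       # hypothetical PLS coefficients
subset, mse = backward_eliminate(coefs, toy_mse)
cutoff = min(abs(coefs[i]) for i in subset)   # height of the cut line
print(subset, mse, cutoff)
```

Running the same function once per candidate component count and keeping the overall minimum would cover the outer loop in step 1.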

Addition: You're right, with a horizontal line that the user can move, they could set the coefficient limit wherever they want. Not sure how to implement this though, will have to do some reading. Also, the optimum number of components will then change as well.
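
A minimal sketch of what a user-set cut would do to the selection, assuming the line's height is read as a threshold on the absolute coefficient. `select_by_cutoff` and the sample coefficients are hypothetical. In pyqtgraph, a movable horizontal `InfiniteLine` could plausibly supply the threshold through its `value()` whenever `sigPositionChanged` fires, but that wiring is not shown here.

```python
from typing import List, Sequence


def select_by_cutoff(coefs: Sequence[float], cutoff: float) -> List[int]:
    """Return the indices whose absolute coefficient reaches the cutoff."""
    return [i for i, coef in enumerate(coefs) if abs(coef) >= cutoff]


coefs = [0.9, 0.05, -0.6, 0.01]              # hypothetical PLS coefficients
kept_at_half = select_by_cutoff(coefs, 0.5)   # line dragged to 0.5
kept_at_zero = select_by_cutoff(coefs, 0.0)   # line at the bottom keeps everything
print(kept_at_half, kept_at_zero)
```

Because the kept set changes with every move of the line, the number of components would have to be re-optimized afterwards, as noted above.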

markotoplak commented 4 years ago

To see how to do movable lines, try moving the integral limit lines (red lines at the edge of the spectra below) in OWHyper.

I would actually keep optimization/algorithm and this display separate. This widget, I think, should only do display.

For example, PLS should have an option to optimize the number of components. Your widget should input a Learner. Then you will be able to connect PLS's Learner output to this widget. If this widget takes a Learner in the same way as OWRank does, then it will be general and also usable with linear methods and Random Forests.

Please see what OWRank does. Try to repeat the workflow I once shared. I think your widget should only do a better display of importances for spectral data.

Tell me if this makes sense to you.

realrk95 commented 4 years ago

Just to be clear, you're recommending:

  1. PLS widget should have option to optimize the number of components
  2. Variable Selection should input a model and output the plot and optimized data

This separation can be done. It's now working with Linear, Logistic, and PLSR. But not with others. I'm trying to add compatibility, but it will take some time.

markotoplak commented 4 years ago

@realrk69, yes. Even more precisely, I am recommending that Variable Selection should input an Orange Learner (something that fits models - not a fitted model).

Well, you cannot make it work with everything, because this functionality is method dependent. I would go for making it compatible with descendants of LearnerScorer and only using its interface to obtain scores. Apart from the ones you listed, at least Random Forests is also a LearnerScorer. If done this way, your widget will have no model-specific code, which will make it very flexible.
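
To illustrate the design, here is a sketch in which the display code depends only on a scoring interface. It deliberately does not use Orange's actual `LearnerScorer` API: `FeatureScorer`, `score_features`, `AbsCovarianceScorer` and `rank_features` are hypothetical stand-ins, and the data is made up.

```python
from typing import List, Protocol, Sequence, Tuple


class FeatureScorer(Protocol):
    """Hypothetical interface: anything that returns one score per feature."""

    def score_features(
        self, X: Sequence[Sequence[float]], y: Sequence[float]
    ) -> List[float]: ...


class AbsCovarianceScorer:
    """Toy scorer: absolute covariance between each column and the target."""

    def score_features(self, X, y):
        n = len(y)
        mean_y = sum(y) / n
        scores = []
        for column in zip(*X):        # iterate over features, not rows
            mean_x = sum(column) / n
            cov = sum((x - mean_x) * (t - mean_y) for x, t in zip(column, y)) / n
            scores.append(abs(cov))
        return scores


def rank_features(scorer: FeatureScorer, X, y) -> List[Tuple[int, float]]:
    """Model-agnostic part: only the interface is used, never the model type."""
    scores = scorer.score_features(X, y)
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


X = [[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]]   # hypothetical spectra (rows)
y = [1, 2, 3, 4]                                   # hypothetical target
ranking = rank_features(AbsCovarianceScorer(), X, y)
print(ranking)
```

Any other object with the same method could be passed to `rank_features` unchanged, which is the flexibility the comment above is after.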