borondics closed this 4 weeks ago
Also add the preprocessor to test_preprocess.PREPROCESSORS_INDEPENDENT_SAMPLES
for torture tests.
Also, how did you ensure that the code in transform
does what you want? If you did any comparisons with existing implementations, make a test that encompasses that comparison.
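Such a comparison test could look roughly like the following sketch: hardcode the values an independent implementation produced and assert the preprocessor reproduces them. The `transform` function and the reference numbers here are hypothetical stand-ins; the real test would call `MNFDenoising` on the actual dataset.

```python
import numpy as np

# Hypothetical numbers standing in for the output of an independent
# reference implementation on a toy input.
REFERENCE_INPUT = np.array([[1.0, 2.0], [3.0, 4.0]])
REFERENCE_OUTPUT = np.array([[2.0, 4.0], [6.0, 8.0]])

def transform(X):
    # Stand-in for the preprocessor's transform; the real test would
    # run MNFDenoising here instead.
    return 2 * X

def test_matches_independent_implementation():
    # Compare against the baked-in reference values with a tolerance.
    np.testing.assert_allclose(transform(REFERENCE_INPUT),
                               REFERENCE_OUTPUT, rtol=1e-6)

test_matches_independent_implementation()
```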
Thanks for the suggestions!
I did both.
However, should we also add PCADenoising to test_preprocess.PREPROCESSORS_INDEPENDENT_SAMPLES? It wasn't there.
For the test I used a small script implemented by colleagues to get the calculated values for iris, and took those numbers. I will also verify this with a spectral dataset.
A question: PyLint complains about MNFDenoising.sd, which seems unnecessary, but the same attribute is also in PCADenoising... Should we remove both?
Ha! The new tests didn't pass. :D
Sorry, I was wrong: this is not an independent-samples preprocessor, just as PCA is not. I should have said PREPROCESSORS_GROUPS_OF_SAMPLES.
Moved it, but there is still a problem. Possibly not only with the new addition...
@borondics The peakfit failures are due to #720 which should be resolved as soon as lmfit 1.3.2 is released.
@borondics, please rebase to master so that tests will (mostly) work.
The MNF-related tests are passing on my computer, but I could implement this in various ways. @markotoplak, right now I am using a try block and raising errors while still returning the original data if the MNF fails. What do you say?
OK, I think now the MNF tests are also passing here.
@borondics, some tests fail. Your code cannot handle unknowns. You probably need to extend CommonDomainOrderUnknowns
instead of CommonDomain, and that should be it. This was not needed for PCA because Orange's PCA handles them.
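The idea behind unknown handling is roughly the following: fill the NaN positions by interpolating from the known values before running the computation. This is a simplified, self-contained sketch of that behavior, not the actual Orange `CommonDomainOrderUnknowns` implementation.

```python
import numpy as np

def interp_unknowns(y, x=None):
    """Fill NaN (unknown) values in a spectrum by linear interpolation
    from the neighboring known points. Simplified sketch of
    CommonDomainOrderUnknowns-style handling, not the Orange code."""
    y = np.asarray(y, dtype=float)
    if x is None:
        x = np.arange(len(y))
    nans = np.isnan(y)
    if nans.all():
        return y  # nothing to interpolate from
    y = y.copy()
    # Interpolate the unknown positions from the known ones.
    y[nans] = np.interp(x[nans], x[~nans], y[~nans])
    return y
```

For example, `interp_unknowns([1.0, np.nan, 3.0])` fills the middle unknown with the linearly interpolated value 2.0.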
But if I think about it, even PCA denoising should use that one, because the interpolation it uses is better for spectra.
Another test fails because the results of this preprocessor are so unstable that small changes in the data produce very different values. Is that expected? If so, we could ignore it in that test.
BTW, this is a sign of high instability of the method and makes its usability questionable.
That "testing for nans" commit is probably unnecessary if you extend the correct class, and in any case it should not be done that way. If you ever need something like this, make the try/except cover the least code possible.
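To illustrate the "smallest possible try block" point with a hedged sketch (the SVD call and the fallback here are hypothetical, not the PR's actual code): wrap only the single call that can fail, and keep all the surrounding bookkeeping outside the handler.

```python
import numpy as np

def denoise(X):
    # Wrap ONLY the call that can actually fail (hypothetically, an SVD
    # that may not converge), not the surrounding logic.
    try:
        u, s, vt = np.linalg.svd(X, full_matrices=False)
    except np.linalg.LinAlgError:
        return X  # fall back to the original data only on that failure
    # Everything below runs outside the try block, so unrelated bugs
    # here are not silently swallowed by the except clause.
    k = max(1, len(s) // 2)
    return (u[:, :k] * s[:k]) @ vt[:k]
```

Keeping the try narrow means an unexpected error in the reconstruction step surfaces as a real traceback instead of being masked as an "MNF failure".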
@borondics, I rebased and made this robust to unknown values as I pointed out above.
Now only one test is failing.
I also find the original (and current) computation code suspicious. When computing the N matrix, all differences were computed and then an additional vector of zeros was appended. This seems like a bug, and your tests also pass if I set N = diffs. How did you verify that the code is fine?
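For context, here is a self-contained sketch of why the appended zero row looks inert (this is my reconstruction of the pattern, not the PR's actual code): MNF-style methods estimate noise from adjacent-sample differences, and a row of zeros contributes nothing to the resulting scatter matrix, so `N = diffs` gives the same unnormalized noise covariance.

```python
import numpy as np

def noise_estimate(X, pad_zero_row=False):
    # Shift-difference noise estimate used in MNF-style denoising:
    # differences between adjacent samples approximate the noise.
    diffs = np.diff(X, axis=0)
    if pad_zero_row:
        # The questionable variant: append a row of zeros so N has as
        # many rows as X. The zero row adds nothing to N.T @ N; it can
        # only matter if a later step normalizes by the row count.
        diffs = np.vstack([diffs, np.zeros(X.shape[1])])
    return diffs

X = np.random.default_rng(0).normal(size=(5, 3))
N1 = noise_estimate(X)                     # N = diffs
N2 = noise_estimate(X, pad_zero_row=True)  # diffs plus a zero row
# The unnormalized noise scatter matrices are identical:
# N1.T @ N1 == N2.T @ N2
```

So the two variants only diverge if the noise covariance is divided by the number of rows of N, which would silently change the scaling.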
I have some additional suspicions due to the non-passing test. :)
@borondics, I am merging this assuming that the failed test is not problematic (I just ignored it).
Please consider again whether the computation is correct. The method's sensitivity to small perturbations in the data is problematic even on bigger data, like the whole collagen dataset.
Even if the implementation is correct, I would not use the method on data like collagen because of its sensitivity to noise. (If the method is meant to be applied to different kinds of data, perhaps already preprocessed in a certain way, it should of course be tested with that.)
Added a new denoising method with the help of SOLARIS colleagues.
The paper it is based on is this.