diffpy / diffpy.snmf


Double check formula for adding `lift_data` to input signal data #130

Open bobleesj opened 4 weeks ago

bobleesj commented 4 weeks ago

Problem

lift_data is applied to input signals below:

def main():
    args = create_parser()
    # Load input data, prepare tuple((1850, 1) and (1850, 50))
    grid, input_data = load_input_signals(args.input_directory)

    # Apply lifting factor to input signal data
    lifted_input_data = lift_data(input_data, args.lift_factor)
    ...

The following is the implementation of lift_data:

def lift_data(data_input: np.ndarray, lift=1):
    """Lifts values of data_input.

    Adds 'lift' * the minimum value in data_input to data_input element-wise.

    Parameters
    ----------
    data_input: 2d array like
      The matrix containing a series of signals to be decomposed. Has dimensions N x M where N is the length
      of each signal and M is the number of signals.

    lift: float
      The factor representing how much to lift 'data_input'.

    Returns
    -------
    2d array like
      The matrix that contains data_input - (min(data_input) * lift).
    """
    return data_input + np.abs(np.min(data_input) * lift)

While the function's docstring says it returns `data_input - (min(data_input) * lift)`, the actual code is written as `return data_input + np.abs(np.min(data_input) * lift)`.
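
For a concrete picture of the mismatch (toy numbers, not project data), note that the two expressions agree only when the data minimum is negative:

import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy matrix with a positive minimum (1.0)
lift = 0.5

as_coded = data + np.abs(np.min(data) * lift)  # what the code does: min becomes 1.5
as_documented = data - (np.min(data) * lift)   # what the docstring says: min becomes 0.5

print(as_coded.min(), as_documented.min())  # 1.5 0.5 -- they diverge for positive minima
# For min(data) <= 0 and lift >= 0 the two coincide: -(min * lift) == abs(min * lift)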

Proposed solution

Double-check the literature (if there is any)

Fil158 commented 4 weeks ago

Ok, I'll do it and reply to you

sbillinge commented 3 weeks ago

Also, since this is an additive correction and not a multiplicative one, let's not call it a factor. We could use 'term', but it doesn't seem so clear in this context; maybe 'quantity', 'amount', or 'value'? So `lift_amount` or `lift_value`. I leave the decision up to you guys.

@bobleesj thanks for catching that. Was it not caught by a test though? That is not good.....it probably means that the tests were written by running the function, getting the result, then coding that into the test! ugh! Or maybe it is a case where we didn't have a test. Not sure, I didn't look. In general folks: tests first, code after.....

Fil158 commented 3 weeks ago

I searched the literature a bit: I couldn't find anything identical, but I did find this method, which has about the same function as ours and could be useful because it normalises spectra (it is mainly used in spectroscopy); for us it would be patterns.

import numpy as np

def snv(data):
    # Vectorized approach to SNV (standard normal variate)
    mean = np.mean(data, axis=1, keepdims=True)  # mean of each row
    std = np.std(data, axis=1, keepdims=True)    # std of each row
    return (data - mean) / std

# Usage
normalized = snv(spectra)  # spectra is a 2D array where each row is a spectrum

Other methods are for background subtraction but this is not the case here.

I think `lift_value` is the best name for the term.

bobleesj commented 3 weeks ago

Was it not caught by a test though? that is not good.....it probably means that the tests were written by running the function, getting the result, then coding that into the test! ugh!

I am afraid that's the case, especially given that none of the test functions use real data; all of the test inputs are hard-coded dummy matrices.
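
For example, a test pinned to the documented behavior rather than to the function's own output would have caught the sign flip. A hypothetical sketch (the import path is assumed):

import numpy as np
from diffpy.snmf.io import lift_data  # import path assumed

def test_lift_data_matches_docstring():
    # The docstring promises data_input - (min(data_input) * lift)
    data = np.array([[1.0, 2.0], [3.0, 4.0]])
    expected = data - np.min(data) * 0.5
    np.testing.assert_allclose(lift_data(data, lift=0.5), expected)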

bobleesj commented 3 weeks ago

I did find this method which has about the same function as ours [...] I think `lift_value` is the best name for the term

sorry, I am a bit confused. We don't want to include new functions/features, since the goal is to design our code in a way that reproduces the plots and figures from matlab. The question is, do we do `data_input + np.abs(np.min(data_input) * lift)` or `data_input - np.abs(np.min(data_input) * lift)`? Maybe the matlab code has it? @Fil158

sbillinge commented 3 weeks ago

The question is, do we do `data_input + np.abs(np.min(data_input) * lift)` or `data_input - np.abs(np.min(data_input) * lift)`? Maybe the matlab code has it?

yes, sorry for the confusion. This is not a research/literature issue; it is checking what Ran did and making sure that we coded it up correctly.

Fil158 commented 3 weeks ago

%% remove baseline (lift)
data_input = Data - min(Data(:));

This is the code in matlab that removes the baseline by subtracting the minimum value of the entire data matrix from each element. It returns all values shifted such that the minimum value is 0.
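
For reference, a direct NumPy translation of that line would presumably look like this (toy `Data` as a stand-in for the loaded signal matrix):

import numpy as np

Data = np.array([[2.0, 5.0], [3.0, 8.0]])  # stand-in for the loaded signal matrix

# MATLAB: data_input = Data - min(Data(:));
# min(Data(:)) is the global minimum, i.e. np.min over the whole array
data_input = Data - np.min(Data)
print(data_input.min())  # 0.0 -- the minimum is shifted to exactly zero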

Fil158 commented 3 weeks ago

So it seems to be more similar to `data_input - np.abs(np.min(data_input) * lift)`, but I'm afraid that subtraction could potentially cause more problems, so maybe it is better to use `data_input + np.abs(np.min(data_input) * lift)`.

bobleesj commented 3 weeks ago

but I'm afraid that subtraction could potentially cause more problems

If we consider the matlab code as our source of truth, what other problems would it have?

Fil158 commented 3 weeks ago

I can think of a situation where you have 0s in the matrix and, for some reason, the code subtracts a minimum value calculated only on the strictly positive numbers, so you get negative values. It's probably my mindless paranoia....

bobleesj commented 3 weeks ago

@Fil158 Probably not? Say, 1 - 1 * 0.1 is still positive, assuming the data input is positive and the lift factor is less than 1 (could you double-check this for me?).

Fil158 commented 3 weeks ago

As I understand it, non-negative values are required for NMF, not necessarily strictly positive (or greater than 1) values, so I think it could work.
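
A quick numeric sanity check of that reasoning (toy non-negative data, lift <= 1):

import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 10.0, size=(1850, 50))  # non-negative toy signals
lift = 0.1

lifted = data - np.abs(np.min(data) * lift)
# Each element x satisfies x - min*lift >= min*(1 - lift) >= 0 for 0 <= lift <= 1
assert (lifted >= 0).all()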

bobleesj commented 3 weeks ago

Ok could you correct the function? It will prob break tests as expected.

Also, have you looked into the overall algorithm implemented in matlab and described in the paper? Wondering about the current status now.

sbillinge commented 3 weeks ago

Let's code to the matlab. I don't see a reason why we don't just subtract the minimum. Other considerations are (a) code readability: I think subtracting the minimum is clearest here; and (b) code performance: probably not an issue, but it is usually a bad idea to take absolute values, especially in regression problems, because they are not differentiable in general and introduce instabilities when using gradient methods. Ran is a professional mathematician who is an expert in regression problems, and makes some choices that are obvious to him without telling us why....in general we should go with what he did unless we have a very good reason to change.
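
For concreteness, if we code to the matlab, the corrected function might look something like this (a sketch only; it keeps the existing `lift` parameter, whose rename to `lift_value`/`lift_amount` is discussed above):

import numpy as np

def lift_data(data_input, lift=1):
    """Shift data_input so its global minimum moves toward zero.

    Mirrors the MATLAB baseline removal, Data - min(Data(:));
    with lift=1 the minimum of the result is exactly 0.
    """
    # Subtract the scaled global minimum (no np.abs), following the MATLAB code.
    return data_input - np.min(data_input) * lift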