cremerlab / hplc-py

A Python utility for the processing and quantification of chromatography data
https://cremerlab.github.io/hplc-py/
GNU General Public License v3.0

Unexpected bug with my data #14

Closed lionel42 closed 3 months ago

lionel42 commented 3 months ago

Hello,

Attached are two files showing problems I had with two samples of my data.

The bugs occur during the scipy optimization.

I am simply running:

from hplc.quant import Chromatogram

chrom = Chromatogram(df)
peaks = chrom.fit_peaks(buffer=0)

I tested the example in your documentation and everything worked for me. I don't see anything in the data that is a no-go for chromatography, so I would have expected the package to work ;)

Data: bugA.csv bugB.csv

Chromatograms:

FigureA FigureB

gchure commented 3 months ago

Hi @lionel42, thanks for opening an issue. Could you please provide more information as to what the bug is? What do you think should happen? What are the two chromatograms you are showing here?

gchure commented 3 months ago

Took a quick look at bugA.csv and saw that it couldn't be properly fit (it fails with an `x0` is infeasible error, meaning there are problems with the bounds). I passed correct_baseline=False to the fit_peaks call and saw that the fitting was "working", but with the wrong amplitude:

Screenshot 2024-04-09 at 21 10 56

I noticed that the time dimension in your provided files is int, which is not something I had in any test case, and that these are very long time axes (which shouldn't matter). I rescaled the time axis to be evenly spaced between 0 and 10, and it worked:

import numpy as np
import hplc.quant

df['time'] = np.linspace(0, 10, len(df['time']))
chrom = hplc.quant.Chromatogram(df)
peaks = chrom.fit_peaks()
chrom.show()
Screenshot 2024-04-09 at 21 11 49

This also works with bugB.csv.

Screenshot 2024-04-09 at 21 18 52

I don't understand what the failure mode is, but I would like to figure it out! You should be able to get this to work reliably so long as you rescale that x axis coherently between samples.
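For example, one coherent choice would be to apply the same fixed conversion to every file, rather than normalizing each file to its own extremes (a sketch; the conversion factor is hypothetical):

# Hypothetical factor converting acquisition bins to a common time unit;
# applying the same factor to every sample keeps retention times comparable.
BIN_SCALE = 1e-3
df['time'] = df['time'] * BIN_SCALE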

Can you give me more information about these samples? Did you do any filtering or post-processing on the chromatograms before passing them into hplc-py? Why are both the time and signal domains integers, and what are their units?

lionel42 commented 3 months ago

Thanks for your help! @gchure

The chromatograms come from a GC instrument that a colleague of mine runs with its own custom software (which I don't know). The integer values are probably just due to some rounding applied by the software that produces the data. But I suspect the time is not recorded in time units, but as acquisition bins.

Just a few suggestions so that newcomers like me don't run into this issue.

Let me know if this makes sense; just in case, I would be happy to contribute if you don't have time to implement it ;)

gchure commented 3 months ago

Feel free to fork and submit a PR. I don't think the type is the issue -- it should be fine with int as well because the retention times are inferred as floats. I'm not sure that rescaling the peak windows is a robust solution, though it may work well. If you want to give that a shot, feel free. When I have a little more time on my hands (hopefully in a month or so), I'll also take a look.

lionel42 commented 3 months ago

I applied some normalization to my spectra before doing the optimization and found a few interesting things:

# Code for the normalization
normalize = lambda x, scale=1.: (x - x.min()) / (x.max() - x.min()) * scale

df_corrected = df.copy()
df_corrected["time"] = normalize(df["time"], scale=10.)
df_corrected["signal"] = normalize(df["signal"])

I had a look at the source code but did not come up with a simple place to include that normalization.

example1.csv

In this case, scaling the signal axis does not seem to have any impact. I tried scaling the time axis to different values:

example2.csv

In this case, scaling up the signal to 100 made the algorithm fail.

If I had to conclude, I would say the algorithm is quite unstable and should be improved.

Having a "wrong" scale makes the algorithm fail, so there seems to be a scale at which it works best, and normalizing the data toward that scale would make a big difference.

Is there something like a dataset on which we could test the performance of different methods? If not, I am happy to provide some peaks.

gchure commented 3 months ago

Thanks for the detail. I did not benchmark this package on GC data (which inherently has different x-axis scaling), though I'm surprised this is showing up. I did benchmark the package heavily on both simulated chromatograms (see the extensive test suite) and on real experimental chromatograms (see the corresponding JOSS paper and our recent work applying this to complex environmental samples).

I appreciate your concern about normalization, as I share it, but I do want to push back on the claim that the algorithm is "unstable". Could you share your raw GC chromatogram file so I can take a look at it? Could you also share any code you used to generate the csv's you have posted here?

As to where normalization should take place, I think the best place would be here, where a normalization of the time domain would precede the optimization call. Immediately after the optimization call, the inverse transformation should be applied to recover the true retention times and areas. I need to think more about that transformation, but it should just be a multiplicative operation.
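For concreteness, a rough sketch of that round trip (placeholder functions, not hplc-py internals):

import numpy as np

def rescale_time(t, target_max=10.0):
    # Map the time axis onto [0, target_max], keeping the factors
    # needed to invert the transformation after the fit.
    t = np.asarray(t, dtype=float)
    t0 = t.min()
    scale = target_max / (t.max() - t0)
    return (t - t0) * scale, t0, scale

def invert_retention_time(rt_scaled, t0, scale):
    # Recover the true retention time from the scaled one; fitted
    # areas would similarly be divided by `scale`.
    return rt_scaled / scale + t0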

lionel42 commented 3 months ago

@gchure Thanks for your answer.

Here are the files:

fileB.txt fileC.txt fileD.txt fileA.txt

And here is the code. I was interested in some specific peaks (the largest ones) and in comparing them between the files, so this is the code that does that. It performs many fits. To produce the chromatograms I shared, I used pandas.to_csv.

# Imports needed to run this snippet
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from hplc.quant import Chromatogram

# Put the path to the files here
folder = Path(r"...\folder")
files = [
    "fileA.txt",
    "fileB.txt",
    "fileC.txt",
    "fileD.txt",
]

files = [folder / file for file in files]

# Read with pandas
dfs = {
    filepath.stem: pd.read_csv(filepath, sep="\t", header=None, names=["time", "signal", "???"])
    for filepath in files
}

# I selected some rt intervals where the peaks are located:
peaks_to_extract = [
    (1380, 1440),
    (1870, 1920),
    (2360, 2405),
    (2405, 2470),
    (2590, 2650),
]

# Then simply run this on the selected parts of the chromatogram.
# Plot all the dfs on top of each other.
n_peaks = len(peaks_to_extract)
fig, axes = plt.subplots(n_peaks, figsize=(10, 5 * n_peaks))
for ax, peak_interval in zip(axes, peaks_to_extract):
    print("Interval: ", peak_interval)
    for key, df in dfs.items():
        df = df.loc[peak_interval[0]:peak_interval[1]]
        ax.plot(df["time"], df["signal"], label=key)

        # Integrate the peak by simply summing the signal over the time interval
        dt = 1
        peak_integral = df["signal"].sum() * dt
        print(f"{key:15s}: \t{peak_integral:.2f}")

        # Also fit the peaks
        df_corrected = df.copy()
        normalize = lambda x, scale=1.: (x - x.min()) / (x.max() - x.min()) * scale
        df_corrected["time"] = normalize(df["time"], scale=10.)
        df_corrected["signal"] = normalize(df["signal"], scale=100.)

        chrom = Chromatogram(df_corrected)
        peaks = chrom.fit_peaks()

        chrom.show()

ax.legend()
plt.show()

I hope this helps

gchure commented 3 months ago

Thanks. This helps a bit. I'm playing around with it now, and a cursory look at the data shows that setting a high peak prominence and expanding the approximate peak width for the background correction are necessary.
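As a sketch of what that could look like (the values are illustrative, not tuned; this assumes fit_peaks accepts the prominence and approx_peak_width keywords described in the docs):

peaks = chrom.fit_peaks(prominence=0.5, approx_peak_width=500)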

I'll try to look at this more closely over the next week or so.

gchure commented 3 months ago

Ah, found something good here. If I increase the amplitude bounds in the fitting, it works very well (on FileB.txt at least):

import pandas as pd
from hplc.quant import Chromatogram

data = pd.read_csv('./FileB.txt', delimiter='\t')
data.columns = ['time', 'int1', 'int2']
chrom = Chromatogram(data, cols={'time': 'time', 'signal': 'int1'})
chrom.crop([1000, 3000])
peaks = chrom.fit_peaks(approx_peak_width=500, param_bounds={'amplitude': [0.001, 100]})
chrom.show()
Screenshot 2024-04-15 at 11 47 41

A general fix for this issue may be in making the parameter bounds more permissive.

lionel42 commented 3 months ago

Thanks, this looks good to me.

Just to note, I also had an issue with another spectrum that contains some negative values.

What do you think about having an improved exception when the fitting fails, telling the user to check the parameters and redirecting to a page in the documentation? Maybe a small tutorial on how to set up the parameters based on how the data look?

These are just suggestions from my side. Feel free to do whatever you think is best.

You can close the issue if you want. Thanks a lot for all your help.

gchure commented 3 months ago

What do you think about having an improved exception when the fitting fails, telling the user to check the parameters and redirecting to a page in the documentation?

Yeah, that's a good idea. Catching that exception is a bit tricky because there are multiple things that can yield an `x0` is infeasible error (see #15 as well): the initial guess could be out of bounds (though I've tested that error mode), or the bounds may not be proper for the initial guess (as was the case here). Either way, I will think about how to catch that exception and give the user a more intuitive way to break it down. If you want to try putting in that exception handling (and writing unit tests for it), you should feel free to fork and submit a pull request.
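For illustration, a minimal sketch (not hplc-py code; run_fit stands in for the internal call into scipy.optimize) of how that error could be wrapped in a friendlier message:

def fit_with_hint(run_fit, *args, **kwargs):
    try:
        return run_fit(*args, **kwargs)
    except ValueError as err:
        if 'infeasible' in str(err):
            raise ValueError(
                "Peak fitting failed because an initial guess was incompatible "
                "with its bounds. Try widening `param_bounds` or rescaling the "
                "time axis; see https://cremerlab.github.io/hplc-py/ for guidance."
            ) from err
        raise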

Maybe a small tutorial on how to set up the parameters based on how the data look?

I think that's a good idea. I'll add that to my roadmap over the next month or so.

I'm going to close this issue, but please open it back up if you have more problems. Thanks for sharing the data and giving a very clear explanation of the error mode!