cremerlab / hplc-py

A Python utility for the processing and quantification of chromatography data
https://cremerlab.github.io/hplc-py/
GNU General Public License v3.0

Setting `fit_peaks` prominence below 0.2 results in upper bounds lower than the corresponding lower bounds. #8

Closed · jonathanstathakis closed 8 months ago

jonathanstathakis commented 8 months ago

I am testing the application of fit_peaks to my dataset, whose profile you can see below:

[image: chromatogram signal profile]

The behavior is as expected unless I set peak prominence below 0.2, at which point I receive an error from least_squares informing me that at least one upper bound is less than the corresponding lower bound.

The error is below:

Traceback (most recent call last):
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 125, in <module>
    main()
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 122, in main
    HPLCPy(data)
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 103, in __init__
    peaks = chm.fit_peaks(
            ^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/hplc/quant.py", line 801, in fit_peaks
    peak_props = self.deconvolve_peaks(verbose=verbose,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/hplc/quant.py", line 671, in deconvolve_peaks
    popt, _ = scipy.optimize.curve_fit(self._fit_skewnorms, v['time_range'],
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/scipy/optimize/_minpack_py.py", line 974, in curve_fit
    res = least_squares(func, p0, jac=jac, bounds=bounds, method=method,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/scipy/optimize/_lsq/least_squares.py", line 814, in least_squares
    raise ValueError("Each lower bound must be strictly less than each "
ValueError: Each lower bound must be strictly less than each upper bound.

The bounds in question are below, where 'lb' is lower bound, 'ub' is upper bound:

   lb           ub
0  0.000378979  0.0378979
1  0            0
2  0.0004896    0
3  -inf         inf
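
For context, the check that raises this error is easy to reproduce with the bounds from the table (an illustrative snippet, not hplc-py code):

```python
import numpy as np
import scipy.optimize

# least_squares validates lb < ub element-wise before fitting; rows 1 and 2
# above (lb == ub, and lb > ub) both violate that.
lb = [0.000378979, 0.0, 0.0004896, -np.inf]
ub = [0.0378979, 0.0, 0.0, np.inf]
scipy.optimize.least_squares(lambda p: p, x0=np.zeros(4), bounds=(lb, ub))
# ValueError: Each lower bound must be strictly less than each upper bound.
```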

As I understand it, the prominence is the normalized prominence, i.e. the fraction of the signal maximum used as the threshold for including a given peak. For the same dataset I've been using a value of 0.05 to ignore the smallest peaks with no problem.

jonathanstathakis commented 8 months ago

Never mind; as always, I should have checked my input data more thoroughly. I was inputting an augmented signal of stacked samples, and I suspect the repeating nature of the signal column was causing the issue.

gchure commented 8 months ago

Thanks for looking into that @jonathanstathakis. Please let me know how well hplc-py works on your data!

jonathanstathakis commented 8 months ago

> Thanks for looking into that @jonathanstathakis. Please let me know how well hplc-py works on your data!

Will do. I've made a fork, and I've had to make a few modifications to achieve my goal. Would you rather I make a pull request with my modifications, or raise them as issues?

gchure commented 8 months ago

Feel free to raise issues here -- if you think a PR is worthwhile (i.e. makes a substantial & general contribution to the functioning of hplc-py), go for it.

From my reading of your initial issue, are you trying to set bounds on peak prominence? If so, that's not quite how deconvolve_peaks works. It applies bounds on the amplitude, location, scale, and skew (in that order). Given the chromatogram you showed, it seems like the amplitude bounds applied are not consistent with the scale of the data.

Are you passing the bounds you provided in the table as param_bounds in fit_peaks?
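
For reference, explicitly passing bounds would look something like this sketch (the file/column names and numeric values are placeholders; whether the bounds are absolute or relative to each peak's initial guess should be checked against the hplc-py docs):

```python
import pandas as pd
from hplc.quant import Chromatogram

# 'trace.csv' stands in for the raw chromatogram export.
chrom = Chromatogram(pd.read_csv('trace.csv'),
                     cols={'time': 'time', 'signal': 'signal'})

# Bound the four fit parameters named above: amplitude, location, scale, skew.
peaks = chrom.fit_peaks(
    prominence=0.05,
    param_bounds={'amplitude': [0.1, 10],
                  'location': [1.9, 4.4],   # retention-time window (min)
                  'scale': [0.03, 1.3],     # peak width range (min)
                  'skew': [-5, 5]})
```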

jonathanstathakis commented 8 months ago

Thank you, will do!

Not exactly: I was simply calling fit_peaks, and the bounds I presented earlier were those generated in deconvolve_peaks just before being passed to scipy.optimize.curve_fit. I believe I had prominence set quite high at that point to speed up runtime. I was also alternately receiving the bound error above or an error stating that p0 was out of bounds. But again, to generate this error I was passing an array of signals stacked end to end, and on revisiting the problem I am unable to recreate it.

What I am trying to achieve is a means of joining the peak table and the raw signal to get the intensities of the identified peaks for downstream analysis. The time precision calculation results in retention-time values with two significant figures, which is too coarse for me to use time as the join key. I've removed rounding throughout the process, but the estimated times still do not appear exactly within the original signal. Would you be able to suggest a solution to this predicament? I am no expert on the methods used by your wonderful package, so I'm learning as I go.

gchure commented 8 months ago

Got it, thanks. That's interesting that the periodic signal messed things up. I'd like to figure out why.

The time precision comes from whatever the sampling interval is in your raw chromatogram (see lines 83-84 here: https://github.com/cremerlab/hplc-py/blob/8ce5223f2c1a4accbcafb3f883b8a692fac94337/hplc/quant.py#L83C13-L83C13). In the chromatogram in your initial post, the time column looks like it is just the ordinal index (so your precision would be integer). Do you have a raw time for the trace, or just the indexes? Either way, this could be resolved through a PR where the number of significant figures is user-supplied.
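
Roughly, the idea behind those lines can be sketched as follows (a simplified illustration; the actual implementation may differ):

```python
import numpy as np

# Infer a decimal precision from the mean sampling interval of the time axis.
def infer_time_precision(time: np.ndarray) -> int:
    dt = np.mean(np.diff(time))
    if dt >= 1:
        return 0  # integer time steps -> integer precision
    return int(np.abs(np.floor(np.log10(dt))))

time = np.arange(0, 10, 2 / 60)       # 2 s sampling, expressed in minutes
print(infer_time_precision(time))     # -> 2 decimal places
```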

Also, note that when you run .fit_peaks, the resulting peak table should give you the parameters for each detected peak in the chromatogram. What would you need for your downstream analysis? If you can give more context there, I'll be able to provide more help.

gchure commented 8 months ago

As there's not an open issue here, I'm going to close this (for now), but I'm happy to keep the thread going.

jonathanstathakis commented 8 months ago

Thanks for keeping this going.

TBH, after digging into your code I'm not sure why the bug happened either; further testing showed that peak finding was simply restricted to the sample with the highest prominence, and business proceeded as usual. I'll try to recreate it somewhere down the track.

So the overall goal is to join the peak table with the source signal dataframe, primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table. From there I am looking to automate variance analyses between sample classes and multiclass ML modeling.

The simplest solution for me would be to output the index of each peak corresponding to its location in the input signal, alongside the retention time. Is that a reasonable expectation or does deconvolve_peaks do some trickery I haven't anticipated?

The displayed signal was a quick visualisation; the actual input has a time axis with a sampling interval of 2 seconds per observation. As you might be able to tell, I've already performed smoothing and ASLS baseline subtraction.

gchure commented 8 months ago

Can you elaborate on what you mean by this?

> primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table.

The peak table reports the best-fit values of the signals whose mixture reconstitutes the observed chromatogram. If you know which compounds correspond to which retention times, you can add that as a dictionary in the map_peaks method of the Chromatogram object. That would give you labels. Sorry if I'm not understanding something obvious here.
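
For example, the labeling step might look like this minimal sketch (the file name, column names, compounds, and retention times are all hypothetical, and the exact params schema should be checked against the docs):

```python
import pandas as pd
from hplc.quant import Chromatogram

# 'trace.csv' is a placeholder for a raw chromatogram export with
# 'time' and 'signal' columns.
chrom = Chromatogram(pd.read_csv('trace.csv'),
                     cols={'time': 'time', 'signal': 'signal'})
chrom.fit_peaks(correct_baseline=False)  # baseline already subtracted upstream

# Map known compounds to approximate retention times (hypothetical values).
chrom.map_peaks(params={'tartaric acid': {'retention_time': 2.6},
                        'malic acid': {'retention_time': 3.3}})
```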

If you've already done ASLS background subtraction, make sure you set correct_baseline=False when calling fit_peaks. This package uses SNIP baseline correction by default.

What was your rationale for using ASLS? If interested, that could be included as an alternative baseline correction algorithm.

jonathanstathakis commented 8 months ago

> Can you elaborate on what you mean by this?
>
> > primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table.

For the first part, I intend to compare the results of an external peak-profiling software against the results of fit_peaks, for external validation. For the second, I mean to use the 'retention_time' columns of both the original signal and the peak table output by fit_peaks to join the original signal table and the peak table together, i.e. a SQL-style join. In that context the retention_time column acts as a primary key, with observation time as a unique identifier. The result would be a table with the raw signal in one column and all the columns from the peak table stretched out to form sparse columns, with values where the peak maxima fell in the original signal and empty values everywhere else.
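
In pandas terms, something like this toy sketch (column names and values are illustrative):

```python
import pandas as pd

# Stand-ins for the raw signal and the fit_peaks output.
signal = pd.DataFrame({'time': [2.5, 2.6, 2.7],
                       'signal': [150.0, 327.8, 160.0]})
peaks = pd.DataFrame({'retention_time': [2.6], 'amplitude': [327.8]})

# Left join on time: peak columns are populated only at peak-maximum rows.
# This only works if retention times appear exactly on the time axis,
# which is the crux of the precision problem described above.
joined = signal.merge(peaks, left_on='time',
                      right_on='retention_time', how='left')
print(joined)
```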

> If you've already done ASLS background subtraction, make sure you set correct_baseline=False when calling fit_peaks

Done

> What was your rationale for using ASLS?

Nothing beyond the fact that a fair amount of the literature I have seen uses Whittaker baselines for chromatographic signals, and it performed well enough without over- or underfitting. I used pybaselines, which provides more intricate approaches, but I thought it best to keep things simple since the result was acceptable. I am no expert in signal correction, though, and I had not heard of SNIP before.
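
For reference, my ASLS step looked roughly like this (a sketch with pybaselines on toy data; the lam and p values are illustrative and need tuning per signal):

```python
import numpy as np
from pybaselines import Baseline

# Toy chromatogram: one peak riding on a drifting baseline.
x = np.linspace(0, 30, 900)                    # 2 s sampling over 30 min
y = np.exp(-((x - 15) ** 2) / 0.1) + 0.01 * x

# ASLS (asymmetric least squares), a Whittaker-smoother baseline method.
fitter = Baseline(x_data=x)
baseline, _ = fitter.asls(y, lam=1e7, p=0.02)
corrected = y - baseline
```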

I am still in the process of understanding peak deconvolution, and I was wondering whether it is sensible to expect the raw-signal peak maximum time and the deconvolved peak maximum time to be equal, or whether there is some variation inherent to the process?

jonathanstathakis commented 8 months ago

> I am still in the process of understanding peak deconvolution, and I was wondering whether it is sensible to expect the raw-signal peak maximum time and the deconvolved peak maximum time to be equal, or whether there is some variation inherent to the process?

Never mind, I just discovered the very thorough "How it Works" pages! I see now that the peak time does change after deconvolution, or at least has the capacity to, so expecting the peak times to be the same is incorrect.

I am still wondering whether it's practical to expose the peak indices from the first find_peaks call to provide a link between the raw signal and the fitted peaks? Call it 'orig_peak_idx' or something.

gchure commented 8 months ago

Cool! Glad the "How it Works" pages were useful. You should be able to get the peak indices from a private attribute, chrom._peak_indices. That returns the indices of each peak on the cropped time domain.
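
For example (a sketch; the file and column names are placeholders, chrom.df as the cropped frame is an assumption, and private attributes may change between versions):

```python
import pandas as pd
from hplc.quant import Chromatogram

chrom = Chromatogram(pd.read_csv('trace.csv'),
                     cols={'time': 'time', 'signal': 'signal'})
chrom.fit_peaks()

# Indices of the detected peaks on the cropped time domain.
idx = chrom._peak_indices
raw_peak_rows = chrom.df.iloc[idx]  # rows of the trace at the raw peak maxima
```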

gchure commented 8 months ago

Also, if it is helpful, I have a preprint outlining the major functionality of this software. The short answer is that with skew-normal distributions (which is what hplc-py fits under the hood), the retention time (the location parameter) does not always coincide with the peak maximum; that depends on the skew parameter. This plot shows an amplitude-weighted skew-normal distribution with the same location and scale parameters but varying skew parameters (alpha). Note that the peak location shifts away from 0 under high skewness.

[image: amplitude-weighted skew-normal distributions with identical location and scale but varying skew parameter alpha]
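
The same effect can be checked numerically with scipy's skew-normal (an illustrative snippet, not hplc-py code):

```python
import numpy as np
from scipy.stats import skewnorm

# The mode of a skew-normal drifts away from its location parameter
# (loc=0 here) as the skew parameter alpha grows.
x = np.linspace(-3, 3, 1001)
for alpha in [0, 1, 5]:
    mode = x[np.argmax(skewnorm.pdf(x, alpha, loc=0, scale=1))]
    print(f"alpha={alpha}: mode at {mode:.2f}")
```
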
jonathanstathakis commented 8 months ago

Thank you for linking the preprint, much appreciated.

Unfortunately, I have encountered issues with the default bounds again; they appear to be peak-specific problems. The first issue was an initial guess for width lower than the lower bound (i.e. an initial width guess smaller than the time step). I rectified that by setting the lower bound to 10% less than the time step, but then I encountered an error with a later peak whose initial guess for the time range was larger than the upper bound for that same parameter.

I've been troubleshooting this for a while now and getting nowhere. Do you have any recommendations on how to proceed from here?

jonathanstathakis commented 8 months ago

Btw, this is the signal extracted from Chromatogram.window_df:

[image: signal extracted from Chromatogram.window_df]

jonathanstathakis commented 8 months ago

As a follow-up, here is more information about my current settings and the window and peak in question:

fit_peaks parameters

parameter          value
tbl_name           fit_peak_input_params
tolerance          0.5
prominence         0.01
rel_height         1
approx_peak_width  0.7
buffer             1
max_iter           1000000
precision          9

Window Parameters

   tbl_name     window_idx  num_peaks  signal_area
0  window_info  2           4          2735.48

Window Peak Fitting Parameters

    tbl_name              window_idx  peak_idx  location  amplitude  width      params     lb         guess      ub       delta_lb     delta_ub  is_oob
0   window_peak_fit_info  2           1         2.6       327.845    0.157637   amplitude  32.7845    327.845    3278.45  295.061      2950.61   0
1   window_peak_fit_info  2           1         2.6       327.845    0.157637   location   1.9        2.6        4.43333  0.7          1.83333   0
2   window_peak_fit_info  2           1         2.6       327.845    0.157637   scale      0.0333333  0.0788184  1.26667  0.0454851    1.18785   0
3   window_peak_fit_info  2           1         2.6       327.845    0.157637   skew       -inf       0          inf      inf          inf       0
4   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  amplitude  16.5211    165.211    1652.11  148.69       1486.9    0
5   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  location   1.9        2.76667    4.43333  0.866667     1.66667   0
6   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  scale      0.0333333  0.0301172  1.26667  -0.00321612  1.23655   1
7   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  skew       -inf       0          inf      inf          inf       0
8   window_peak_fit_info  2           3         3.3       12.5149    0.169807   amplitude  1.25149    12.5149    125.149  11.2635      112.635   0
9   window_peak_fit_info  2           3         3.3       12.5149    0.169807   location   1.9        3.3        4.43333  1.4          1.13333   0
10  window_peak_fit_info  2           3         3.3       12.5149    0.169807   scale      0.0333333  0.0849036  1.26667  0.0515703    1.18176   0
11  window_peak_fit_info  2           3         3.3       12.5149    0.169807   skew       -inf       0          inf      inf          inf       0
12  window_peak_fit_info  2           4         3.8       23.4608    0.235757   amplitude  2.34608    23.4608    234.608  21.1147      211.147   0
13  window_peak_fit_info  2           4         3.8       23.4608    0.235757   location   1.9        3.8        4.43333  1.9          0.633333  0
14  window_peak_fit_info  2           4         3.8       23.4608    0.235757   scale      0.0333333  0.117879   1.26667  0.0845453    1.14879   0
15  window_peak_fit_info  2           4         3.8       23.4608    0.235757   skew       -inf       0          inf      inf          inf       0

Error Peak

Focusing on the erroneous peak:

   tbl_name              window_idx  peak_idx  location  amplitude  width      params  lb         guess      ub       delta_lb     delta_ub  is_oob
6  window_peak_fit_info  2           2         2.76667   165.211    0.0602344  scale   0.0333333  0.0301172  1.26667  -0.00321612  1.23655   1

As I said before, the initial guess for the peak's scale is less than the lower bound. The following is a visualisation of the window with the peaks and the calculated widths displayed:

[image: window 2 signal with detected peaks and calculated widths]

So it's fairly evident from the display that scipy.signal.peak_widths is not accurately gauging the width of peak 2, causing the guess to fall below the lower bound. I'll be continuing my investigation there, but I hope this information is useful to you. I would be keen to solve this so I can move on with my analysis.
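
For anyone following along, the failure mode can be sketched with two heavily overlapping synthetic peaks (toy values loosely mimicking peaks 1 and 2 above; an illustration, not hplc-py code):

```python
import numpy as np
import scipy.signal

# Two overlapping Gaussians: a broad peak at 2.60 and a narrow shoulder at 2.77.
x = np.linspace(0, 5, 500)
y = (327 * np.exp(-((x - 2.60) ** 2) / 0.05)
     + 165 * np.exp(-((x - 2.77) ** 2) / 0.002))

peaks, _ = scipy.signal.find_peaks(y, prominence=0.01 * y.max())
widths = scipy.signal.peak_widths(y, peaks, rel_height=0.5)[0]

# With overlapping peaks, the half-height widths are measured relative to
# each peak's prominence base, so they can land far from the true widths;
# a scale guess derived from them can then fall outside bounds tied to the
# sampling interval or window span.
print(widths * (x[1] - x[0]))  # widths in time units
```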