cremerlab / hplc-py

A Python utility for the processing and quantification of chromatography data
https://cremerlab.github.io/hplc-py/
GNU General Public License v3.0

Setting `fit_peaks` prominence below 0.2 results in upper bounds lower than the corresponding lower bounds. #8

Closed · jonathanstathakis closed 8 months ago

jonathanstathakis commented 8 months ago

I am testing the application of fit_peaks to my dataset, whose profile you can see below:

[image: chromatogram signal profile]

The behavior is as expected unless I set peak prominence below 0.2, at which point I receive an error from least_squares informing me that at least one upper bound is less than the corresponding lower bound.

The error is below:

Traceback (most recent call last):
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 125, in <module>
    main()
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 122, in main
    HPLCPy(data)
  File "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/dataset_eda/peak_analysis.py", line 103, in __init__
    peaks = chm.fit_peaks(
            ^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/hplc/quant.py", line 801, in fit_peaks
    peak_props = self.deconvolve_peaks(verbose=verbose,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/hplc/quant.py", line 671, in deconvolve_peaks
    popt, _ = scipy.optimize.curve_fit(self._fit_skewnorms, v['time_range'],
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/scipy/optimize/_minpack_py.py", line 974, in curve_fit
    res = least_squares(func, p0, jac=jac, bounds=bounds, method=method,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathan/Library/Caches/pypoetry/virtualenvs/wine-analysis-hplc-uv-F-SbhWjO-py3.11/lib/python3.11/site-packages/scipy/optimize/_lsq/least_squares.py", line 814, in least_squares
    raise ValueError("Each lower bound must be strictly less than each "
ValueError: Each lower bound must be strictly less than each upper bound.

The bounds in question are below, where 'lb' is lower bound, 'ub' is upper bound:

   lb           ub
0  0.000378979  0.0378979
1  0            0
2  0.0004896    0
3  -inf         inf
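
For context, the check that raises this error is easy to reproduce with the bounds from the table (an illustrative snippet, not hplc-py code):

```python
import numpy as np
import scipy.optimize

# least_squares validates lb < ub element-wise before fitting; rows 1 and 2
# above (lb == ub, and lb > ub) both violate that.
lb = [0.000378979, 0.0, 0.0004896, -np.inf]
ub = [0.0378979, 0.0, 0.0, np.inf]
scipy.optimize.least_squares(lambda p: p, x0=np.zeros(4), bounds=(lb, ub))
# ValueError: Each lower bound must be strictly less than each upper bound.
```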

As I understand it, the prominence is the normalized prominence, i.e. the fraction of the signal maximum used as the threshold for including a given peak. For the same dataset I've been using a value of 0.05 to ignore the smallest peaks with no problem.

jonathanstathakis commented 8 months ago

Never mind; as always, I should have checked my input data more thoroughly. I was inputting an augmented signal of stacked samples, and I suspect the repeating nature of the signal column was causing the issue.

gchure commented 8 months ago

Thanks for looking into that @jonathanstathakis. Please let me know how well hplc-py works on your data!

jonathanstathakis commented 8 months ago

> Thanks for looking into that @jonathanstathakis. Please let me know how well hplc-py works on your data!

Will do. I've made a fork, and I've had to make a few modifications to achieve my goal. Would you rather I make a pull request with my modifications, or raise them as issues?

gchure commented 8 months ago

Feel free to raise issues here -- if you think a PR is worthwhile (i.e. makes a substantial & general contribution to the functioning of hplc-py), go for it.

From my reading of your initial issue, are you trying to set bounds on peak prominence? If so, that's not quite how deconvolve_peaks works. It applies bounds on the amplitude, location, scale, and skew (in that order). Given the chromatogram you showed, it seems like the amplitude bounds applied are not consistent with the scale of the data.

Are you passing the bounds you provided in the table as param_bounds in fit_peaks?
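
For reference, explicitly passing bounds would look something like this sketch (the file/column names and numeric values are placeholders; whether the bounds are absolute or relative to each peak's initial guess should be checked against the hplc-py docs):

```python
import pandas as pd
from hplc.quant import Chromatogram

# 'trace.csv' stands in for the raw chromatogram export.
chrom = Chromatogram(pd.read_csv('trace.csv'),
                     cols={'time': 'time', 'signal': 'signal'})

# Bound the four fit parameters named above: amplitude, location, scale, skew.
peaks = chrom.fit_peaks(
    prominence=0.05,
    param_bounds={'amplitude': [0.1, 10],
                  'location': [1.9, 4.4],   # retention-time window (min)
                  'scale': [0.03, 1.3],     # peak width range (min)
                  'skew': [-5, 5]})
```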

jonathanstathakis commented 8 months ago

Thank you, will do!

Not exactly: I was simply calling fit_peaks, and the bounds I presented earlier were those generated in deconvolve_peaks just before being passed to scipy.optimize.curve_fit. I believe I had prominence set quite high at that point to speed up runtime. I was also alternately receiving the bound error above or an error stating that p0 was out of bounds. But again, to generate this error I was passing an array of signals stacked end to end, and on revisiting the problem I am unable to recreate it.

What I am trying to achieve is a means of joining the peak table and the raw signal to get the intensities of the identified peaks for downstream analysis. The time precision calculation results in retention-time values with two significant figures, which is too coarse for me to use time as the join key. I've removed rounding throughout the process, but the estimated times still do not appear exactly within the original signal. Would you be able to suggest a solution to this predicament? I am no expert on the methods used by your wonderful package, so I'm learning as I go.

gchure commented 8 months ago

Got it, thanks. That's interesting that the periodic signal messed things up. I'd like to figure out why.

The time precision comes from whatever the sampling interval is in your raw chromatogram (see lines 83-84 here: https://github.com/cremerlab/hplc-py/blob/8ce5223f2c1a4accbcafb3f883b8a692fac94337/hplc/quant.py#L83C13-L83C13). In the chromatogram in your initial post, the time column looks like it is just the ordinal index (so your precision would be integer). Do you have a raw time for the trace, or just the indexes? Either way, this could be resolved through a PR where the number of significant figures is user-supplied.
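
Roughly, the idea behind those lines can be sketched as follows (a simplified illustration; the actual implementation may differ):

```python
import numpy as np

# Infer a decimal precision from the mean sampling interval of the time axis.
def infer_time_precision(time: np.ndarray) -> int:
    dt = np.mean(np.diff(time))
    if dt >= 1:
        return 0  # integer time steps -> integer precision
    return int(np.abs(np.floor(np.log10(dt))))

time = np.arange(0, 10, 2 / 60)       # 2 s sampling, expressed in minutes
print(infer_time_precision(time))     # -> 2 decimal places
```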

Also, note that when you run .fit_peaks, the resulting peak table should give you the parameters for each detected peak in the chromatogram. What would you need for your downstream analysis? If you can give more context there, I'll be able to provide more help.

gchure commented 8 months ago

As there's not an open issue here, I'm going to close this (for now), but I'm happy to keep the thread going.

jonathanstathakis commented 8 months ago

Thanks for keeping this going.

TBH, after digging into your code I'm not sure why the bug happened either; further testing showed that peak finding was simply restricted to the sample with the highest prominence, and business proceeded as usual. I'll try to recreate it somewhere down the track.

So the overall goal is to join the peak table with the source signal dataframe, primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table. From there I am looking to automate variance analyses between sample classes and multiclass ML modeling.

The simplest solution for me would be to output the index of each peak corresponding to its location in the input signal, alongside the retention time. Is that a reasonable expectation or does deconvolve_peaks do some trickery I haven't anticipated?

The displayed signal was a quick visualisation; the actual input has a time axis with a sampling interval of 2 seconds per observation. As you might be able to tell, I've already performed smoothing and ASLS baseline subtraction.

gchure commented 8 months ago

Can you elaborate on what you mean by this?

> primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table.

The peak table reports the best-fit values of the signals whose mixture reconstitutes the observed chromatogram. If you know which compounds correspond to which retention times, you can add that as a dictionary in the map_peaks method of the Chromatogram object. That would give you labels. Sorry if I'm not understanding something obvious here.
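
For example, the labeling step might look like this minimal sketch (the file name, column names, compounds, and retention times are all hypothetical, and the exact params schema should be checked against the docs):

```python
import pandas as pd
from hplc.quant import Chromatogram

# 'trace.csv' is a placeholder for a raw chromatogram export with
# 'time' and 'signal' columns.
chrom = Chromatogram(pd.read_csv('trace.csv'),
                     cols={'time': 'time', 'signal': 'signal'})
chrom.fit_peaks(correct_baseline=False)  # baseline already subtracted upstream

# Map known compounds to approximate retention times (hypothetical values).
chrom.map_peaks(params={'tartaric acid': {'retention_time': 2.6},
                        'malic acid': {'retention_time': 3.3}})
```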

If you've already done ASLS background subtraction, make sure you set correct_baseline=False when calling fit_peaks. This package uses SNIP baseline correction by default.

What was your rationale for using ASLS? If interested, that could be included as an alternative baseline correction algorithm.

jonathanstathakis commented 8 months ago

> Can you elaborate on what you mean by this?
>
> > primarily to validate that the peak fitting has behaved as I expected, but also to join label columns to the peak table.

For the first part, I intend to compare the results of an external peak-profiling software against the results of fit_peaks, for external validation. For the second, I mean to use the 'retention_time' columns of both the original signal and the peak table output by fit_peaks to join the original signal table and the peak table together, i.e. a SQL-style join. In that context the retention_time column acts as a primary key, with observation time as a unique identifier. The result would be a table with the raw signal in one column and all the columns from the peak table stretched out to form sparse columns, with values where the peak maxima fell in the original signal and empty values everywhere else.
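
In pandas terms, something like this toy sketch (column names and values are illustrative):

```python
import pandas as pd

# Stand-ins for the raw signal and the fit_peaks output.
signal = pd.DataFrame({'time': [2.5, 2.6, 2.7],
                       'signal': [150.0, 327.8, 160.0]})
peaks = pd.DataFrame({'retention_time': [2.6], 'amplitude': [327.8]})

# Left join on time: peak columns are populated only at peak-maximum rows.
# This only works if retention times appear exactly on the time axis,
# which is the crux of the precision problem described above.
joined = signal.merge(peaks, left_on='time',
                      right_on='retention_time', how='left')
print(joined)
```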

> If you've already done ASLS background subtraction, make sure you set correct_baseline=False when calling fit_peaks

Done

> What was your rationale for using ASLS?

Nothing beyond the fact that a fair amount of the literature I have seen uses Whittaker baselines for chromatographic signals, and it performed well enough without over- or underfitting. I used pybaselines, which provides more intricate approaches, but I thought it best to keep things simple since the result was acceptable. I am no expert in signal correction, though, and I had not heard of SNIP before.
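
For reference, my ASLS step looked roughly like this (a sketch with pybaselines on toy data; the lam and p values are illustrative and need tuning per signal):

```python
import numpy as np
from pybaselines import Baseline

# Toy chromatogram: one peak riding on a drifting baseline.
x = np.linspace(0, 30, 900)                    # 2 s sampling over 30 min
y = np.exp(-((x - 15) ** 2) / 0.1) + 0.01 * x

# ASLS (asymmetric least squares), a Whittaker-smoother baseline method.
fitter = Baseline(x_data=x)
baseline, _ = fitter.asls(y, lam=1e7, p=0.02)
corrected = y - baseline
```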

I am still in the process of understanding peak deconvolution, and I was wondering whether it is sensible to expect the raw-signal peak maximum time and the deconvolved peak maximum time to be equal, or whether there is some variation inherent to the process?

jonathanstathakis commented 8 months ago

> I am still in the process of understanding peak deconvolution, and I was wondering whether it is sensible to expect the raw-signal peak maximum time and the deconvolved peak maximum time to be equal, or whether there is some variation inherent to the process?

Never mind, I just discovered the very thorough "How it Works" pages! I see now that the peak time does change after deconvolution, or at least has the capacity to, so expecting the peak times to be the same is incorrect.

I am still wondering whether it's practical to expose the peak indices from the first find_peaks call to provide a link between the raw signal and the fitted peaks? Call it 'orig_peak_idx' or something.

gchure commented 8 months ago

Cool! Glad the "How it Works" pages were useful. You should be able to get the peak indices from a private attribute, chrom._peak_indices. That returns the indices of each peak on the cropped time domain.
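
For example (a sketch; the file and column names are placeholders, chrom.df as the cropped frame is an assumption, and private attributes may change between versions):

```python
import pandas as pd
from hplc.quant import Chromatogram

chrom = Chromatogram(pd.read_csv('trace.csv'),
                     cols={'time': 'time', 'signal': 'signal'})
chrom.fit_peaks()

# Indices of the detected peaks on the cropped time domain.
idx = chrom._peak_indices
raw_peak_rows = chrom.df.iloc[idx]  # rows of the trace at the raw peak maxima
```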

gchure commented 8 months ago

Also, if it is helpful, I have a preprint outlining the major functionality of this software. The short answer is that with skew-normal distributions (which is what hplc-py fits under the hood), the retention time (the location parameter) does not always coincide with the peak maximum; that depends on the skew parameter. This plot shows an amplitude-weighted skew-normal distribution with the same location and scale parameters but varying skew parameters (alpha). Note that the peak location shifts away from 0 under high skewness.

[image: amplitude-weighted skew-normal distributions with identical location and scale but varying skew parameter alpha]
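
The same effect can be checked numerically with scipy's skew-normal (an illustrative snippet, not hplc-py code):

```python
import numpy as np
from scipy.stats import skewnorm

# The mode of a skew-normal drifts away from its location parameter
# (loc=0 here) as the skew parameter alpha grows.
x = np.linspace(-3, 3, 1001)
for alpha in [0, 1, 5]:
    mode = x[np.argmax(skewnorm.pdf(x, alpha, loc=0, scale=1))]
    print(f"alpha={alpha}: mode at {mode:.2f}")
```
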
jonathanstathakis commented 8 months ago

Thank you for linking the preprint, much appreciated.

Unfortunately, I have encountered issues with the default bounds again; they appear to be peak-specific problems. The first issue was an initial guess for width lower than the lower bound (i.e. an initial width guess smaller than the time step). I rectified that by setting the lower bound to 10% less than the time step, but then I encountered an error with a later peak whose initial guess for the time range was larger than the upper bound for that same parameter.

I've been troubleshooting this for a while now and getting nowhere. Do you have any recommendations on how to proceed from here?

jonathanstathakis commented 8 months ago

Btw, this is the signal extracted from Chromatogram.window_df:

[image: signal extracted from Chromatogram.window_df]

jonathanstathakis commented 8 months ago

As a follow-up, here is more information about my current settings and the window and peak in question:

fit_peaks parameters

parameter          value
tbl_name           fit_peak_input_params
tolerance          0.5
prominence         0.01
rel_height         1
approx_peak_width  0.7
buffer             1
max_iter           1000000
precision          9

Window Parameters

   tbl_name     window_idx  num_peaks  signal_area
0  window_info  2           4          2735.48

Window Peak Fitting Parameters

    tbl_name              window_idx  peak_idx  location  amplitude  width      params     lb         guess      ub       delta_lb     delta_ub  is_oob
0   window_peak_fit_info  2           1         2.6       327.845    0.157637   amplitude  32.7845    327.845    3278.45  295.061      2950.61   0
1   window_peak_fit_info  2           1         2.6       327.845    0.157637   location   1.9        2.6        4.43333  0.7          1.83333   0
2   window_peak_fit_info  2           1         2.6       327.845    0.157637   scale      0.0333333  0.0788184  1.26667  0.0454851    1.18785   0
3   window_peak_fit_info  2           1         2.6       327.845    0.157637   skew       -inf       0          inf      inf          inf       0
4   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  amplitude  16.5211    165.211    1652.11  148.69       1486.9    0
5   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  location   1.9        2.76667    4.43333  0.866667     1.66667   0
6   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  scale      0.0333333  0.0301172  1.26667  -0.00321612  1.23655   1
7   window_peak_fit_info  2           2         2.76667   165.211    0.0602344  skew       -inf       0          inf      inf          inf       0
8   window_peak_fit_info  2           3         3.3       12.5149    0.169807   amplitude  1.25149    12.5149    125.149  11.2635      112.635   0
9   window_peak_fit_info  2           3         3.3       12.5149    0.169807   location   1.9        3.3        4.43333  1.4          1.13333   0
10  window_peak_fit_info  2           3         3.3       12.5149    0.169807   scale      0.0333333  0.0849036  1.26667  0.0515703    1.18176   0
11  window_peak_fit_info  2           3         3.3       12.5149    0.169807   skew       -inf       0          inf      inf          inf       0
12  window_peak_fit_info  2           4         3.8       23.4608    0.235757   amplitude  2.34608    23.4608    234.608  21.1147      211.147   0
13  window_peak_fit_info  2           4         3.8       23.4608    0.235757   location   1.9        3.8        4.43333  1.9          0.633333  0
14  window_peak_fit_info  2           4         3.8       23.4608    0.235757   scale      0.0333333  0.117879   1.26667  0.0845453    1.14879   0
15  window_peak_fit_info  2           4         3.8       23.4608    0.235757   skew       -inf       0          inf      inf          inf       0

Error Peak

Focusing on the erroneous peak:

   tbl_name              window_idx  peak_idx  location  amplitude  width      params  lb         guess      ub       delta_lb     delta_ub  is_oob
6  window_peak_fit_info  2           2         2.76667   165.211    0.0602344  scale   0.0333333  0.0301172  1.26667  -0.00321612  1.23655   1

As I said before, the initial guess for the peak's scale is less than the lower bound. The following is a visualisation of the window with the peaks and the calculated widths displayed:

[image: window 2 signal with detected peaks and calculated widths]

So it's fairly evident from the display that scipy.signal.peak_widths is not accurately gauging the width of peak 2, causing the guess to fall below the lower bound. I'll be continuing my investigation there, but I hope this information is useful to you. I would be keen to solve this so I can move on with my analysis.
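
For anyone following along, the failure mode can be sketched with two heavily overlapping synthetic peaks (toy values loosely mimicking peaks 1 and 2 above; an illustration, not hplc-py code):

```python
import numpy as np
import scipy.signal

# Two overlapping Gaussians: a broad peak at 2.60 and a narrow shoulder at 2.77.
x = np.linspace(0, 5, 500)
y = (327 * np.exp(-((x - 2.60) ** 2) / 0.05)
     + 165 * np.exp(-((x - 2.77) ** 2) / 0.002))

peaks, _ = scipy.signal.find_peaks(y, prominence=0.01 * y.max())
widths = scipy.signal.peak_widths(y, peaks, rel_height=0.5)[0]

# With overlapping peaks, the half-height widths are measured relative to
# each peak's prominence base, so they can land far from the true widths;
# a scale guess derived from them can then fall outside bounds tied to the
# sampling interval or window span.
print(widths * (x[1] - x[0]))  # widths in time units
```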