Perform Gaussian fit if more than 3 wavelengths

mgalloy commented 7 months ago

Instead of always doing an analytical Gaussian "fit", the pipeline should do an actual Gaussian fit if there are more than 3 wavelengths in the file.

The code to perform the fit, per pixel, is:

fit = gaussfit(wavelengths, intensities, coefficients, nterms=4)

Questions

[x] Should I save coefficients from the fit somewhere? That would require another extension in the file because it would be 4 coefficients per pixel of the image.
[ ] What is the convergence criteria? chi-squared less than some threshold?
[ ] How do you compute peak intensity, doppler shift, and line width from the Gaussian fit? I can look up what we did in CoMP and I assume it should be handled the same way.

Tasks

[x] Add config option for "fit" method: how many terms in Gaussian fit or analytic.
[ ] Mask spatially based on geometry
[ ] Mask on wavelengths based on thresholds and how many terms in the Gaussian fit — need 5 wavelengths to do 4-term fit, 4 wavelengths to do 3-term fit
[ ] Compute Gaussian fit with both 3 and 4 term fits and put both LOS velocity and line width as added extension.
[ ] Process 20220115, 20220116, 20220119, and 20240409 as tests.
[x] Return SIGMA, not CHISQ.

bberkeyU commented 7 months ago

How did we handle fitting in comp? Was it always the analytical gauss?

GdT: Yes, the pipeline always did an analytic gaussian fit using the three center wavelength positions. This may not work now if we take positions very closely spaced, because the three center wavelengths are not independent.

bberkeyU commented 7 months ago

At least for the initial testing I think having the fits and chisq saved for each pixel would be helpful.

My naive starting point for the threshold would be picking a chisq (or other metic) value that passes a similar set of pixels as the pipeline currently passes with analytical gauss metic

detoma commented 7 months ago

I had already opened a ticket for this #253. A better way to handle the extra points is not to always make a least square fit if there are more the 3 wavelengths, but have an option if there are more than 3 wavelengths. There may be cases when we may want to revert to the analytic fit.

The new wavelength positions, if there are more then 3, may not be equally spaced, so this is something the pipeline needs to handle.

mgalloy commented 7 months ago

I am not sure what you mean about the pipeline handling the wavelengths not being equally spaced. The wavelengths will be input as into gaussfit — they could be anything as long as they are given to gaussfit.

mgalloy commented 7 months ago

I have a run in process.latest for 20240330. There are new extensions in the level 2 files that give parameters of the fit, by pixel:

...
 13  Fit coefficients           science  -----  -----  -----
 14  Fit chi-squared            science  -----  -----  --—-

They are arrays giving the coefficients and chi-squared arrays, by pixel:

IDL> help, coeffs, chisq
COEFFS          FLOAT     = Array[1280, 1024, 4]
CHISQ           FLOAT     = Array[1280, 1024]

detoma commented 7 months ago

I like to have an option in the pipeline to do the analytical gaussian fit or the least square fit for the cases with 5 or more wavelengths. I think the analytical gaussian fit assume the wavelengths are equally spaced but the 5 or more wavelength positions may not be equally spaces. We need to put some logic on how to select the proper 3 wavelengths for the analytical gaussian fit in case we have more than 3 points. I think the way to do it is to find the two closest wavelengths to the blue and red used for the waves. This are not necessarily the three center wavelengths.

In other words, we cannot simply choose the center 3 wavelengths for the analytical gaussian fit if we have more than three points.

mgalloy commented 7 months ago

OK, so the wavelengths only matter if we are using the analytic solution with more than 3 wavelengths? Are the following statements all true?

If there are only 3 wavelengths, we obviously do an analytic solution with them.
If there are more than 3 wavelengths and we have the fit option set, we use all the wavelengths in an actual Gaussian fit.
If there are more than 3 wavelengths and we have the fit option not set, we have some scheme to pick which 3 wavelengths to use in the analytic solution.

bberkeyU commented 7 months ago

The LOS velocity, Line width, and Noise mask are 0 (or NaNs) across the whole array in the L2 files in current process.latest run. Was this expected?

mgalloy commented 7 months ago

Yes, I haven't figured out how to actually calculate anything yet (see questions at the top). But there is chi-squared array for these fits that is interesting to look at in that run.

detoma commented 7 months ago

Mike, the statements above are correct but we need to add another option, i.e. a custom option where we specify the wavelengths to use in the least square gaussian fit. There are some wavelength position we may want to avoid because of contamination from other lines. This is complicated but with such a flexible instrument is to be expected. let's discuss.

bberkeyU commented 7 months ago

@mgalloy I agree chi-squared looks weird. I assumed it would better match the corona.

mgalloy commented 7 months ago

Remember that peak intensity in that processing is definitely not correct yet.

bberkeyU commented 7 months ago

@mgalloy I agree, but to first order it should show were the corona is.

bberkeyU commented 7 months ago

For the fit masking, I think we can look at the first term of the fit.

If fit[0] > 0, we get a mask that matches reasonably well with the locations with good signals in the peak intensity plots.

bberkeyU commented 7 months ago

Can we pull the LOS and line widths directly from the Gaussian fit results fit[1] and fit[2], with appropriate corrections to convert from nm to km/s?

For 1074.7 LOS, I think this would look like:

LOS = (fit[1] - 1074.7) * 299792.458 / 1074.7

We need to perform the conversion for the line width, the IDL gauss routine reports in stdev, not FWHM, and so on.

line_width = (fit[2] - 1074.7) * 299792.458 / 1074.7 * 2.355

detoma commented 7 months ago

Ben, please let us work on this. I just sent an email. Yes, the conversion will be done.

detoma commented 7 months ago

There are several issues that need to be taken care of to allow the least-square fit to properly work in the automatic pipeline. The strategy Mike and I have agreed on is the following:

The analytical gaussian fit will be performed for all lines. For the lines that have only three points, this will be the end of the processing and nothing will change compared to before. For the lines that have more than three points, we will identify the wavelength closer to the "nominal" ones used in the three points observations and then perform the analytic gaussian fit on those three points. This is because using the center three points does not always work well, especially if we oversample the line and the three center points are very close together.
The analytic gaussian fit will be used as an initial guess for the least square gaussian fit with 4 terms. This is because the least square fit does not always give good fits, especially in pixels with low signal. We found unrealistic gaussian profiles see examples below). To further avoid failing of the least square fitting procedure, we need to consider how to select the points for the fit, e.g. eliminating points below a certain threshold or negative. Also, the chi square is not normalized and results in very low values in the outer for where the intensity is low, even if the fit is poor. Mike will normalize the chi square so we can use it as an indicator of how good the fit really is
There will be a second "custom" option for lines with more than three points, where we specify the wavelengths to use in the fit. This is because there may be observing programs where we want to esclude some wavelength positions that are not reliable.
Finally, since the analytical gaussian fit equation is different from the one used in IDL gaussfit, we need to make sure that all quantities are consistent and we initialize the fit with the proper coefficients. The final products will be the same, regardless of the fitting technique: peak intensity, FWHM, and doppler velocity.

After all this is implemented, we need to verify if the peak intensity, velocity and FWHM look good or if the noise in the outer fov is too large. I worry about noise in the outer fov especially for the weaker lines. We may choose the 3-term fit, if more stable. It will be interesting to compare the peak intensity, velocity, and FWHM for the 3-term and 4-term fit.

Example of fits:

The white is the analytical fit, the blu the least square with 3 terms, and the red the least square with 4 terms. The 4 terms fit should provide a more realistic peak intensity and line width and a better fit in the wings but fails badly in some pixel, as shown above.

Another failed fit.

Two examples where the least square fit worked well.

mgalloy commented 7 months ago

Doing the analytical "fit" takes about 2-5 seconds for a level 2 file. Doing the actual Gaussian fit takes 7-15 minutes per level 2 file.

detoma commented 7 months ago

Would rewrite this in C speed it up?

The 7-15 minutes are for a fitting all the pixels or only the pixels that pass the velocity criteria?

Criteria need to be revisited for 5-7 points data. I do not think we want to use negative values. The current criteria only apply to the three center points.

mgalloy commented 7 months ago

There are several things that we can do to speed it up:

mask pixels so we don't have to fit all of them like it is doing now
using the ESTIMATES keyword, which we planned to do anyway to get a better fit, would allow the fit to converge faster
fitting individual pixels is an "embarrassingly parallel" operation, so we could run on multiple cores

Remember that the waves program is a lot of the files for a day and would not use the fit.

detoma commented 7 months ago

Below is an example when the least square gaussian fit uses the far red and blue points that are too low and find an unrealistic gaussian fits. The blue curve is a fit with 3 terms and the red curve a fit with 4 terms. They are both bad fits to the data. We need to exclude data points with intensity below a threshold (0.1 ppm??).

Below is the same plot excluding the first and last point which are below 0.1ppm. The 3 term fit is now good, the 4 terms fit is bad.

Below is the same plot excluding the first and last point and using the estimate from the analytical gaussian fit to help converging. The 3 term fit is good, the 4 terms is still bad.

Note: In the analytical fit I used the wavelengths closets to the nominal 3-points position and not the 3 center wavelengths.

Conclusions:

We need to exclude points with intensity below 0.1ppm. This means not all pixels will use the same number of wavelengths to compute the gaussian, which is not ideal.
Adding the keyword "estimates" with values from the analytic fit works well and speed up the fit.
We may not be able to use the 4-term fit that includes a background term.

detoma commented 6 months ago

Joan and I looked at several profiles. The 4-term least square gaussian fit fails more often than the 3-term bit generally gives a better fit when the lines is not well centered and there are 5-points available.

We want the pipeline to have both options. We will test the 3-term and 4-term fit on two days when the line was well centered and when was not well centered and then decide which fit to use for the periods when UCoMP had large line shifts and we have 5-points.

The least square fit needs a guess or it is more likely to fail (possibly because it finds a secondary minimum). This is the code I use to call gauss fit:

gauss_lq_fit = gaussfit(wave, profile, aa, nterms=3, sigma=sigma, chisq=chisq,$
                        estimates=[peak_intensity, xpeak, line_width*sqrt(2.0)])

where peak intensity, xpeak, and line_width comes from the analytical Gaussian fit:

comp_analytic_gauss_fit, profile, d_lambda, doppler_shift, line_width, peak_intensity

xpeak can be computed as:

xpeak = lambda_zero + doppler_shift

where lambda_zero is the nominal center position.

We must request that all wavelength positions have intensity > 0.1 millions before calling GAUSSFIT. This condition must be added to the ones already in place for the analytical Gaussian fit.

We can discuss it further if there are questions on how implement this in the pipeline.

bberkeyU commented 6 months ago

How do we calculate lambda_zero is this 1074.7 for our favorite wavelength region?

detoma commented 6 months ago

We do not calculate lambda_zero. Lambda_zero is the nominal center wavelength and is our reference. The analytical gaussian fit returns doppler_shift which is the display in nm of the peak of the fitted gaussian from the nominal center wavelength position.

Because we are not always well centered on the line, the velocities are later renormalized by subtracting the "rest wavelength" computed by averaging the median velocity of the east and west coronal pixels above a certain threshold.

mgalloy commented 6 months ago

I think I have reasonable runs for the analytic Gaussian "fit", the 3-term fit, and the 4-term fit in:

process.analytic-gaussian
process.3-term
process.4-term

mgalloy commented 6 months ago

I reprocessed the analytic-gaussian run to have the same distortion and be comparable to the 3-term and 4-term runs.

mgalloy commented 5 months ago

I processed 20220115 and 20220116 with both analytic, 3-term, and 4-term fits. Results in:

process.analytic-gaussian
process.3-term
process.4-term

bberkeyU commented 5 months ago

Looking at 20220115.191933.ucomp.1074.l2.fts

I find the 3 and 4 term fits have identical LOS pixels values. But both have a ~ 30 km/s offset between the analytical and 3/4 point fits. Is this what we expected?

bberkeyU commented 5 months ago

The noise masking appears to fail for 20220115.191933 fits file. In the analytic-gaussian the noise mask values range from 0-1, instead of a binary good which I expected from the code. And the line width appears null in the 3/4 term fits so the mask is also null for these data.

detoma commented 5 months ago

I looked at a few random 1074 FITS images for the 3 test days and compared peak intensity, velocity and line width. The images in the directories process.3-term, and process.4-term are identical (min and max of the difference image is zero).

The rest wavelength is zero in the 3 and 4 term fits and it seems the difference between the analytic and the least square is basically the rest wavelength, not clear they are real differences beside that one.

This needs more work from MIke to debug the code.

mgalloy commented 5 months ago

Questions about the current runs:

[ ] Confirm settings, especially for the 2022 runs (are the fits really being done, or is it falling back to analytic Gaussian "fit").
[ ] How is RSTWVL computed? why 0.0 so much? or -33?
[ ] Noise mask for analytic-gaussian had values between 0 and 1.
- for example, 20220115.191933
- I don't see this, not sure of what the discrepancy is.
[ ] Are 3- and 4-term fits identical except mask? why?

bberkeyU commented 5 months ago

This doesn't add any new information from yesterday's meeting. Three versions of the fits seem identical.

I put together some Python code to look at the LOS value for all the pixels 2380 pixels in a ring at height =300 (pixels) for every 1074 5 point data reprocessed in the 'process.analytic-gaussian', 'process.4-term' processing directories.

After correcting the pixel values by adding the rest wavelength reported in the header. The 3 and 4-term fits are identical. The differences between the analytic and fit are very small, less than .0005 km/s. Most of this difference seems to be from a DC offset (maybe noise in the rest wavelength calculation) that changes per fits file. Within a single file, the differences between the fits are closer to 1e-5 km/s.

bberkeyU commented 5 months ago

@detoma @jburkepile This is just a reminder that on 2022-02-02, we moved the field lens to its correct location (closer to the lyot stop). This should have changed the background (I forget if it did). In today's meeting, we talked about adding some February dates to the Gaussian data checking; it would be interesting to compare the before-and-after field lens moves.

jburkepile commented 5 months ago

Thanks for checking the change to the field lens location Ben and adding it to the ticket. Looks like you have data before and after the field lens change on Feb 2.

It looks like the stray light begins to diminish on Feb 8 and Feb 9 and by Feb 10, 2022 the data look good. It would be good to have Mike run Gaussian fits (analytic, 3- and 4-parameter) for a sample of days. How about reprocessing: Feb 1, Feb 2, Feb 7, Feb 10, Feb 14, 2022.

Are 5 days sufficient?

mgalloy commented 5 months ago

So not Jan 15, 16, and 19? or 20240409?

NCAR / ucomp-pipeline

Perform Gaussian fit if more than 3 wavelengths #259

Questions

Tasks