NCAR / ucomp-pipeline

Data processing pipeline for UCoMP
Other
6 stars 3 forks source link

Perform Gaussian fit if more than 3 wavelengths #259

Open mgalloy opened 7 months ago

mgalloy commented 7 months ago

Instead of always doing an analytical Gaussian "fit", the pipeline should do an actual Gaussian fit if there are more than 3 wavelengths in the file.

The code to perform the fit, per pixel, is:

fit = gaussfit(wavelengths, intensities, coefficients, nterms=4)

Questions

Tasks

bberkeyU commented 7 months ago

How did we handle fitting in comp? Was it always the analytical gauss?

GdT: Yes, the pipeline always did an analytic gaussian fit using the three center wavelength positions. This may not work now if we take positions very closely spaced, because the three center wavelengths are not independent.

bberkeyU commented 7 months ago

At least for the initial testing I think having the fits and chisq saved for each pixel would be helpful.

My naive starting point for the threshold would be picking a chisq (or other metic) value that passes a similar set of pixels as the pipeline currently passes with analytical gauss metic

detoma commented 7 months ago

I had already opened a ticket for this #253. A better way to handle the extra points is not to always make a least square fit if there are more the 3 wavelengths, but have an option if there are more than 3 wavelengths. There may be cases when we may want to revert to the analytic fit.

The new wavelength positions, if there are more then 3, may not be equally spaced, so this is something the pipeline needs to handle.

mgalloy commented 7 months ago

I am not sure what you mean about the pipeline handling the wavelengths not being equally spaced. The wavelengths will be input as into gaussfit — they could be anything as long as they are given to gaussfit.

mgalloy commented 7 months ago

I have a run in process.latest for 20240330. There are new extensions in the level 2 files that give parameters of the fit, by pixel:

...
 13  Fit coefficients           science  -----  -----  -----
 14  Fit chi-squared            science  -----  -----  --—-

They are arrays giving the coefficients and chi-squared arrays, by pixel:

IDL> help, coeffs, chisq
COEFFS          FLOAT     = Array[1280, 1024, 4]
CHISQ           FLOAT     = Array[1280, 1024]
detoma commented 7 months ago

I like to have an option in the pipeline to do the analytical gaussian fit or the least square fit for the cases with 5 or more wavelengths. I think the analytical gaussian fit assume the wavelengths are equally spaced but the 5 or more wavelength positions may not be equally spaces. We need to put some logic on how to select the proper 3 wavelengths for the analytical gaussian fit in case we have more than 3 points. I think the way to do it is to find the two closest wavelengths to the blue and red used for the waves. This are not necessarily the three center wavelengths.

In other words, we cannot simply choose the center 3 wavelengths for the analytical gaussian fit if we have more than three points.

mgalloy commented 7 months ago

OK, so the wavelengths only matter if we are using the analytic solution with more than 3 wavelengths? Are the following statements all true?

bberkeyU commented 7 months ago

The LOS velocity, Line width, and Noise mask are 0 (or NaNs) across the whole array in the L2 files in current process.latest run. Was this expected?

mgalloy commented 7 months ago

Yes, I haven't figured out how to actually calculate anything yet (see questions at the top). But there is chi-squared array for these fits that is interesting to look at in that run.

detoma commented 7 months ago

Mike, the statements above are correct but we need to add another option, i.e. a custom option where we specify the wavelengths to use in the least square gaussian fit. There are some wavelength position we may want to avoid because of contamination from other lines. This is complicated but with such a flexible instrument is to be expected. let's discuss.

bberkeyU commented 7 months ago

@mgalloy I agree chi-squared looks weird. I assumed it would better match the corona.

image image

mgalloy commented 7 months ago

Remember that peak intensity in that processing is definitely not correct yet.

bberkeyU commented 7 months ago

@mgalloy I agree, but to first order it should show were the corona is.

bberkeyU commented 7 months ago

For the fit masking, I think we can look at the first term of the fit.

If fit[0] > 0, we get a mask that matches reasonably well with the locations with good signals in the peak intensity plots.

bberkeyU commented 7 months ago

Can we pull the LOS and line widths directly from the Gaussian fit results fit[1] and fit[2], with appropriate corrections to convert from nm to km/s?

For 1074.7 LOS, I think this would look like:

LOS = (fit[1] - 1074.7) * 299792.458 / 1074.7

We need to perform the conversion for the line width, the IDL gauss routine reports in stdev, not FWHM, and so on.

line_width = (fit[2] - 1074.7) * 299792.458 / 1074.7 * 2.355
detoma commented 7 months ago

Ben, please let us work on this. I just sent an email. Yes, the conversion will be done.

detoma commented 7 months ago

There are several issues that need to be taken care of to allow the least-square fit to properly work in the automatic pipeline. The strategy Mike and I have agreed on is the following:

After all this is implemented, we need to verify if the peak intensity, velocity and FWHM look good or if the noise in the outer fov is too large. I worry about noise in the outer fov especially for the weaker lines. We may choose the 3-term fit, if more stable. It will be interesting to compare the peak intensity, velocity, and FWHM for the 3-term and 4-term fit.

Example of fits:

Screenshot 2024-04-16 at 6 31 08 PM

The white is the analytical fit, the blu the least square with 3 terms, and the red the least square with 4 terms. The 4 terms fit should provide a more realistic peak intensity and line width and a better fit in the wings but fails badly in some pixel, as shown above.

Screenshot 2024-04-16 at 6 50 50 PM

Another failed fit.

Screenshot 2024-04-16 at 6 40 16 PM Screenshot 2024-04-16 at 6 51 49 PM

Two examples where the least square fit worked well.

mgalloy commented 7 months ago

Doing the analytical "fit" takes about 2-5 seconds for a level 2 file. Doing the actual Gaussian fit takes 7-15 minutes per level 2 file.

detoma commented 7 months ago

Would rewrite this in C speed it up?

The 7-15 minutes are for a fitting all the pixels or only the pixels that pass the velocity criteria?

Criteria need to be revisited for 5-7 points data. I do not think we want to use negative values. The current criteria only apply to the three center points.

mgalloy commented 7 months ago

There are several things that we can do to speed it up:

  1. mask pixels so we don't have to fit all of them like it is doing now
  2. using the ESTIMATES keyword, which we planned to do anyway to get a better fit, would allow the fit to converge faster
  3. fitting individual pixels is an "embarrassingly parallel" operation, so we could run on multiple cores

Remember that the waves program is a lot of the files for a day and would not use the fit.

detoma commented 7 months ago

Below is an example when the least square gaussian fit uses the far red and blue points that are too low and find an unrealistic gaussian fits. The blue curve is a fit with 3 terms and the red curve a fit with 4 terms. They are both bad fits to the data. We need to exclude data points with intensity below a threshold (0.1 ppm??).

Screenshot 2024-04-18 at 1 38 45 PM

Below is the same plot excluding the first and last point which are below 0.1ppm. The 3 term fit is now good, the 4 terms fit is bad.

Screenshot 2024-04-18 at 1 48 29 PM

Below is the same plot excluding the first and last point and using the estimate from the analytical gaussian fit to help converging. The 3 term fit is good, the 4 terms is still bad.

Screenshot 2024-04-18 at 1 55 05 PM

Note: In the analytical fit I used the wavelengths closets to the nominal 3-points position and not the 3 center wavelengths.

Conclusions:

detoma commented 6 months ago

Joan and I looked at several profiles. The 4-term least square gaussian fit fails more often than the 3-term bit generally gives a better fit when the lines is not well centered and there are 5-points available.

We want the pipeline to have both options. We will test the 3-term and 4-term fit on two days when the line was well centered and when was not well centered and then decide which fit to use for the periods when UCoMP had large line shifts and we have 5-points.

The least square fit needs a guess or it is more likely to fail (possibly because it finds a secondary minimum). This is the code I use to call gauss fit:

gauss_lq_fit = gaussfit(wave, profile, aa, nterms=3, sigma=sigma, chisq=chisq,$
                        estimates=[peak_intensity, xpeak, line_width*sqrt(2.0)])

where peak intensity, xpeak, and line_width comes from the analytical Gaussian fit:

comp_analytic_gauss_fit, profile, d_lambda, doppler_shift, line_width, peak_intensity

xpeak can be computed as:

xpeak = lambda_zero + doppler_shift

where lambda_zero is the nominal center position.

We must request that all wavelength positions have intensity > 0.1 millions before calling GAUSSFIT. This condition must be added to the ones already in place for the analytical Gaussian fit.

We can discuss it further if there are questions on how implement this in the pipeline.

bberkeyU commented 6 months ago

How do we calculate lambda_zero is this 1074.7 for our favorite wavelength region?

detoma commented 6 months ago

We do not calculate lambda_zero. Lambda_zero is the nominal center wavelength and is our reference. The analytical gaussian fit returns doppler_shift which is the display in nm of the peak of the fitted gaussian from the nominal center wavelength position.

Because we are not always well centered on the line, the velocities are later renormalized by subtracting the "rest wavelength" computed by averaging the median velocity of the east and west coronal pixels above a certain threshold.

mgalloy commented 6 months ago

I think I have reasonable runs for the analytic Gaussian "fit", the 3-term fit, and the 4-term fit in:

mgalloy commented 6 months ago

I reprocessed the analytic-gaussian run to have the same distortion and be comparable to the 3-term and 4-term runs.

mgalloy commented 5 months ago

I processed 20220115 and 20220116 with both analytic, 3-term, and 4-term fits. Results in:

bberkeyU commented 5 months ago

Looking at 20220115.191933.ucomp.1074.l2.fts

I find the 3 and 4 term fits have identical LOS pixels values. But both have a ~ 30 km/s offset between the analytical and 3/4 point fits. Is this what we expected?

bberkeyU commented 5 months ago

The noise masking appears to fail for 20220115.191933 fits file. In the analytic-gaussian the noise mask values range from 0-1, instead of a binary good which I expected from the code. And the line width appears null in the 3/4 term fits so the mask is also null for these data.

image

detoma commented 5 months ago

I looked at a few random 1074 FITS images for the 3 test days and compared peak intensity, velocity and line width. The images in the directories process.3-term, and process.4-term are identical (min and max of the difference image is zero).

The rest wavelength is zero in the 3 and 4 term fits and it seems the difference between the analytic and the least square is basically the rest wavelength, not clear they are real differences beside that one.

This needs more work from MIke to debug the code.

mgalloy commented 5 months ago

Questions about the current runs:

bberkeyU commented 5 months ago

This doesn't add any new information from yesterday's meeting. Three versions of the fits seem identical.

I put together some Python code to look at the LOS value for all the pixels 2380 pixels in a ring at height =300 (pixels) for every 1074 5 point data reprocessed in the 'process.analytic-gaussian', 'process.4-term' processing directories.

After correcting the pixel values by adding the rest wavelength reported in the header. The 3 and 4-term fits are identical. The differences between the analytic and fit are very small, less than .0005 km/s. Most of this difference seems to be from a DC offset (maybe noise in the rest wavelength calculation) that changes per fits file. Within a single file, the differences between the fits are closer to 1e-5 km/s.

bberkeyU commented 5 months ago

@detoma @jburkepile This is just a reminder that on 2022-02-02, we moved the field lens to its correct location (closer to the lyot stop). This should have changed the background (I forget if it did). In today's meeting, we talked about adding some February dates to the Gaussian data checking; it would be interesting to compare the before-and-after field lens moves.

jburkepile commented 5 months ago

Thanks for checking the change to the field lens location Ben and adding it to the ticket. Looks like you have data before and after the field lens change on Feb 2.

It looks like the stray light begins to diminish on Feb 8 and Feb 9 and by Feb 10, 2022 the data look good. It would be good to have Mike run Gaussian fits (analytic, 3- and 4-parameter) for a sample of days. How about reprocessing: Feb 1, Feb 2, Feb 7, Feb 10, Feb 14, 2022.

Are 5 days sufficient?

mgalloy commented 5 months ago

So not Jan 15, 16, and 19? or 20240409?