Open etsmit opened 11 months ago
GBTIDL fitting code: /home/gbtidl/2.10.1/gbtidl/pro/toolbox/ortho_fit.pro
`;+ ; Function uses general orthogonal polynomial to do least squares ; fitting. ; ;
This code came from Tom Bania's GBT_IDL work. Local ; modifications include: ;
; fit = sum(m) a_m x^m ;; because of round-off using this form gives unreliable results ; above about order 15 even for double precision. See <a ; href="ortho_poly.html">ortho_poly for a better discussion on the ; contents of cfit. ; @param rms {out}{required} The rms error for each polynomial up to ; order nfit. ; ; @returns polyfit array: contains information for fitting to arbitrary ; points using the recursion relations ; ; @examples ; fit a 3rd order polynomial to the data in !g.s[0] ;
; yy = *(!g.s[0].data_ptr) ; xx = dindgen(n_elements(yy)) ; f = ortho_fit(xx, yy, 3, cfit, rms) ;; ; @version $Id$ ;-
function ortho_fit,xx,yy,nfit,cfit,rms compile_opt idl2
; argument checks
if (n_elements(xx) ne n_elements(yy)) then $
message, 'number of elements of xx is not equal to number of elements of yy'
if (nfit lt 0) then message, 'nfit must be >= 0'
n = n_elements(xx)
; initialize needed arrays
polyfit = dblarr(4,nfit+1)
rms = dblarr(nfit+1)
coef = dblarr(nfit+1)
; dblarr initializes these to 0.0, no need to do so here
c0 = coef
c1 = coef
c2 = coef
cfit = coef
fit = dblarr(n)
p0 = fit
p1 = fit
pnp1 = fit
pn = fit
pnm1 = fit
; copy values to ensure double precision
x = double(xx)
f = double(yy)
; get the first couple of polynomials and fit coefficients
p0[0:n-1] = 1.0
xnorm = sqrt(total(p0^2))
p0 = p0/xnorm
c0[0] = 1./xnorm
coef[0] = total(f*p0)
polyfit[0,0] = c0[0]
polyfit[3,0] = coef[0]
rms[0] = total( (f-coef[0]*p0)^2)
if (nfit eq 0) then begin
cfit = coef[0]*c0
return, polyfit
endif
a = total(x*p0)
p1 = x - a*p0
xnorm = sqrt(total(p1^2))
c1[1] = 1.0
p1 = p1/xnorm
c1 = (c1 - a*c0)/xnorm
; first couple of fit coefficients
coef[1] = total(f*p1)
polyfit[0:1,1] = c1[0:1]
polyfit[3,1] = coef[1]
cfit = coef[0]*c0 + coef[1]*c1
fit = coef[0]*p0 + coef[1]*p1
rms[1] = total( (f-fit)^2 )
; loop up the order, using general recursion relation
pnm1 = p0
pn = p1
cnm1 = c0
cn = c1
; print, 'cfit before loop', cfit
; in IDL, this loop is not entered if nfit < 2
for m=2,nfit do begin
a = -1./total(x*pn*pnm1)
b = -a*total(x*pn^2)
; print,'iteration ',m,a,b
pnp1 = (a*x + b)*pn + pnm1
xnorm = sqrt(total(pnp1^2))
pnp1 = pnp1/xnorm
coef[m] = total(f*pnp1)
ctmp = shift(cn,1)
ctmp[0] = 0.
cnp1 = (a*ctmp + b*cn + cnm1)/xnorm
cfit = cfit + coef[m]*cnp1
fit = fit + coef[m]*pnp1
polyfit[0,m] = 1./xnorm
polyfit[1,m] = b/xnorm
polyfit[2,m] = a/xnorm
polyfit[3,m] = coef[m]
rms[m] = total( (f-fit)^2 )
cnm1 = cn
cn = cnp1
pnm1 = pn
pn = pnp1
endfor
rms = sqrt( rms/double(n) )
return, polyfit
end`
I've verified that specutils is treating masked parts of the spectrum differently.
After recreating ortho_fit in my jupyter notebook, I can confirm that it produces the same outputs as gbtidl - for an unmasked spectrum.
I can produce a match between dysh and gbtidl for ABGT17A_404 scan 19 if there are no regions selected excluded in either - and the differences are no greater than 1e-9.
If I just use one region to fit (channels 0-750 selected in gbtidl, 751-819 excluded in dysh), it produces a match with no differences greater than 1e-4.
If I go back to using the full region selection before, the differences reach <1e-2, with <1e-3 in the selected regions of the spectrum. In this case, I'm selecting the (100, 380), (450,750) regions in gbtidl and excluding [(0,99),(379,449),(749,819)] in dysh. The difference between the spectra is functional, shown by the green line below.
Changing the dysh exclude regions by +/- 1 channel on either side (e.g. (381,451 instead of 379,449 and any permutation) doesn't result in any large difference, and certainly doesn't let us match gbtidl.
Also, the Polynomial1D errors cropping up above can be solved by using the LevMarLSQFitter or LMLSQFitter options.
The numpy polyfit function returns a better fit than any of the dysh options, but it is still not quite right. Using the function below, I get 0.6717 as the sum of the absolute difference between the numpy and gbtidl models, while the dysh options return 2.0868.
if include is not None:
popt = np.polyfit(x[include],y[include],deg)
else:
popt = np.polyfit(x,y,deg)
p = np.poly1d(popt)
fit = p(x)
if remove:
return y-fit,fit
else:
return fit
Here are the differences between the two models compared to the GBTIDL result, overlaid on top of the GBTIDL-fitted spectrum.
This is all very interesting as well as disturbing. Have you done the test where you create a spectrum with a known polynomial baseline added plus gaussian noise and see if both GBTIDL and dysh recover the polynomial? And repeat test with and without channel exclusion.
It's important to establish that GBTIDL is actually delivering the correct answer if that is going to be our reference.
It appears that I was using the wrong boundaries to fit in dysh compared to GBTIDL. After fixing this, it looks like numpy polyfit returns the same values as the python implementation of ortho_fit.pro and the GBTIDL result. I suggest using numpy polyfit as a baseline fitting tool.
Use numpy.polyfit for polynomial models, astropy modeling functions for the rest.
Double check that GBTIDL is accurate by using synthetic data. Same for dysh. Use a grid of models and exclusion/inclusion regions.
First issue: Near misses
Using 'legendre' baseline model option fits nearly the same baseline as gbtidl, with most deviation at the edges of the baseline. The attached plot shows there is a difference in at least the third order between Dysh legendre and gbtidl, shown by the red line. (This is example dataset AGBT17A404...)
Comparing the coefficients themselves is apples to oranges since dysh is using a Legendre model while gbtidl is using some form of basic polynomial sum(m) a_m x^m. Using the Polynomial1D model from the dysh side results in the second issue.
Second Issue: Polynomial1D fits are not working
A 2nd or 3rd order polynomial fit was applied to the data below, respectively. The 2nd order fit seemed to just find an average (indeed, all coefficients after the first are 0) and the 3rd order fit applied a straight, sloped line. Trying higher orders resulted in no change. The vertical black lines denote the regions of the fit.