GreenBankObservatory / dysh

https://dysh.readthedocs.io
Other
9 stars 2 forks source link

Polynomial1D model not fitting at all, and other models are not matching gbtidl #86

Open etsmit opened 11 months ago

etsmit commented 11 months ago

First issue: Near misses

Using 'legendre' baseline model option fits nearly the same baseline as gbtidl, with most deviation at the edges of the baseline. The attached plot shows there is a difference in at least the third order between Dysh legendre and gbtidl, shown by the red line. (This is example dataset AGBT17A404...)

bline_comparison_2

Comparing the coefficients themselves is apples to oranges since dysh is using a Legendre model while gbtidl is using some form of basic polynomial sum(m) a_m x^m. Using the Polynomial1D model from the dysh side results in the second issue.

Second Issue: Polynomial1D fits are not working

A 2nd or 3rd order polynomial fit was applied to the data below, respectively. The 2nd order fit seemed to just find an average (indeed, all coefficients after the first are 0) and the 3rd order fit applied a straight, sloped line. Trying higher orders resulted in no change. The vertical black lines denote the regions of the fit.

bline_comparison3

bline_comparison_5

etsmit commented 10 months ago

GBTIDL fitting code: /home/gbtidl/2.10.1/gbtidl/pro/toolbox/ortho_fit.pro

`;+ ; Function uses general orthogonal polynomial to do least squares ; fitting. ; ;

This code came from Tom Bania's GBT_IDL work. Local ; modifications include: ;

; ; @param xx {in}{required} The x-values to use in the fit. ; @param yy {in}{required} The data to be fit at xx. ; @param nfit {in}{required}{type=integer} The order of the polynomial to fit. ; @param cfit {out}{required} On return, cfit has coefficients of ; the polynomial ;
; fit = sum(m) a_m x^m
; 
; because of round-off using this form gives unreliable results ; above about order 15 even for double precision. See <a ; href="ortho_poly.html">ortho_poly for a better discussion on the ; contents of cfit. ; @param rms {out}{required} The rms error for each polynomial up to ; order nfit. ; ; @returns polyfit array: contains information for fitting to arbitrary ; points using the recursion relations ; ; @examples ; fit a 3rd order polynomial to the data in !g.s[0] ;
;    yy = *(!g.s[0].data_ptr)
;    xx = dindgen(n_elements(yy))
;    f = ortho_fit(xx, yy, 3, cfit, rms)
; 
; ; @version $Id$ ;-

function ortho_fit,xx,yy,nfit,cfit,rms compile_opt idl2

; argument checks
if (n_elements(xx) ne n_elements(yy)) then $
    message, 'number of elements of xx is not equal to number of elements of yy'

if (nfit lt 0) then message, 'nfit must be >= 0'

n = n_elements(xx)

; initialize needed arrays
polyfit = dblarr(4,nfit+1)
rms = dblarr(nfit+1)
coef = dblarr(nfit+1)
; dblarr initializes these to 0.0, no need to do so here
c0 = coef 
c1 = coef
c2 = coef
cfit = coef 

fit = dblarr(n)
p0 = fit
p1 = fit
pnp1 = fit 
pn = fit 
pnm1 = fit

; copy values to ensure double precision 
x = double(xx)
f = double(yy)

; get the first couple of polynomials and fit coefficients

p0[0:n-1] = 1.0
xnorm = sqrt(total(p0^2))
p0 = p0/xnorm
c0[0] = 1./xnorm
coef[0] = total(f*p0)
polyfit[0,0] = c0[0]
polyfit[3,0] = coef[0]
rms[0] = total( (f-coef[0]*p0)^2)

if (nfit eq 0) then begin
    cfit = coef[0]*c0
    return, polyfit
endif

a = total(x*p0)
p1 = x - a*p0
xnorm = sqrt(total(p1^2))
c1[1] = 1.0
p1 = p1/xnorm
c1 = (c1 - a*c0)/xnorm

; first couple of fit coefficients
coef[1] = total(f*p1)
polyfit[0:1,1] = c1[0:1]
polyfit[3,1] = coef[1]
cfit = coef[0]*c0 + coef[1]*c1
fit = coef[0]*p0 + coef[1]*p1
rms[1] = total( (f-fit)^2 )

; loop up the order, using general recursion relation
pnm1 = p0
pn = p1
cnm1 = c0
cn = c1
;  print, 'cfit before loop', cfit
; in IDL, this loop is not entered if nfit < 2
for m=2,nfit do begin
    a = -1./total(x*pn*pnm1)
    b = -a*total(x*pn^2)
    ;  print,'iteration ',m,a,b
    pnp1 = (a*x + b)*pn + pnm1

    xnorm = sqrt(total(pnp1^2))
    pnp1 = pnp1/xnorm
    coef[m] = total(f*pnp1)
    ctmp = shift(cn,1)
    ctmp[0] = 0.
    cnp1 = (a*ctmp + b*cn + cnm1)/xnorm
    cfit = cfit + coef[m]*cnp1
    fit = fit + coef[m]*pnp1
    polyfit[0,m] = 1./xnorm
    polyfit[1,m] = b/xnorm
    polyfit[2,m] = a/xnorm
    polyfit[3,m] = coef[m]
    rms[m] = total( (f-fit)^2 )
    cnm1 = cn
    cn = cnp1
    pnm1 = pn
    pn = pnp1
endfor

rms = sqrt( rms/double(n) )

return, polyfit

end`

etsmit commented 10 months ago

I've verified that specutils is treating masked parts of the spectrum differently.

After recreating ortho_fit in my jupyter notebook, I can confirm that it produces the same outputs as gbtidl - for an unmasked spectrum.

I can produce a match between dysh and gbtidl for ABGT17A_404 scan 19 if there are no regions selected excluded in either - and the differences are no greater than 1e-9.

If I just use one region to fit (channels 0-750 selected in gbtidl, 751-819 excluded in dysh), it produces a match with no differences greater than 1e-4.

If I go back to using the full region selection before, the differences reach <1e-2, with <1e-3 in the selected regions of the spectrum. In this case, I'm selecting the (100, 380), (450,750) regions in gbtidl and excluding [(0,99),(379,449),(749,819)] in dysh. The difference between the spectra is functional, shown by the green line below.

Pasted image 20231019112417

Changing the dysh exclude regions by +/- 1 channel on either side (e.g. (381,451 instead of 379,449 and any permutation) doesn't result in any large difference, and certainly doesn't let us match gbtidl.

Also, the Polynomial1D errors cropping up above can be solved by using the LevMarLSQFitter or LMLSQFitter options.

etsmit commented 7 months ago

The numpy polyfit function returns a better fit than any of the dysh options, but it is still not quite right. Using the function below, I get 0.6717 as the sum of the absolute difference between the numpy and gbtidl models, while the dysh options return 2.0868.

    if include is not None:
        popt = np.polyfit(x[include],y[include],deg)
    else:
        popt = np.polyfit(x,y,deg)
    p = np.poly1d(popt)
    fit = p(x)
    if remove:
        return y-fit,fit
    else:
        return fit

Here are the differences between the two models compared to the GBTIDL result, overlaid on top of the GBTIDL-fitted spectrum. image

mpound commented 7 months ago

This is all very interesting as well as disturbing. Have you done the test where you create a spectrum with a known polynomial baseline added plus gaussian noise and see if both GBTIDL and dysh recover the polynomial? And repeat test with and without channel exclusion.
It's important to establish that GBTIDL is actually delivering the correct answer if that is going to be our reference.

etsmit commented 7 months ago

It appears that I was using the wrong boundaries to fit in dysh compared to GBTIDL. After fixing this, it looks like numpy polyfit returns the same values as the python implementation of ortho_fit.pro and the GBTIDL result. I suggest using numpy polyfit as a baseline fitting tool.

astrofle commented 1 week ago

Use numpy.polyfit for polynomial models, astropy modeling functions for the rest.

astrofle commented 1 week ago

Double check that GBTIDL is accurate by using synthetic data. Same for dysh. Use a grid of models and exclusion/inclusion regions.