Weirdness with .remove_outliers method - Githubissues

OttoStruve / muler

A Python package for working with pipeline-produced spectra from IGRINS, HPF, and Keck NIRSPEC

https://muler.readthedocs.io

MIT License

15 stars 9 forks source link

Weirdness with .remove_outliers method #129

Open ericasaw opened 1 year ago

ericasaw commented 1 year ago

I had a code that worked a few commits ago, but now whenever I try to use the remove_outliers method for plotting I end up with the following assertion error:

AssertionError: The masked spectrum must have at least one pixel remaining

I've tried changing the threshold, but I end up with the same error every time. Maybe related to the issue I raised earlier (#128) since the error triggers in the apply_numpy_mask function in utilities.py?

gully commented 1 year ago

My hunch is that the rtell uncertainty is erroneous due to Kyle's recent changes. What do you get if you print spec.snr or spec.flux.value/spec.uncertainty.array?

ericasaw commented 1 year ago

for spec.snr it looks like the array is full of nan values and spec.uncertainty is a None type object (so the division gives an error)

ericasaw commented 1 year ago

Had some time to look into this issue this afternoon. The remove_outliers method computes a smooth spectrum using a gaussian process. If there are no uncertainties (which can happen if you reconstruct a spectrum from scratch ie. spec = IGRINSSpectrum(flux=fluxes, spectral_axis=wavelengths)) the smooth spectra method calculates a placeholder uncertainty array based on the median of the flux values.

The next part of this function is where things start to go wrong, but I'm not super familiar with gaussian processes or celerite2 to pinpoint exactly where the issue is. All I can tell is that a kernel is made, it computes the factorization of the GP covariance matrix, and because optimize_kernel is defaulted to false, that gp is used to predict the smooth flux values given the wavelengths. The error is arising because the mean model returns all nan values for the mean GP model so when the smoothed spectrum is used in the remove_outliers function all of the residuals are nan values which makes the mean absolute standard deviation nan too so the keep_indices mask becomes a mask where no indexes are True (therefore masking all of the pixels out of the spectra every time).

I've tried passing along the uncertainties when constructing a new IGRINSSpectrum object (as above) and not using the normalizing function before removing the outliers but neither seems to change the result. I also tried turning on the GP optimization, but it doesn't seem to make a difference. I'm kind of puzzled as to why this issue is only arising now since the GP must have worked in the past. I'm not sure what we are doing elsewhere in the code that would change the results here. @kfkaplan can you check if the remove_outliers method works with your current branch of muler? I'm starting to wonder if this is a local issue on my computer.

update: I tried reinstalling muler from source in a new clean environment and I run into the same issue so I am actually baffled here lol

kfkaplan commented 1 year ago

@ericasaw Can you send me the files you are running into these issues on? I will test it out on my computer and see if I run into the same problem.

kfkaplan commented 1 year ago

@ericasaw Thanks for the data. You are right this is occurring because because the Gaussian Process regression is failing. Specifically the line mean_model = opt_gp.predict(self.flux.value, t=self.wavelength.value) returns nans. I tried fiddling with the settings a bit and reading in different files but kept getting the same result. These data area not smoothly varying stellar spectra so maybe the fact the flux is near zero and there are a lot of emission lines might be throwing off the GP algorithm. I don't know enough about how it works to say. For now it might be easiest to not use outlier rejection, or write your own cut with a running median or something similar.

ericasaw commented 1 year ago

Ah! Thanks Kyle! I was afraid I had seriously messed up my environments for muler or something. I felt similarly about not knowing enough to fiddle deeply with the GP settings. For now I am just scaling using y-limits in plots (I don’t need to do any statistical things with remove_outliers for now—just making pretty plots). This should be next on the to-do list for fixes though since remove_outliers is a super useful method.

Still curious at what point this method stopped working, seems like a random thing to break. I checked the last push for the celerite2 package to see if there were any recent changes, but the last push was back in November.

On Feb 21, 2023, at 7:04 PM, kfkaplan @.***> wrote:

@ericasaw https://github.com/ericasaw Thanks for the data. You are right this is occurring because because the Gaussian Process regression is failing. Specifically the line mean_model = opt_gp.predict(self.flux.value, t=self.wavelength.value) returns nans. I tried fiddling with the settings a bit and reading in different files but kept getting the same result. These data area not smoothly varying stellar spectra so maybe the fact the flux is near zero and there are a lot of emission lines might be throwing off the GP algorithm. I don't know enough about how it works to say. For now it might be easiest to not use outlier rejection, or write your own signal-to-noise cut.

— Reply to this email directly, view it on GitHub https://github.com/OttoStruve/muler/issues/129#issuecomment-1439293000, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMC7LDP475ZPIIXTNTPYGILWYVQ2TANCNFSM6AAAAAAVCJ5FZU. You are receiving this because you were mentioned.

kfkaplan commented 1 year ago

I dug through the commits on echelle.py and couldn't find anything obvious. The current code for def smooth_spectrum is fairly similar to the initial time it was committed. It is possible it was nothing we did, and that some recent update to celerite2 might be what broke things here.

kfkaplan commented 1 year ago

Okay I actually tracked the issue down to commit 6c9893a on Oct. 29, 2021 when it broke. I tried restoring def smooth_spectrum from the commit before that one but I got the same error, so the cause is the flux or uncertainty being fed to smooth_spectrum and not the code for that definition itself. I'm not exactly sure what is going on here but will try to look more into it tomorrow when I have some time.

kfkaplan commented 1 year ago

Okay I have done some further investigation. The issue is that a new assert was catching the fact that the fitting algorithm in def smooth_spectrum (which is called by def remove_outliers) was failing, at least for most of the IGRINS spectra.

Specifically this occured at during commit 6c9893a when

assert mask.sum() > 0,The masked spectrum must have at least one pixel remaining"

was added to def apply_numpy_mask(spec, mask) in utilities.py which is called by def apply_boolean_mask which is in the return of def remove_outliers.

Previously when you thought it "worked," I think def smooth_spectrum was failing anyway but the error just wasn't getting caught because the assert mask.sum() >0 wasn't there yet to catch it. The real solution will be to figure out why the Gaussian Processes regression algorithm used in def smooth_spectrum is not fitting the IGRINS spectra. Until then, we should avoid running def remove_outliers for IGRINS data.