Fibers with possible TPCORR issues

sbailey commented 2 months ago

Issue #2283 identified fibers 1098, 3994, 4720, 4891 with problematic redshifts, possibly due to bad/unstable TPCORR issues (thanks to @rongpu and @julienguy for debugging those).

1098, 3994, 4720 ha ve a mismatched TPCORR between the cameras; that might be real or might be evidence of outliers pulling a bad fit. Investigate.

1098:

3994:

4720:

4891 was a known case with unstable TPCORR vs. time; from @schlafly :

Related: Issue #2304 is about how to mask these; this issue is about how to fix/improve the underlying problem if possible.

I'm assigning this to Kibo, but the schedule-driven fallback option would be to mask them even if we can't improve them.

schlafly commented 2 months ago

Some investigation of these fibers...

1098 looks to have fine TPCORR fits, though the B band has especially noisy measurements. The model is down by 4% in B, but B is too noisy to say whether that's actually a good fit. c.f. 1099: which is much less noisy in B and has an interesting change in behavior of ~1% around EXPID 100000.

I don't think there's an easy fix to TPCORR here.

3994 does look to just have a problematic mean in the r band, as well as to just be wrong. Fitting a new set of models to that fiber would probably just fix it.

4720 is off by ~3% in the mean in the r band; I don't really know why. It also shows a significant change in behavior around EXPID ~ 100000.

4891 is a mess and I don't think changes to TPCORR will help.

For anyone interested in digging in further, take a look at: https://data.desi.lbl.gov/desi/users/schlafly/tpcorr-jura.fits The plotting code is in ~schlafly/desi/misc/py/skyplot.py:plot_tpcorr.

I think it would be technically straightforward to refit these. I would hazard, but have not checked, that the change in behavior around EXPID ~ 100000 is from the time when we started homing the fibers at the end of the night, so that they have a consistent (x, y) position when used to measure the flat---that's at least a special time that I remember, though I'm surprised that the noise doesn't change around then. I think it's not technically too demanding to do a new set of fits if we wanted that.

schlafly commented 2 months ago

I put a new set of tpcorr files using all data in jura at: ls /global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-jura-desi-spectro-calib/spec/ I note that the tpcorr param code zeros the fibers in the region of r4 that once had bad CTE; we should think of a better way to implement that.

https://github.com/desihub/desispec/blob/main/py/desispec/tpcorrparam.py#L225-L227

I have not stared at these meaningfully. The one test I have done is to compare the means for r7: One can see that with the exception of one large outlier, there is very good correlation. The one outlier is fiber 3994, consistent with expectations from the above plots. It shifts in value by 9%, which is not quite what I would have eyeballed but is clearly in the ballpark.

schlafly commented 2 months ago

Playing the same game with r9, things again are nicely correlated. The two outliers are 4891 and 4720, both from this ticket. I haven't checked 1098 since it is less obviously problematic to me from the tpcorr measurements.

schlafly commented 2 months ago

I need to leave this thread, but some interesting plots. Here's the TPCORR residuals in the z band. c.f. the model

Note that these are model and residuals after the model has been subtracted, so we are happy that the scale of the model is larger than the residuals---the model is removing a lot of real signal.

The residuals have some interesting structure... The first band ending at 750 corresponds to expid 97963. That indeed corresponds to the summer 2021 electronics shutdown, and I think when we started routinely homing the fibers. The band starting at 6013 corresponds to 2023/10/2, which @rongpu will remember from https://github.com/desihub/desispec/issues/2172. Similarly the band starting at 6618 corresponds to 2023/12/03, which also corresponds to a white spot misalignment time period; the structure out to ~6950 is all in December 2023 while the white spot alignment was iffy.

There is continued structure more recently, which likely justifies doing a new set of PCAs like I provided above, though I have no idea what's actually varying here!

schlafly commented 2 months ago

I spent a little longer here. If one wants to give the model enough flexibility to handle the problematic dome placement nights, that means going out to 6 PCA components (we currently use 2). It's not obvious we care, but it looks like this leads to ~1% sky errors. That's probably more than we want to change for kibo.

julienguy commented 1 month ago

I am comparing the TPCORR values in different bands, looking at the jura files in /global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-jura-desi-spectro-calib/spec/sm*/tpcorrparam-sm*-*.fits .

The first plot is the correlation between Z and R. The second between B and R.

We see a very tight correlation for most parameters except for the PCA parameters between B and R. I suspect it is because of noise in B.
The slopes for the SPATIAL coefficients PARAM[1] and PARAM[2] are not 1. We see a strong deviation between B and R, and more modest between R and Z. The change of slope is monotonous with wavelength. This is intriguing ...

tpcorr-z-vs-r

tpcorr-b-vs-r

schlafly commented 1 month ago

FWIW, param[1] and param[2] are just the coefficients on the terms proportional to x & y in the modeling within the patrol disk. I think naively I would have guessed that they are either all scaled versions of one another, but different dependences of the different parameters seems extra complicated to me.

schlafly commented 1 month ago

And FWIW, here's fiber 1098 in B near the bright sky line. I don't know what the deal is with the bad column there (more CTE?) but it's presumably the problem.

julienguy commented 1 month ago

I looks like a charge trap in the active region of the CCD affecting the parallel transfer. We had found a bunch of those for Y1 with Rongpu. We were able to find the first pixel affected by looking at the preprocessed dome flat images. I will see if we can do the same here.

schlafly commented 1 month ago

I put 2 and 6 component versions of the tpcorr files with the associated DESI_SPECTRO_CALIB files here: https://data.desi.lbl.gov/desi/users/schlafly/tpcorr-desi-spectro-calib/2comp/ https://data.desi.lbl.gov/desi/users/schlafly/tpcorr-desi-spectro-calib/6comp/ i.e., one should be able to do export DESI_SPECTRO_CALIB=$DESI_ROOT/users/schlafly/tpcorr-desi-spectro-calib/2comp and run a product.

forero commented 1 month ago

@schlafly @julienguy I will have time tomorrow (Wed 14.08.24) to work on this. Please let me know which scripts and procedures I should run to diagnose the new results for the 2cmp vs 6comp products. Thanks!

schlafly commented 1 month ago

I spent a little bit today looking at this; there is some cleanup we can still do. Here is the kind of plot I made:

https://data.desi.lbl.gov/desi/users/schlafly/tpcorr-new-vs-old.pdf https://data.desi.lbl.gov/desi/users/schlafly/tpcorr-new-vs-old-spatial.pdf https://data.desi.lbl.gov/desi/users/schlafly/tpcorr-new-vs-old-pca.pdf

The correlations in the means (first set of plots) are quite tight and every discrepancy means something. A few of them are in the ticket above, and others are also important. Initial versions of these plots showed discrepancies in the problematic region of z5 (near old hot column) and r4 (in old bad CTE) that I've tried to repair in newer versions of these files. I've also reset fiber 1098 to have no TPCORR fits in the B band because the sky line lands near some band parallel CTE.

pdf = PdfPages('/global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-new-vs-old-pca.pdf')
for fn0 in fn:
    new = fits.getdata(fn0, 'PCA')
    old = fits.getdata('/global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/'+fn0, 'PCA')
    clf()
    for i in range(2):
        subplot(2, 1, i+1)
        newvec = np.linalg.inv(new.dot(new.T)).dot(new.dot(old[i])).dot(new)
        plot(old[i]) ; plot(newvec)
        title('PCA ' + str(i))
        xlabel('fiber')
        ylabel('coeff')
    suptitle(fn0)
    pdf.savefig()
pdf.close()

pdf = PdfPages('/global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-new-vs-old-spatial.pdf')
for fn0 in fn:
    new = fits.getdata(fn0, 'SPATIAL')
    old = fits.getdata('/global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/'+fn0, 'SPATIAL')
    clf()
    for i in range(1, 10):
        subplot(3, 3, i)
        plot(old['PARAM'][:, i], new['PARAM'][:, i], '+')
        title('PARAM ' + str(i))
        xlabel('old')
        ylabel('new')
    suptitle(fn0)
    pdf.savefig()

pdf = PdfPages('/global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-new-vs-old.pdf')
for fn0 in fn:
    new = fits.getdata(fn0, 'MEAN')
    old = fits.getdata('/global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/'+fn0, 'MEAN')
    clf()
    plot(old, new, '+')
    title(fn0)
    xlabel('old')
    ylabel('new')
    plot([0.95, 1.05], [0.95, 1.05])
    pdf.savefig()
pdf.close()

schlafly commented 1 month ago

And FYI, the code to generate the tpcorr files is here: https://github.com/desihub/desispec/pull/2318 You can create new ones by doing something like

tpcorr = fits.getdata('users/schlafly/tpcorr-jura.fits')
tpcorrparam.make_tpcorrparam_files(tpcorr=tpcorr, dirname='/global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-jura-desi-spectro-calib-2comp-new/', npca=2)

schlafly commented 1 month ago

These all look good to me. I did create a new directory as part of doing this that tries to handle things like r4 and z5 (which once had lots of bad fibers) in a better way. That is /global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-jura-desi-spectro-calib-2comp-new/ I would recommend we use that for kibo.

I did not spot check all of the outliers in the spatial plots. There is an outstanding ~failure mode there where a fiber becomes stuck during year 2 and so there is a good number of measurements throughout the fiber patrol disk and then thousands of measurements at one specific point. Some fibers also have slow drifts in TPCORR that are intended to be soaked up by the PCA fit (and which I do not physically understand). The spatial modeling proceeds the PCA modeling, and so the spatial modeling can end up seeing that the particular region where the fiber got stuck has a significantly different TPCORR than the rest of the patrol disk. This can lead to some weird torquing of the fit to accommodate the weird, heavily weighted position.

Likely we should exclude places where the fiber was stuck from the fits, or figure out a scheme to deweight them (no particular region of the patrol disk gets more weight than X). The code doesn't do that at present. The cases I have looked at have led to slightly weird spatial models but not crazy ones---i.e., these fibers won't have great TPCORR but also won't have horrible ones.

forero commented 1 month ago

I've been comparing TPCORR values between tpcorr6 and tpcorr2, looking at 10 out of 220 exposures so far. I made a graph to show the differences for cameras B, R, and Z, focusing on big differences that were more than 5 sigma around the mean.

tpcorr_difference_plot

Here's a summary of what I found:

Outlier Summary

Camera	Total Outliers	Unique Outlier Fibers
B	85	55
R	72	57
Z	111	29

In total, I found 127 unique fibers with big differences across all cameras.

What really stands out to me are these things:

There's only one fiber (number 2278) that shows big differences in all three cameras. I'm curious about why this one is so special.
Camera Z has the biggest differences:
- The largest difference I saw was 0.1384 for Fiber 2679.
- I noticed a group of fibers (from 2670 to 2685) that keep showing large differences. I wonder if there's a reason for this cluster.
I looked at the top 5 outliers for each camera, and here's what I found:
- In Camera B, the biggest difference was -0.0314 for Fiber 2278.
- For Camera R, Fiber 1417 had the largest difference at -0.0253.
- Camera Z really stood out, with Fiber 2679 showing a huge difference of 0.1384.

Should we expect such big differences in Camera Z, especially for those fibers between 2670 and 2685? And what's so special about Fiber 2278 that it shows up in all cameras?

My next steps are to look at the other 190 exposures and see if I find the same patterns. I also want to check if these differences change the redshift results.

sbailey commented 1 month ago

Thanks @forero @schlafly .

@forero please provide a snippet of code defining "TPCORR Difference". e.g. are you evaluating the TPCORR when the fiber is at the center of its patrol region?

@forero please also do a similar study comparing the current 2-parameter model in /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/ to the updated model in /global/cfs/cdirs/desi/users/schlafly/tpcorr-desi-spectro-calib/2comp/ . In that case, we expect very tight agreement for nearly all fibers, except for ones that have a known TPCORR-related problem, e.g. 1098, 3994, 4720, 4891.

schlafly commented 1 month ago

I suspect these are fibers in the old region on the old z5 that had variable contamination from the hot column. I did try to zero out the contribution of this area to TPCORR in /global/homes/s/schlafly/desiproject/users/schlafly/tpcorr-jura-desi-spectro-calib-2comp-new/ . I don't actually know if these fibers are masked in practice.

forero commented 1 month ago

@sbailey Here's how I gather the data for each specprod:

# Set environment variable
os.environ['SPECPROD'] = specprod

# Define the exposures table
exp_file = os.path.join(os.environ.get('DESI_SPECTRO_REDUX'), specprod, f'exposures-{specprod}.csv')

# Read exposures
exps = Table.read(exp_file)

# Gather the data
tpcorrdata = desispec.tpcorrparam.gather_tpcorr(exps)

And then for every exposure in the two tpcorrdata :

for i_rec in range(n_records):
    data_A = tpcorrdata_specprod_A[i_rec]
    data_B = tpcorrdata_specprod_B[i_rec]

    diff = data_B['TPCORR'] - data_A['TPCORR']

I don't know whether that TPCORR read by desispec is computed at the center of its patrol region (but i guess it's computed at the actual fiberassignment location?)

forero commented 1 month ago

I compared 'MEAN' values in the tpcorrparam--.fits files between two models:

Model A: /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/
Model B: /global/cfs/cdirs/desi/users/schlafly/tpcorr-desi-spectro-calib/2comp/

I found extreme differences (>5 sigma) in the 'MEAN' values for 49 fibers. Here's the list of fiber IDs showing these extreme differences:

[817, 1063, 1098, 1935, 2171, 2236, 2250, 2251, 2252, 2253, 2254, 2257, 2259, 2260, 2261, 2262, 2263, 2265, 2267, 2271, 2478, 2628, 2668, 2671, 2672, 2674, 2675, 2676, 2677, 2678, 2679, 2680, 2681, 2682, 2683, 2684, 3065, 3518, 3546, 3754, 3799, 3974, 3987, 3994, 4334, 4428, 4494, 4720, 4891]

This list includes the fibers you mentioned: 1098, 3994, 4720, and 4891. But there are also two clusters of consecutive fibers showing extreme differences: one from 2250-2271 and another from 2671-2684 (the one I saw in the reduced data).

schlafly commented 1 month ago

Those clumps are the region on r4 (used to have bad CTE) and z5 (used to have a bad hot column). I think these are fine, but we should use the newer version of these files where I treat those regions a little differently.

forero commented 1 month ago

tpcorr_differences(4)

sbailey commented 1 month ago

I have created a directory /global/cfs/cdirs/desi/spectro/desi_spectro_calib/kibotest which is desi_spectro_calib/trunk updated with @schlafly's latest 2-parameter files from /global/cfs/cdirs/desi/users/schlafly/tpcorr-jura-desi-spectro-calib-2comp-new/spec.

@forero thanks for the plots, but unfortunately I had missed in the history that Eddie had a newer 2-parameter model to propose and I pointed you at the older one. From talking with Eddie about his checks and yours, I think everything is ok, but if it is easy for you to remake the plots with /global/cfs/cdirs/desi/spectro/desi_spectro_calib/kibotest that would be a nice final check. Feature creep: it would also be nice to fix the y-range of all figures to be the same to make it easier to see what are big outliers vs. just a zoomed in/out plot. Maybe +/- 0.02 with outliers clipped to the same range so that they appear on the plot at the edges even if they are large. Thanks.

The plan is to use desi_spectro_calib/kibotest for the k1 test run, and if that looks good commit it to trunk to use with daily and make a tag for kibo.

sbailey commented 1 month ago

@schlafly could you re-summarize which fibers have TPCORR problems in Jura which are expected to be fixed by this PCA update in Kibo, vs. ones that are known to still have problems and should be flagged as FIBERSTATUS VARIABLETHRU? Fibers on my radar from the tracking spreadsheet (don't post link here!): 466, 1098, 3994, 4720, 4891.

schlafly commented 1 month ago

I expect this update to primarily improve the following cases:

Fiber 1098: we are just zeroing this one out manually in the tpcorr creation because its B band measurements are affected by parallel CTE issues and aren't reliable.
Fiber 3994: fixed naturally in update (just using more data)
Fiber 4720: fixed naturally in update (just using more data) This update will also generically improve TPCORR for all fibers by small amounts, and improve the handling of stuck fibers and fibers in the problematic regions of r4 and z5 that we either couldn't model at all in the initial tpcorr measurements or which we modeled badly.

Will not help:

Fiber 466: this one hadn't been on my radar. It has a large spatial correction; probably the cubic isn't enough and we should go to higher order, but let's not pursue that. Its residuals after clipping are quite nominal but there are small parts of the patrol radius that get pretty ugly. I can vaguely identify it on the basis of having a larger inferred sigma based on the 5-95% quantiles than based on the 16-84% quantiles (marked in orange below), but there would be a lot more to stare at if we wanted to mark all of those.
Fiber 4891: this fiber is just problematic and varies like crazy.

Fibers with high variance in tpcorr:

1098, B: this ticket. Probably not really high variance---instead, it appears to have high variance because it's affected by parallel CTE. This fiber may be fine now that we are explicitly zeroing its TPCORR modeling, or it may still be sick because it needs some TPCORR but we have given up on guessing what it needs based on its corrupted line measurements.
4891, all bands: genuinely weird There are a few more at the 0.03 mag level and start to be lots at the 0.02 mag level. The fibers on z5 (near old hot column) and r4 (near old bad CTE) often stand out in TPCORR but that's because of other instrumental effects that we're trying to mask appropriately here rather than because of actual issues with the fiber throughput.

The only other fiber that has std(TPCORR) > 0.02 in all bands and >0.03 in at least one band is 817. If you demand >= 0.02 in two bands, you get additionally 2171, 2277, and 2682. I would not be surprised if this turns out to be a broken fiber list but I haven't checked.

Bonus: what fiber 466 TPCORR look like raw and after removing the TPCORR spatial model.

Here's a gallery of some other fibers that have similar amplitude spatial corrections. https://data.desi.lbl.gov/desi/users/schlafly/tpcorr-spatial-badfib.pdf

forero commented 1 month ago

@sbailey

The total number of fibers showing extreme differences increases from 49 to 73

Doing the same thing: comparing 'MEAN' values in the tpcorrparam--.fits files between two models:

Model A: /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/
Model B: /global/cfs/cdirs/desi/spectro/desi_spectro_calib/kibotest/

I found extreme differences (>5 sigma) in the 'MEAN' values for 73 unique fibers. Here's the list of fiber IDs

[158, 168, 576, 662, 685, 691, 815, 845, 1042, 1098, 1221, 1262, 1288, 1336, 1424, 1570, 1630, 1935, 2000, 2066, 2074, 2143, 2171, 2251, 2254, 2260, 2277, 2598, 2663, 2667, 2669, 2670, 2673, 2674, 2682, 2918, 2932, 3209, 3244, 3310, 3322, 3519, 3799, 3937, 3987, 3992, 3994, 4007, 4014, 4125, 4160, 4193, 4208, 4215, 4240, 4346, 4428, 4455, 4494, 4504, 4515, 4526, 4529, 4621, 4705, 4720, 4765, 4891, 4948, 4952, 4961, 4974, 4978]

This list still includes fibers: 1098, 3994, 4720, and 4891. However, the patterns of consecutive fibers have changed:

There's a small cluster of consecutive fibers showing extreme differences from 2667-2674.
Another small cluster appears from 3987-3994.
There are several pairs of consecutive or near-consecutive fibers throughout the list, such as 158-168, 685-691, 2251-2254-2260, 4515-4526-4529, and 4961-4974-4978.

julienguy commented 1 month ago

Masked column with charge trap in pixmask-sm5-b-20201014.fits.gz and pixmask-sm5-b2-20210901.fits.gz. Set BADCOLUMNFIBERS: 1097,1098 for all configs of sm5-b.yaml. I checked the trap was there at the beginning of the survey.

sbailey commented 1 month ago

Summary:

fiber 4891 flagged as VARIABLETHRU
fiber 1097,1098 masked as BADCOLUMNFIBERS
TPCORR parameters update in desi_spectro_calib
Overall vetting of new parameters is more complete than previous parameters, and thus changes from the previous results are expected to be improvements (i.e. different, but new is better).

Closing for Kibo.

desihub / desispec

Fibers with possible TPCORR issues #2305

Outlier Summary