desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

PSF with bad amp #2217

Closed · julienguy closed this 2 months ago

julienguy commented 2 months ago

Do not fit the PSF for fiber bundles that lie on a CCD amplifier flagged as BADAMP.

This addresses issue #2202
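
For reference, a minimal sketch of the kind of bookkeeping this implies, assuming the bundle-to-amp mapping can be approximated by the x range of each bundle's fiber traces; the function name, array shapes, and amp-geometry dictionary below are illustrative assumptions, not the actual desispec implementation.

# Hypothetical sketch: decide which fiber bundles to skip when an amp is flagged BADAMP.
import numpy as np

FIBERS_PER_BUNDLE = 25   # DESI: 20 bundles x 25 fibers = 500 fibers per camera

def bundles_on_bad_amp(xtrace, amp_xranges, badamps):
    """Return bundle indices whose fiber traces overlap a bad amplifier.

    xtrace      : (nfiber, ny) array of fiber trace x positions on the CCD
    amp_xranges : dict like {'A': (xmin, xmax), ...} giving each amp's CCD columns
    badamps     : iterable of amp letters flagged bad, e.g. {'B'}
    """
    bad = set()
    nbundle = xtrace.shape[0] // FIBERS_PER_BUNDLE
    for bundle in range(nbundle):
        fibers = slice(bundle * FIBERS_PER_BUNDLE, (bundle + 1) * FIBERS_PER_BUNDLE)
        xmin, xmax = xtrace[fibers].min(), xtrace[fibers].max()
        for amp in badamps:
            lo, hi = amp_xranges[amp]
            if xmax >= lo and xmin <= hi:   # any overlap in CCD columns
                bad.add(bundle)
    return bad

A bundle flagged this way would be excluded from the PSF fit rather than producing a garbage fit from masked pixels.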

sbailey commented 2 months ago

The PSF part of this is good, thanks. Since the branch psf-with-bad-amp is a combo of main + badampcals, it also contains non-PSF changes that we need to test end-to-end before merging into main. i.e. don't merge this yet. I'll do that end-to-end testing now.

sbailey commented 2 months ago

@julienguy the underlying commands appear to be ok for processing the PSF even with a bad amp, but there is some problem with how they interact with the outer MPI wrappers. e.g.

srun -N 1 -n 121 -c 2 --cpu-bind=cores desi_proc --cameras a0489 -n 20230428 -e 177975 --badamps b8B --mpi

processes the first batch of cameras (r8, r9, z0, z4, z8, z9), but then gets stuck processing the remaining cameras (b0, b4, b8, b9, r0, r4). Those cameras print messages like

INFO:specex.py:390:fitframe: MPI ranks 41-60 fitting PSF for b0 in job 11 at Sun Apr 21 19:31:05 2024

and then hang, with no further log output for those cameras.

I don't think this is specific to the bad amp processing, since it also occurs when doing a full 30 cameras on a normal good night, e.g.

srun -N 3 -n 301 -c 2 --cpu-bind=cores desi_proc --cameras a0123456789 -n 20240408 -e 234956 --mpi

gets stuck on this branch, but not on main. I don't see any tracebacks or error messages. It smells like an MPI collective operation (barrier, bcast, gather) inside an "if" clause such that some ranks are waiting there but others bypass it and never join.
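
To illustrate the suspected failure mode in mpi4py terms (an illustration, not code from this branch): a collective call placed inside a conditional that only some ranks enter will hang, because the ranks that skipped the conditional never participate.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

camera_ok = (rank % 5 != 0)   # stand-in for "this rank's camera/amp is good"

# BROKEN pattern: only the "ok" ranks reach the gather, so they wait forever
# for the ranks that never call it:
#     if camera_ok:
#         results = comm.gather(result, root=0)

# FIX: every rank in the communicator calls the collective, passing a
# placeholder (e.g. None) when it has nothing to contribute.
payload = rank if camera_ok else None
results = comm.gather(payload, root=0)
if rank == 0:
    print([r for r in results if r is not None])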

julienguy commented 2 months ago

The bugs are fixed. This was difficult to debug because the failing ranks blocked the rest of the MPI processing without dumping any error.
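
One defensive pattern for this situation, sketched with mpi4py (illustrative only, not the actual fix): wrap each rank's work so that an uncaught exception prints a traceback and aborts the whole job, instead of leaving the other ranks silently blocked in a collective.

import sys
import traceback
from mpi4py import MPI

def run_with_abort(comm, func, *args, **kwargs):
    """Run func; on any exception, print the traceback and kill all ranks."""
    try:
        return func(*args, **kwargs)
    except Exception:
        traceback.print_exc()
        sys.stderr.flush()
        comm.Abort(1)   # fail loudly on every rank rather than hang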

sbailey commented 2 months ago

Update: the individual exposure PSF fitting now seems fine. I'm debugging downstream badamp bookkeeping, so don't merge this yet.

sbailey commented 2 months ago

Update: I have tested all pieces end-to-end by hand, including additional updates for propagating masks through fiberflats. I am now doing a final pipelined end-to-end test using 20210221 with b9C bad all night.
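
Roughly, "propagating masks through fiberflats" amounts to something like the sketch below (hypothetical names, not the actual changes): OR the bad-amp bit into the fiberflat mask for the affected fibers and zero their inverse variance so downstream steps treat them as unusable.

import numpy as np

def propagate_badamp_to_fiberflat(fiberflat_mask, fiberflat_ivar, bad_fibers, badbit):
    """Flag bad_fibers in (nfiber, nwave) fiberflat mask/ivar arrays."""
    fiberflat_mask[bad_fibers, :] |= badbit
    fiberflat_ivar[bad_fibers, :] = 0.0
    return fiberflat_mask, fiberflat_ivar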

sbailey commented 2 months ago

Update: we found another corner case when merging PSFs where a bundle only partially overlaps a bad amp (night 20210221, badamp b9C). I've fixed that, albeit with lots of "running steps by hand" debugging to get through the queue.

I'm now running another complete night end-to-end test, but will have to wait for the queue for that.
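
For the record, the corner case above boils down to a bundle that straddles the amp boundary, so the merge has to distinguish three situations rather than two; a schematic sketch (names and policy illustrative only, not the actual merge code):

import numpy as np

def classify_bundle(fiber_on_badamp):
    """fiber_on_badamp: boolean array, one entry per fiber in the bundle."""
    if np.all(fiber_on_badamp):
        return "all_bad"   # whole bundle was skipped in the fit
    elif np.any(fiber_on_badamp):
        return "partial"   # the corner case: good fibers kept, the rest replaced
    else:
        return "good"      # bundle fit normally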

sbailey commented 2 months ago

Full night end-to-end tests finally pass; see e.g. https://data.desi.lbl.gov/desi/users/sjbailey/spectro/redux/badampcals/tiles/cumulative/40964/20230503/tile-qa-40964-thru20230503.png. The b8 camera was self-consistently calibrated using data from that night instead of falling back to the default psf and fiberflat due to b8B being bad. Ready to merge.
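
The behavior being verified is essentially the selection logic sketched here (illustrative helper, not the pipeline code): prefer the nightly psf and fiberflat for a camera, and only fall back to the default calibration when the nightly product is missing.

import os

def pick_calib(nightly_path, default_path):
    """Return the nightly calibration file if it exists, else the default."""
    return nightly_path if os.path.isfile(nightly_path) else default_path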