Closed by julienguy 2 months ago
The PSF part of this is good, thanks. Since the branch psf-with-bad-amp is a combo of main + badampcals, it also contains non-PSF changes that we need to test end-to-end before merging into main. i.e. don't merge this yet. I'll do that end-to-end testing now.
@julienguy the underlying commands appear to be ok for processing the PSF even with a bad amp, but there is some problem with how they interact with the outer MPI wrappers. e.g.
srun -N 1 -n 121 -c 2 --cpu-bind=cores desi_proc --cameras a0489 -n 20230428 -e 177975 --badamps b8B --mpi
processes the first batch of cameras r8,r9,z0,z4,z8,z9; but then gets stuck processing the remaining cameras b0,b4,b8,b9,r0,r4. They print messages like
INFO:specex.py:390:fitframe: MPI ranks 41-60 fitting PSF for b0 in job 11 at Sun Apr 21 19:31:05 2024
and then hang with no further log output for those cameras.
I don't think this is specific to the bad amp processing, since it also occurs when doing a full 30 cameras on a normal good night, e.g.
srun -N 3 -n 301 -c 2 --cpu-bind=cores desi_proc --cameras a0123456789 -n 20240408 -e 234956 --mpi
gets stuck on this branch, but not on main. I don't see any tracebacks or error messages. It smells like an MPI collective operation (barrier, bcast, gather) inside an "if" clause such that some ranks are waiting there but others bypass it and never join.
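To illustrate the suspected failure mode (this is a sketch of the pattern, not desi_proc code): an MPI collective placed inside an `if` clause that only some ranks enter will hang the ranks that do call it. The tiny `FakeComm` class below is a hypothetical stand-in for an MPI communicator so the pattern is runnable without MPI; it just records which ranks reach the collective.

```python
class FakeComm:
    """Hypothetical stand-in for an MPI communicator: records which
    ranks reach the collective call instead of actually blocking."""
    def __init__(self, size):
        self.size = size
        self.arrived = []

    def barrier(self, rank):
        # A real MPI_Barrier blocks until *all* ranks arrive.
        self.arrived.append(rank)

def buggy_step(comm, rank, camera_ok):
    # BUG pattern: only ranks with a camera to fit reach the collective,
    # so ranks that skip the "if" never join and the others hang forever.
    if camera_ok[rank]:
        comm.barrier(rank)

def fixed_step(comm, rank, camera_ok):
    # FIX pattern: every rank calls the collective unconditionally.
    if camera_ok[rank]:
        pass  # per-camera PSF fitting would happen here
    comm.barrier(rank)

camera_ok = [True, False, True, False]  # e.g. some cameras are skipped

buggy = FakeComm(4)
for r in range(4):
    buggy_step(buggy, r, camera_ok)
# Only ranks 0 and 2 arrive; a real barrier would deadlock here.

fixed = FakeComm(4)
for r in range(4):
    fixed_step(fixed, r, camera_ok)
# All four ranks arrive, so the barrier completes.
```

This matches the observed symptom: no traceback, no error, just some ranks waiting at a collective that the remaining ranks bypassed.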
The bugs are fixed. Debugging was difficult because the failing ranks blocked the MPI processing without dumping any errors.
Update: the individual exposure PSF fitting now seems fine. I'm debugging downstream badamp bookkeeping, so don't merge this yet.
Update: I have tested all pieces end-to-end by hand, including additional updates for propagating masks through fiberflats. I am now doing a final pipelined end-to-end test using 20210221 with b9C bad all night.
Update: we found another corner case when merging PSFs where a bundle only partially overlaps a bad amp (night 20210221, badamp b9C). I've fixed that, albeit with lots of "running steps by hand" debugging to get through the queue.
I'm now running another complete night end-to-end test, but will have to wait for the queue for that.
Full night end-to-end tests finally pass; see e.g. https://data.desi.lbl.gov/desi/users/sjbailey/spectro/redux/badampcals/tiles/cumulative/40964/20230503/tile-qa-40964-thru20230503.png, which was self-consistently calibrated using data from that night instead of falling back to the default psf and fiberflat for b8 due to b8B being bad. Ready to merge.
Do not fit PSF for fiber bundle that lies on a CCD amplifier flagged as BADAMP.
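A rough sketch of the bookkeeping this implies (not the desispec implementation): skip PSF fitting for any fiber bundle whose CCD x-range overlaps an amplifier flagged BADAMP, where even a partial overlap excludes the bundle (the corner case hit on night 20210221). The CCD width, amp x-ranges, and even-tiling bundle layout below are illustrative assumptions, not the real geometry; y-coverage of the amps is ignored for simplicity.

```python
NX = 4000              # assumed CCD width in pixels (illustrative)
NBUNDLE = 20           # DESI-style: 20 bundles of 25 fibers per camera

# Assumption: amps A/C cover the left half in x, B/D the right half
# (real amps also split in y, which this sketch ignores).
AMP_XRANGE = {"A": (0, NX // 2), "C": (0, NX // 2),
              "B": (NX // 2, NX), "D": (NX // 2, NX)}

def bundle_xrange(bundle):
    """Illustrative assumption: bundles evenly tile the CCD in x."""
    width = NX / NBUNDLE
    return (bundle * width, (bundle + 1) * width)

def bundles_to_fit(badamps):
    """Return the bundles whose x-range avoids every bad amp.
    Even a partial overlap excludes the bundle."""
    keep = []
    for b in range(NBUNDLE):
        x0, x1 = bundle_xrange(b)
        if all(not (x0 < AMP_XRANGE[a][1] and x1 > AMP_XRANGE[a][0])
               for a in badamps):
            keep.append(b)
    return keep
```

With `badamps={"B"}` in this toy geometry, only the bundles entirely on the left half of the CCD are fit; their PSFs on the bad amp would instead come from the merged/default calibration.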
This addresses issue #2202