desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License
35 stars 24 forks source link

fixes/updates to read_tile_spectra #2156

Closed sbailey closed 7 months ago

sbailey commented 7 months ago

This PR has several bug fixes and updates for desispec.io.read_tile_spectra:

BUGFIX: fixes the matching of a redshift table (with one row per TARGETID) to a uncoadded spectra fibermap (with potentially multiple rows per TARGETID in a different order), when filtering by FIBER. The previous logic was simply wrong (mea culpa). I introduced a new desispec.util.argmatch function to isolate and test this matching logic, as well as introducing additional higher-level read_tile_spectra tests that would have caught the original bug but now pass. Usage: ii = argmatch(a, b) returns indices ii such that a[ii] == b is True for all entries.

BUGFIX: fixes case where the spectra files are gzipped, but the redrock files are not. It was previously assuming that the redrock files had the same level of gzipping or not; now it can be either way by using checkgzip. Unit tests explicitly test this now.

FEATURES for read_tile_spectra:

@schlafly I think you were the original author (or advocate?) of read_tile_spectra, or at least a primary user of it, so could you review this PR? Thanks.

schlafly commented 7 months ago

I think I only touched it a little bit when diagnosing large sky residuals and then added a bit to be able to get pernight spectra rather than cumulative spectra. I think I was just trying to find what I need; I don't believe I was an advocate or a major author.

That said, this looks good to me. The one thing I would have done differently is that I feel like the "duplicates in both a & b" case is awkward and you should either disallow it or give the full cartesian product of the matches. But maybe I'm missing cases where that's really what you want.

FWIW, related code in my "personal library" goes like this:

def match(a, b):
    sa = np.argsort(a)
    sb = np.argsort(b)
    ua = unique(a[sa])
    ub = unique(b[sb])
    if len(ua) != len(a):
        raise ValueError('All keys in a must be unique.')
    ind = np.searchsorted(a[sa], b)
    m = (ind >= 0) & (ind < len(a))
    matches = a[sa[ind[m]]] == b[m]
    m[m] &= matches
    return sa[ind[m]], np.flatnonzero(m)

where the usage here would be something like

mrr, mfm = match(rr['TARGETID'], sp.fibermap['TARGETID'])
rrx = rr[mrr]

But I guess is also different in that I do want to allow the "no match in a" case, whereas you clearly want to trigger an exception there. Possibly a more generic routine should have an additional assert_all argument or something that checks e.g. np.all(m).

Anyway, just in case you're interested in yet another take on this; feel free to ignore it!

sbailey commented 7 months ago

Thanks for the fast review. I do need to specifically support the case of argmatch(a,b) having duplicate entries in b, because the use case here is matching TARGETIDs in a redshift table (one row per TARGETID) to an uncoadded fibermap which can have multiple rows for each TARGETID. Supporting duplicates in a came along for the ride which seemed harmless, as did supporting a having extra entries not present in b. But the opposite is a specific error, since if b has entries not in a, then no indices ii could make a[ii] == b .

For the record, desitarget.geomask.match (not match_to) doesn't support duplicates, but it does support bi-directional matching with extras in either list such that ii, jj = match(a,b) -> a[ii] == b[jj].

araichoor commented 7 months ago

[I see the PR has been merged while I was typing... I just send my message for completeness]

about argmatch(): as @schlafly, I may be a bit concerned about the handling of duplicates; i.e. it can be risky if the code returns something, but that s not what the user would expect; for ease, you may want to throw an error if duplicates are present.

btw, @geordie666: even though it s written in the docstr that the arguments shouldn t contain duplicates, I always felt we should add such a test of uniqueness in desitarget.geomask.match(), and return an error if arguments have duplicates; because, as far as I remember, the current code does not crash, but the returned indexes do not make sense.

moustakas commented 7 months ago

btw, @geordie666: even though it s written in the docstr that the arguments shouldn t contain duplicates, I always felt we should add such a test of uniqueness in desitarget.geomask.match(), and return an error if arguments have duplicates; because, as far as I remember, the current code does not crash, but the returned indexes do not make sense.

Yes! This bit me shortly after I started using desitarget.geomask.match() in FastSpecFit and it was a little painful to (eventually) track down the issue.

My only comment---too late now that this has been merged---is that a function like this should perhaps have gone into desiutil over desispec.

schlafly commented 7 months ago

My objection is to allowing duplicates in both a & b; it just feels like that demands "undefined behavior" surrounding how you pick which indices map to which indices there. As long as there are duplicates only in one of the two, things are unambiguous.

geordie666 commented 7 months ago

@araichoor: Feel free to open an issue in desitarget and I'll update the match function to include a test for duplicates.