fixes/updates to read_tile_spectra

sbailey commented 7 months ago

This PR has several bug fixes and updates for desispec.io.read_tile_spectra:

BUGFIX: fixes the matching of a redshift table (with one row per TARGETID) to a uncoadded spectra fibermap (with potentially multiple rows per TARGETID in a different order), when filtering by FIBER. The previous logic was simply wrong (mea culpa). I introduced a new desispec.util.argmatch function to isolate and test this matching logic, as well as introducing additional higher-level read_tile_spectra tests that would have caught the original bug but now pass. Usage: ii = argmatch(a, b) returns indices ii such that a[ii] == b is True for all entries.

@geordie666 and @araichoor heads up: desispec.util.argmatch is similar to desitarget.geomask.match_to, but it supports duplicate entries which match_to does not. I think argmatch is a superset of the functionality of match_to, but it doesn't support the bi-directional matching of desitarget.geomask.match (which match_to uses under-the-hood), and I wasn't sure if there were other use cases for match_to that I was missing in my tests. We could consider merging these in the future, but for now I introduced this new function in desispec instead of expanding the functionality of desitarget.geomask.match_to because I didn't want to risk breaking something there.

BUGFIX: fixes case where the spectra files are gzipped, but the redrock files are not. It was previously assuming that the redrock files had the same level of gzipping or not; now it can be either way by using checkgzip. Unit tests explicitly test this now.

FEATURES for read_tile_spectra:

night argument is now optional; will use latest night on disk if not specified.
group argument defaults to "cumulative" instead of None; the previous default produced paths structures that don't exist.
when filtering by fiber, skip reading any petals that aren't covered by those fibers.
- full disclosure: this uses the filename to derive the petal to know whether to skip the file. Normally it is bad practice to rely upon the filename to get metadata instead of using the file contents, but the whole point here is to avoid reading unnecessary files, and the function is already assuming that we are reading tiles within the standard datamodel directory structure and names, so I think this is ok.
adds functions in desispec.test.util for generating fake Spectra objects for testing.

@schlafly I think you were the original author (or advocate?) of read_tile_spectra, or at least a primary user of it, so could you review this PR? Thanks.

schlafly commented 7 months ago

I think I only touched it a little bit when diagnosing large sky residuals and then added a bit to be able to get pernight spectra rather than cumulative spectra. I think I was just trying to find what I need; I don't believe I was an advocate or a major author.

That said, this looks good to me. The one thing I would have done differently is that I feel like the "duplicates in both a & b" case is awkward and you should either disallow it or give the full cartesian product of the matches. But maybe I'm missing cases where that's really what you want.

FWIW, related code in my "personal library" goes like this:

def match(a, b):
    sa = np.argsort(a)
    sb = np.argsort(b)
    ua = unique(a[sa])
    ub = unique(b[sb])
    if len(ua) != len(a):
        raise ValueError('All keys in a must be unique.')
    ind = np.searchsorted(a[sa], b)
    m = (ind >= 0) & (ind < len(a))
    matches = a[sa[ind[m]]] == b[m]
    m[m] &= matches
    return sa[ind[m]], np.flatnonzero(m)

where the usage here would be something like

mrr, mfm = match(rr['TARGETID'], sp.fibermap['TARGETID'])
rrx = rr[mrr]

But I guess is also different in that I do want to allow the "no match in a" case, whereas you clearly want to trigger an exception there. Possibly a more generic routine should have an additional assert_all argument or something that checks e.g. np.all(m).

Anyway, just in case you're interested in yet another take on this; feel free to ignore it!

sbailey commented 7 months ago

Thanks for the fast review. I do need to specifically support the case of argmatch(a,b) having duplicate entries in b, because the use case here is matching TARGETIDs in a redshift table (one row per TARGETID) to an uncoadded fibermap which can have multiple rows for each TARGETID. Supporting duplicates in a came along for the ride which seemed harmless, as did supporting a having extra entries not present in b. But the opposite is a specific error, since if b has entries not in a, then no indices ii could make a[ii] == b .

For the record, desitarget.geomask.match (not match_to) doesn't support duplicates, but it does support bi-directional matching with extras in either list such that ii, jj = match(a,b) -> a[ii] == b[jj].

araichoor commented 7 months ago

[I see the PR has been merged while I was typing... I just send my message for completeness]

about argmatch(): as @schlafly, I may be a bit concerned about the handling of duplicates; i.e. it can be risky if the code returns something, but that s not what the user would expect; for ease, you may want to throw an error if duplicates are present.

btw, @geordie666: even though it s written in the docstr that the arguments shouldn t contain duplicates, I always felt we should add such a test of uniqueness in desitarget.geomask.match(), and return an error if arguments have duplicates; because, as far as I remember, the current code does not crash, but the returned indexes do not make sense.

moustakas commented 7 months ago

btw, @geordie666: even though it s written in the docstr that the arguments shouldn t contain duplicates, I always felt we should add such a test of uniqueness in desitarget.geomask.match(), and return an error if arguments have duplicates; because, as far as I remember, the current code does not crash, but the returned indexes do not make sense.

Yes! This bit me shortly after I started using desitarget.geomask.match() in FastSpecFit and it was a little painful to (eventually) track down the issue.

My only comment---too late now that this has been merged---is that a function like this should perhaps have gone into desiutil over desispec.

schlafly commented 7 months ago

My objection is to allowing duplicates in both a & b; it just feels like that demands "undefined behavior" surrounding how you pick which indices map to which indices there. As long as there are duplicates only in one of the two, things are unambiguous.

geordie666 commented 7 months ago

@araichoor: Feel free to open an issue in desitarget and I'll update the match function to include a test for duplicates.

desihub / desispec

fixes/updates to read_tile_spectra #2156