Closed sbailey closed 7 months ago
I think I only touched it a little bit when diagnosing large sky residuals and then added a bit to be able to get pernight spectra rather than cumulative spectra. I think I was just trying to find what I need; I don't believe I was an advocate or a major author.
That said, this looks good to me. The one thing I would have done differently is that I feel like the "duplicates in both a & b" case is awkward and you should either disallow it or give the full cartesian product of the matches. But maybe I'm missing cases where that's really what you want.
FWIW, related code in my "personal library" goes like this:
def match(a, b):
sa = np.argsort(a)
sb = np.argsort(b)
ua = unique(a[sa])
ub = unique(b[sb])
if len(ua) != len(a):
raise ValueError('All keys in a must be unique.')
ind = np.searchsorted(a[sa], b)
m = (ind >= 0) & (ind < len(a))
matches = a[sa[ind[m]]] == b[m]
m[m] &= matches
return sa[ind[m]], np.flatnonzero(m)
where the usage here would be something like
mrr, mfm = match(rr['TARGETID'], sp.fibermap['TARGETID'])
rrx = rr[mrr]
But I guess is also different in that I do want to allow the "no match in a" case, whereas you clearly want to trigger an exception there. Possibly a more generic routine should have an additional assert_all argument or something that checks e.g. np.all(m).
Anyway, just in case you're interested in yet another take on this; feel free to ignore it!
Thanks for the fast review. I do need to specifically support the case of argmatch(a,b)
having duplicate entries in b
, because the use case here is matching TARGETIDs in a redshift table (one row per TARGETID) to an uncoadded fibermap which can have multiple rows for each TARGETID. Supporting duplicates in a
came along for the ride which seemed harmless, as did supporting a
having extra entries not present in b
. But the opposite is a specific error, since if b
has entries not in a
, then no indices ii
could make a[ii] == b
.
For the record, desitarget.geomask.match
(not match_to
) doesn't support duplicates, but it does support bi-directional matching with extras in either list such that ii, jj = match(a,b)
-> a[ii] == b[jj]
.
[I see the PR has been merged while I was typing... I just send my message for completeness]
about argmatch()
:
as @schlafly, I may be a bit concerned about the handling of duplicates; i.e. it can be risky if the code returns something, but that s not what the user would expect; for ease, you may want to throw an error if duplicates are present.
btw, @geordie666: even though it s written in the docstr that the arguments shouldn t contain duplicates, I always felt we should add such a test of uniqueness in desitarget.geomask.match()
, and return an error if arguments have duplicates; because, as far as I remember, the current code does not crash, but the returned indexes do not make sense.
btw, @geordie666: even though it s written in the docstr that the arguments shouldn t contain duplicates, I always felt we should add such a test of uniqueness in desitarget.geomask.match(), and return an error if arguments have duplicates; because, as far as I remember, the current code does not crash, but the returned indexes do not make sense.
Yes! This bit me shortly after I started using desitarget.geomask.match()
in FastSpecFit
and it was a little painful to (eventually) track down the issue.
My only comment---too late now that this has been merged---is that a function like this should perhaps have gone into desiutil
over desispec
.
My objection is to allowing duplicates in both a & b; it just feels like that demands "undefined behavior" surrounding how you pick which indices map to which indices there. As long as there are duplicates only in one of the two, things are unambiguous.
@araichoor: Feel free to open an issue in desitarget
and I'll update the match
function to include a test for duplicates.
This PR has several bug fixes and updates for
desispec.io.read_tile_spectra
:BUGFIX: fixes the matching of a redshift table (with one row per TARGETID) to a uncoadded spectra fibermap (with potentially multiple rows per TARGETID in a different order), when filtering by FIBER. The previous logic was simply wrong (mea culpa). I introduced a new
desispec.util.argmatch
function to isolate and test this matching logic, as well as introducing additional higher-levelread_tile_spectra
tests that would have caught the original bug but now pass. Usage:ii = argmatch(a, b)
returns indicesii
such thata[ii] == b
is True for all entries.desispec.util.argmatch
is similar todesitarget.geomask.match_to
, but it supports duplicate entries whichmatch_to
does not. I thinkargmatch
is a superset of the functionality ofmatch_to
, but it doesn't support the bi-directional matching ofdesitarget.geomask.match
(whichmatch_to
uses under-the-hood), and I wasn't sure if there were other use cases formatch_to
that I was missing in my tests. We could consider merging these in the future, but for now I introduced this new function in desispec instead of expanding the functionality ofdesitarget.geomask.match_to
because I didn't want to risk breaking something there.BUGFIX: fixes case where the spectra files are gzipped, but the redrock files are not. It was previously assuming that the redrock files had the same level of gzipping or not; now it can be either way by using
checkgzip
. Unit tests explicitly test this now.FEATURES for
read_tile_spectra
:night
argument is now optional; will use latest night on disk if not specified.group
argument defaults to "cumulative" instead of None; the previous default produced paths structures that don't exist.desispec.test.util
for generating fake Spectra objects for testing.@schlafly I think you were the original author (or advocate?) of read_tile_spectra, or at least a primary user of it, so could you review this PR? Thanks.