desihub / desitarget

DESI Targeting
BSD 3-Clause "New" or "Revised" License

geomask.match(): check for duplicates #811

Closed araichoor closed 1 week ago

araichoor commented 8 months ago

the desitarget.geomask.match() function only works if there are no duplicates in the input arguments. this is correctly documented in the docstring; however, not everyone carefully reads a docstring :)

if I'm correct, with the current version, when the arguments contain duplicates, the code just proceeds, but the outputs are not meaningful; so it's likely the user will be hurt by that.

that's why it would be better/safer to add a check that none of the arguments have duplicates, and just raise an error if they do. something like:

# np refers to numpy; log is the module-level desiutil logger already used in geomask.py
if np.unique(A).size != len(A):
    msg = "A contains duplicates"
    log.error(msg)
    raise ValueError(msg)
if np.unique(B).size != len(B):
    msg = "B contains duplicates"
    log.error(msg)
    raise ValueError(msg)

also, while writing the code snippet above, I'm wondering if A and B should be restricted to 1d arrays...
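for illustration, here is a minimal sketch of a sort-based matcher (the function name and body are hypothetical, not the actual desitarget implementation); it shows why duplicates silently break the assumption that the two returned index lists line up one-to-one:

```python
import numpy as np

def match_sketch(A, B):
    """Hypothetical sort-based matcher: return iA, iB such that A[iA] == B[iB].

    Only valid when A and B each contain unique values.
    """
    A, B = np.asarray(A), np.asarray(B)
    sA, sB = np.argsort(A), np.argsort(B)  # sort both inputs
    maskA = np.isin(A[sA], B[sB])          # elements of A also present in B
    maskB = np.isin(B[sB], A[sA])          # elements of B also present in A
    return sA[maskA], sB[maskB]

# with unique inputs, the outputs line up element-by-element...
A, B = np.array([5, 2, 9]), np.array([9, 0, 5, 2])
iA, iB = match_sketch(A, B)
print(np.all(A[iA] == B[iB]))  # True

# ...but with duplicates the index lists no longer even have the same length
iA2, iB2 = match_sketch(np.array([5, 5, 2]), np.array([5, 2]))
print(len(iA2), len(iB2))  # 3 2
```

so without a duplicate check, a caller who zips iA and iB together gets silently wrong pairings (or a shape mismatch), which is exactly the failure mode described above.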

araichoor commented 6 months ago

@geordie666, another suggestion for this one:

actually, maybe the check could be made optional through a keyword argument (e.g. check_for_dups), so that users who need speed can skip it?

would that make sense?

the motivation is that this check can take a few seconds for 1e8-element arrays, and if one is really looking for speed -- and already knows that the inputs don't have duplicates -- this could be a "loss" of time.

[context: I have an in-development fiberassign update using match(), to speed up the code when running several hundred tiles with several tens of millions of targets at once]

geordie666 commented 6 months ago

Ah, yes, sorry, this fell off my plate for a while.

Would it be better to set the default to check_for_dups=False for strict backward-compatibility? It's likely people use this function in other places where speed is important.

araichoor commented 2 weeks ago

one suggestion here, if we're looking for speed for this check -- and this may justify defaulting to check_for_dups=True:

if check_for_dups=True, maybe the check can be done just after these lines: https://github.com/desihub/desitarget/blob/971d9d5c6b8beea27d1722306f2d54ee5ff82389/py/desitarget/geomask.py#L91-L93 with something like (I've not tested the code):

    # check for duplicates? (tmpA and tmpB are the sorted copies of A and B computed just above)
    if check_for_dups:
        n_Adups = np.count_nonzero(tmpA[1:] == tmpA[:-1])
        n_Bdups = np.count_nonzero(tmpB[1:] == tmpB[:-1])
        msg = []
        if n_Adups > 0:
            msg.append("A has {} duplicates".format(n_Adups))
        if n_Bdups > 0:
            msg.append("B has {} duplicates".format(n_Bdups))
        if len(msg) > 0:
            msg = "; ".join(msg)
            log.error(msg)
            raise ValueError(msg)

that way, it takes advantage of the already computed tmpA and tmpB. from a check on a nersc login node, (tmpA[1:] == tmpA[:-1]).sum() takes <0.1s (resp. <1s) for a 1e8-element (resp. 1e9-element) array.
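the adjacent-comparison trick works because, after sorting, any duplicated value occupies consecutive slots. a standalone illustration (here tmpA is simply a sorted copy, standing in for the sorted array already available inside match()):

```python
import numpy as np

A = np.array([4, 7, 4, 1])
tmpA = np.sort(A)  # duplicates become adjacent: [1, 4, 4, 7]
# each True in the elementwise comparison marks one duplicated neighbor pair
n_Adups = np.count_nonzero(tmpA[1:] == tmpA[:-1])
print(n_Adups)  # 1
```

note that n_Adups counts duplicated adjacent pairs, not distinct duplicated values (a value repeated three times contributes 2); for an error message that is a fine proxy.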

if that's confirmed, I would argue to default to check_for_dups=True, because the overhead would then be negligible.

for fun, a small check I did to find that np.count_nonzero(tmpA[1:] == tmpA[:-1]) seems to be ~30% faster than (tmpA[1:] == tmpA[:-1]).sum(), and that both methods significantly outperform the initially suggested np.unique(A).size != len(A):

import numpy as np
from time import time

for i in range(5, 10):

    x = np.arange(int(10 ** i))

    start = time()
    n0 = np.count_nonzero(x[1:] == x[:-1])
    dt0 = time() - start

    start = time()
    n1 = (x[1:] == x[:-1]).sum()
    dt1 = time() - start

    start = time()
    _ = np.unique(x).size != len(x)
    dt2 = time() - start
    #
    assert n0 == n1
    print("{}\t{:.3f}s\t{:.3f}s\t{:.3f}s".format(i, dt0, dt1, dt2))

returns

# METHOD0: np.count_nonzero()
# METHOD1: (x[1:] == x[:-1]).sum()
# METHOD2: np.unique(x).size != len(x)
#
# EXPONENT METHOD0 METHOD1 METHOD2
5   0.000s  0.000s  0.001s
6   0.001s  0.001s  0.014s
7   0.008s  0.009s  0.172s
8   0.084s  0.155s  1.734s
9   0.805s  1.200s  18.823s
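as an aside, wall-clock time() on a shared node can be noisy; a variant of the same comparison using the standard-library timeit, averaging over repeats rather than timing a single run (a sketch, not what was actually run above):

```python
import numpy as np
from timeit import timeit

x = np.arange(10**6)
reps = 10
# average per-call time for each duplicate-detection method
dt0 = timeit(lambda: np.count_nonzero(x[1:] == x[:-1]), number=reps) / reps
dt1 = timeit(lambda: (x[1:] == x[:-1]).sum(), number=reps) / reps
dt2 = timeit(lambda: np.unique(x).size != len(x), number=reps) / reps
print("count_nonzero: {:.4f}s  sum: {:.4f}s  unique: {:.4f}s".format(dt0, dt1, dt2))
```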
geordie666 commented 1 week ago

Addressed in #828.