dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/
Other
37 stars 36 forks source link

min_delta fails on "flat-topped" leaves with no parent structure #176

Open jbaade opened 2 years ago

jbaade commented 2 years ago

I noticed some unexpected behavior while testing dendrograms on random 1D spectra, to learn about how the algorithm worked. The file given is just an array of Gaussian noise with stdev 1, with a Gaussian function added with mean 50, amplitude 6, and stdev 2. I made a large number of varying spectra and collected examples of failed detections; some of them were too broken up by noise, which makes sense, but the rest had this sort of profile: bad_spectrum

bad_spectrum_file.txt

Loading that file into an array and running Dendrogram.compute(spectrum, min_value=3, min_delta=1, min_npix=3), I get an empty dendrogram, because only one structure is detected at all, and it fails the independence test for min_delta. Lines 84-91 in pruning.py:

    def result(structure, index=None, value=None):
        if value is None:
            if structure.parent is not None:
                return (structure.height - structure.parent.height) >= delta

            return (structure.vmax - structure.vmin) >= delta
        return (structure.vmax - value) >= delta
    return result

Here, structure.parent is None because the collection of points above min_value=3, at x=48 through x=51, forms the only structure, so line 89 compares the highest (5.25) and lowest (5.03) value in the structure and returns False.

I might have misunderstood the algorithm example in the documentation, but why this line compares vmax and vmin isn't very clear to me. My understanding was that the purpose of min_delta was to set a cutoff for distinguishing a substructure from noise. Why would a large-enough structure far above the min_value threshold not ought to be counted if it happens to be alone and to have a plateauing profile? This is just the sort of shape that can appear upon sampling a noisy Gaussian; I expect it might occur in the 2D case as well as 1D. Is this behavior intentional, and if so, what's the reasoning behind it?

keflavich commented 2 years ago

I think you're right, this sort of object should be included, and I think you've pinpointed a line that incorrectly excludes it. This warrants a deeper review. I'd like to hear from others whether there's any good explanation for this.

I also suggest we add this example to the test suite as a regression test.