LBL-EESA / TECA

TECA, theToolkit for Extreme Climate Analysis, contains a collection of climate anlysis algorithms targetted at extreme event detection and analysis.
Other
53 stars 21 forks source link

connected component bug -- one too many objects counted #404

Closed taobrienlbl closed 4 years ago

taobrienlbl commented 4 years ago

See PR #403.

Connected components currently counts one too many objects. Analysis of output meshes indicates that it counts the background as a connected component. The first commit in this branch adds a test test/python/test_connected_components_simple.py demonstrating this on a simple test case.

@burlen - could you please verify that this test is valid and that the failure indeed indicates a bug?

burlen commented 4 years ago

The context from our discussions is below:

TAO's comment:

After discussing this in detail, here is our initial proposal:

  • Undo commit 01a7a90 in cc_fix branch
  • Improve documentation about the connected components algorithm (in header and RTD)
  • Rename metadata variable number_of_components to number_of_labels
  • Add simple test to ensure that segmentation works as expected if first value of array is not background
  • Check whether ar_count in teca_bayesian_ar_detect is reporting AR counts correctly (or whether bg is included)
  • Consider further how background is/should-be handled with area filters (see skimage.measure)

BL's email:

On 7/15/20 9:49 AM, Burlen Loring wrote:

Hi Travis,

I've looked at the changes and thought on this over night. I'd like to look at the existing code in more detail too but am running out of time before the meeting. My inclination is that this is not a bug and should not be changed. However, I'm not sure I understand exactly why you want to make this change, and you've consistently caught and fixed bugs in the code, and hence I want to discuss it more with you in person/zoom before deciding.

I'm not convinced yet that changing the label from 0 to -1 affects an algorithmically meaningful change. Since we are consistent in using 0 as the label for the back ground, changing it to -1 doesn't change the behavior of the algorithm.

Changing from 0 to -1 breaks some optimizations that we built in to subsequent processing stages, where fast O(c) look ups can be made where consecutive labels starting from 0 are present. See comments in teca_2d_component_area.h. Note: we did plan for non-ordinal labels, implementing a slow algorithm where O(log n) searches are made at every look up.Change to -1 takes the possibility of the fast O(c) look up optimizations completely off the table. reluctant to do it for that reason.

Thinking beyond the Bayesian AR detector use case, it is valuable to have the area of the back ground computed. This lets one get the total area by summing the areas of all the component. In fact one of our regression tests is based on this concept. In other applications the back ground may be a necessary/useful feature. For instance one could imagine that metric based on ratios involving back ground area could be a useful. For those reasons I'd advise not to skip computing the area of the back ground.

In algorithms where the back ground should not be used, since we are consistent in that 0th label is always back ground, it is simple to skip it. For n labels, start processing at label 1 and process to label n-1. This seems to be the way to go. It's what I intended when developing the labeler and label processing algorithms.

If you have found some instances where we are being inconsistent with background being labeled 0, or it is hard to skip for some reason, I think we should start the discussion there and see what can be done to make sure that we are consistent and make that process easier.

I want to say we handled this correctly in the Bayesian AR detector, but I'd like to double check to be sure, but am running out of time before the meeting...

Let's talk more about this in the meeting, in fact I'll jump online now in case you want to gets started early Burlen

On 7/14/20 5:48 PM, O'Brien, Travis Allen wrote:

Hi Burlen,

Thanks for checking this out on short notice.

The back ground is always labeled 0, this is by design. Could you not start processing at component 1 and go up? We could do that, yes. I do think something should change in the code: the code that reports ’number_of_components’ counts the component labeled as 0 (background), and the area calculation also calculates the area of component 0.

And yes, meeting tomorrow at 10am would work. If you have a few minutes before our meeting, would you mind taking a look at the proposed change that I pushed after I sent this first e-mail? It is a minimal change that redefines the background label to -1 and avoids reporting it as a counted component.

Have a good evening, -Travis-

-- Travis A. O'Brien Pronouns: he/him

Assistant Professor Earth and Atmospheric Sciences Indiana University, Bloomington

Visiting Faculty Climate and Ecosystem Sciences Division Lawrence Berkeley Lab

https://earth.indiana.edu/directory/faculty/obrien-travis.html +1 (812) 269-2051

On July 14, 2020 at 7:49:31 PM, Burlen Loring (bloring@lbl.gov) wrote:

Hi Travis,

The back ground is always labeled 0, this is by design. Could you not start processing at component 1 and go up?

I moved the weekly meeting to tomorrow at 10 am, perhaps we can figure it out then?

Burlen

On 7/14/20 3:59 PM, O'Brien, Travis Allen wrote:

Hi Burlen,

I believe I just discoverd a serious bug in the connected component algorithm: https://github.com/LBL-EESA/TECA/pull/403

This impacts the TECA BARD paper revision that is due next Friday. Could you please check this? If this is a bug, we’ll need to fix this ASAP, since I’ll need to re-run the TECA BARD training before we can proceed with resubmitting the paper.

Thanks, -Travis-

-- Travis A. O'Brien Pronouns: he/him

Assistant Professor Earth and Atmospheric Sciences Indiana University, Bloomington

Visiting Faculty Climate and Ecosystem Sciences Division Lawrence Berkeley Lab

https://earth.indiana.edu/directory/faculty/obrien-travis.html +1 (812) 269-2051