computationalgeography / lue

LUE Scientific Database and Environmental Modelling Framework
https://lue.computationalgeography.org
MIT License

Optimize `clump` #656

Open kordejong opened 2 months ago

kordejong commented 2 months ago

`clump` contains a serial step that stitches together the local clumps determined in parallel. Part of this serial step is the most expensive part of the whole algorithm, and it prevents good performance and scalability. Revisit the code and try to make this step less expensive.

kordejong commented 2 months ago

Use the fact that, when checking whether a collection of global clump IDs overlaps with the collections used in neighbouring partitions, we can stop comparing once there is no overlap. We don't have to compare each collection with every other collection. More distant collections are less likely to contain clumps that should be merged with clumps in the current partition. The strategy is to decrease the number of times sets need to be compared.
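
A rough sketch of the idea, in Python for brevity rather than in LUE's own code. Everything here is hypothetical: the function name `overlapping_partitions`, the layout of `id_sets` (one set of global clump IDs per partition, in a 2D grid), and the outward search itself are assumptions for illustration, not the current implementation. The sketch relies on a clump being connected: a partition that shares an ID with the current partition must be reachable through a chain of neighbouring partitions that also contain that ID, so the search does not need to continue past a partition without overlap.

```python
from collections import deque


def overlapping_partitions(id_sets, start):
    """Find partitions whose ID set overlaps with that of `start`.

    Hypothetical helper: `id_sets` is a 2D list of sets of global clump IDs,
    `start` is a (row, col) index. The search expands outward from `start`
    and only continues through partitions that do overlap.
    Returns the overlapping partitions and the number of set comparisons.
    """
    nr_rows, nr_cols = len(id_sets), len(id_sets[0])
    start_ids = id_sets[start[0]][start[1]]
    visited = {start}
    queue = deque([start])
    overlapping = []
    nr_comparisons = 0

    while queue:
        row, col = queue.popleft()
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                neighbour = (row + d_row, col + d_col)
                if neighbour in visited:
                    continue
                if not (0 <= neighbour[0] < nr_rows and 0 <= neighbour[1] < nr_cols):
                    continue
                visited.add(neighbour)
                nr_comparisons += 1
                if not start_ids.isdisjoint(id_sets[neighbour[0]][neighbour[1]]):
                    # Overlap: this partition may pass the clump on further
                    overlapping.append(neighbour)
                    queue.append(neighbour)
                # No overlap: stop here, do not search beyond this partition

    return overlapping, nr_comparisons


# Made-up example: 3x4 partitions, each with its set of global clump IDs
id_sets = [
    [{1, 2}, {2, 3}, {4}, {5}],
    [{1}, {3}, {4}, {5, 6}],
    [{7}, {7, 8}, {8}, {6}],
]
overlapping, nr_comparisons = overlapping_partitions(id_sets, (0, 0))
```

For this made-up grid, the search from partition `(0, 0)` finds the two overlapping neighbours with 7 set comparisons, where comparing against every other partition would take 11. The saving should grow with the number of partitions, as long as clumps stay small relative to the array.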