Faster processing of the exclusion list

marcelm commented 1 month ago

In summary, processing a long exclusion list (100000 entries) now takes a couple of seconds instead of days.

Previously, we would compare every detected clone id to every clone id in the exclusion list. Each comparison uses the is_similar() function, which implements the comparison by looking at each character of the clone IDs in turn and comparing them.

So if we have m detected clone IDs, n clone IDs in the exclusion list and clone IDs have a length of k, runtime would be O(mnk).

I initially thought that the comparison to each excluded clone ID is necessary because is_similar allows for approximate matches. However, it turns out that is_similar is called with max_hamming=0, that is, only exact matches are allowed anyway, so it is equivalent to just look up each detected clone ID in a set that contains the excluded clone IDs. Runtime becomes effectively O(m) (kind of, if we consider the C-level operations to take constant time).

marcelm commented 1 month ago

Merging this as it is an urgent fix. The test failure does not have to do anything with the change at hand, and we can take of it later.

marcelm commented 1 month ago

The test failure was relevant, there is actually a difference in results. Will investigate.

marcelm commented 1 month ago

The difference in results is alleviated by the additional changes in #72. (Which makes the processing of the exclusion list a bit slower again, but it is still just a couple of minutes.)

frisen-lab / TREX

Faster processing of the exclusion list #71