TestRoots / watchdog

IntelliJ & Eclipse plugin for monitoring how Java applications are developed and tested
http://www.testroots.org

Levenshtein distance calculation is slow #224

Closed: Inventitech closed this issue 8 years ago

Inventitech commented 8 years ago

Bug report sent in by a WatchDog user:

I'm afraid I've got a new one, though... The method com.cedarsoftware.util.StringUtilities.levenshteinDistance() is taking far, far more time than is reasonable, and is regularly pushing all cores to 100%, generally killing my development experience. I've restarted IntelliJ and the machine itself, and cleared IntelliJ's caches, but to no effect.

Profiling IntelliJ for about 10 seconds of moderate file manipulation shows 250+ seconds of CPU time spent in that method. I don't have 25 cores, so I'm not sure what that means; perhaps the sampler is playing fast and loose with timings.

It is called by nl.tudelft.watchdog.core.logic.interval.intervaltypes.TypingInterval$TypingIntervalCloserBase.run(), which itself is called by Thread.run(), and usually pegs all cores for about a minute or two at least before calming down again.

Attached is the CPU profiling session from jvisualvm, in the hopes that it sheds some light on the issue. I've been hunting this bug in my workspace for much of the afternoon, and unfortunately need to disable this plugin until this issue is resolved.

Sorry to trade one fix for another bug report, but I'm certain that we're all committed to less buggy software, and the only way to get there is through more bug reports!

nspruit commented 8 years ago

I've taken a look at this issue, but I haven't been able to reproduce this scenario on my machine. For me, the Levenshtein calculation took less than 2 seconds in all scenarios I tested. Furthermore, my cores are not busy for 1-2 minutes as described above, but only lightly loaded during the calculation itself. In addition, the profiling session attached to the bug report doesn't point to a potential fix, as it only shows that the method com.cedarsoftware.util.StringUtilities.levenshteinDistance takes a lot of CPU time (>16 seconds).

However, I have a suggestion that might fix the issue. Currently, the Levenshtein calculation is performed by the java-util library. Although the version we use is a bit old, the changelogs of newer versions don't mention changes to this particular calculation, so updating the library probably won't fix the problem. However, the project also uses the Apache commons-lang library, whose StringUtils class provides the Levenshtein distance functionality we need. We could therefore switch to this library and remove a dependency in the process. I'm not sure whether the java-util library was chosen for a particular reason, though; @Inventitech probably knows the answer to this.
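
For illustration, the switch would essentially be a one-line change (a sketch; this assumes commons-lang3, while in the older commons-lang 2.x the same method lives in org.apache.commons.lang.StringUtils, and startDoc/endDoc are placeholder names for the two document contents):

// current: java-util
int distance = com.cedarsoftware.util.StringUtilities.levenshteinDistance(startDoc, endDoc);

// proposed: Apache commons-lang
int distance = org.apache.commons.lang3.StringUtils.getLevenshteinDistance(startDoc, endDoc);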

Inventitech commented 8 years ago

I've thought a bit about this. It seemed to me that we could have two problems: 1) our library is slow, or 2) we have very large files or edits. The standard algorithm for the Levenshtein distance runs in O(n*m) time for two documents of lengths (in characters!) n and m. This is not a problem for smallish files, but it does not scale well. The question is where the practical limit of "scaling well" is reached, and whether we see such files in our observations at all.
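
For reference, the textbook dynamic-programming formulation makes the O(n*m) cost explicit: every character of one document is compared against every character of the other (a minimal sketch, not the implementation WatchDog actually uses):

// Classic two-row DP Levenshtein: O(n*m) time, O(m) extra space.
static int levenshtein(CharSequence a, CharSequence b) {
    int n = a.length(), m = b.length();
    int[] prev = new int[m + 1];
    int[] curr = new int[m + 1];
    for (int j = 0; j <= m; j++) prev[j] = j;          // distance from empty prefix
    for (int i = 1; i <= n; i++) {
        curr[0] = i;
        for (int j = 1; j <= m; j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                        prev[j] + 1),      // deletion
                               prev[j - 1] + cost);        // substitution
        }
        int[] tmp = prev; prev = curr; curr = tmp;         // reuse the two rows
    }
    return prev[m];
}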

1) seemed not very likely to me, especially since our library actually uses multithreading. To check 2), I investigated the data of the user who filed this issue. Because we store these fields as sub-documents in MongoDB, I could not access them directly through an aggregation. Instead, I wrote a small JS script that finds all change documents for said user and prints the sloc and the Levenshtein distance (diff) as CSV.

print("filetype,o_sloc,e_sloc,diff");
db.intervals.find({"projectId":"id","it":"ty"}).forEach(
    function(interval) {
        print(interval.doc.dt+","+interval.doc.sloc+","+interval.endingDocument.sloc+","+interval.diff);
    }
);
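
Assuming the script is saved as diffs.js and the database is called watchdog (both names are placeholders), it can be executed roughly like this:

mongo --quiet watchdog diffs.js > diffs.csv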

I then fed the script to MongoDB, had it execute, and piped the results to a CSV file. Reading the result into R, sure enough, max(sloc)=36,735 and max(diff)=1,048,576. The good part is that these are probably some configuration files; the longest (Java) production file is 'only' 2,400 sloc. However, there are also quite a number of "undefined" values in the dataset.

I think that identifies the problem pretty clearly. The next thing is to come up with a good practical solution to counteract it.

nspruit commented 8 years ago

That identifies the problem clearly indeed. When I examined the results in the CSV file, it seemed to me that the values of 'diff' (= the Levenshtein distance) are not always correct (in addition to the "undefined" values). For instance, consider the following tuples:

filetype  o_sloc  e_sloc     diff
un             0   14198  1048576
un             0       2     6621
pr           178     173      406

In these cases, the value of 'diff' does not seem to be correct for the distance between the starting and ending document, as the Levenshtein distance is at most the length of the longer string, which is not the case in the examples above.

Inventitech commented 8 years ago

The diff is character-based; the other entries are line counts :-)

So basically, what this means for row 1 is that the document started blank, then 14,198 lines were added (probably copied from somewhere else), and the number of character-level edit steps recorded is 1,048,576.

nspruit commented 8 years ago

Ok, that explains a lot

nspruit commented 8 years ago

Sure, while running WatchDog I assume?

Inventitech commented 8 years ago

Yup, although it might be easier to test with some simple unit tests (which should have much the same effect).

nspruit commented 8 years ago

1a) I've investigated whether we could use the events we receive from the IDE's editor to keep track of (an approximation of) the Levenshtein distance. I discovered that it is possible to track the number of characters being added, removed or modified in both IDEs, as required for the Levenshtein distance. When these events do not overlap, summing up the number of chars added, removed or modified by the events yields exactly the Levenshtein distance between the original document and the document with all events applied.

However, when one or more of these events overlap (e.g. when you remove something you added earlier in the same typing interval), the described sum will be higher than the corresponding Levenshtein distance. When you replace the entire content of a file, the sum can even be about twice the Levenshtein distance. So, the question is whether this sum is an acceptable approximation of the Levenshtein distance or not.
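
For the IntelliJ side, a minimal sketch of this bookkeeping could hang off the platform's document events (DocumentListener and DocumentEvent are IntelliJ platform APIs; the class itself is made up for illustration, and Eclipse offers comparable document-change notifications):

import com.intellij.openapi.editor.event.DocumentEvent;
import com.intellij.openapi.editor.event.DocumentListener;

public class EditDistanceApproximator implements DocumentListener {
    private long approximateDistance = 0;

    @Override
    public void beforeDocumentChange(DocumentEvent event) {
        // Not needed for this approximation.
    }

    @Override
    public void documentChanged(DocumentEvent event) {
        // One event replaces a fragment of length o with one of length n:
        // |o - n| chars are added or removed and min(o, n) may be modified,
        // so max(o, n) bounds this event's contribution to the edit distance.
        approximateDistance += Math.max(event.getOldLength(), event.getNewLength());
    }

    public long getApproximateDistance() {
        return approximateDistance;
    }
}

As described above, this sum equals the Levenshtein distance when events don't overlap and overestimates it when they do.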

1b) Alternatively, I believe we could also keep track of the exact value of the Levenshtein distance during a typing interval by correctly handling overlapping events using their offset values. However, this requires quite complex bookkeeping, which might be too computationally intensive or simply too complex to implement.

I've also tested what happens when we switch to another implementation of the Levenshtein distance, as described above, by measuring the run time and watching the CPU load in htop. I performed 3 runs per implementation on a file of about 3,000 LOC in which I added or removed 11 characters. The current implementation is multi-threaded and uses all cores; the load on each core during the calculation varies between 5-90%. The run times for this implementation are:

INFO: Levenshtein distance: 11; Elapsed time: 262762 ms
INFO: Levenshtein distance: 11; Elapsed time: 255425 ms
INFO: Levenshtein distance: 11; Elapsed time: 259077 ms

The implementation in the Apache commons-lang library uses only a single thread of which the load remains 100% during most of the calculation. The run times are as follows:

INFO: Levenshtein distance: 11; Elapsed time: 93336 ms
INFO: Levenshtein distance: 11; Elapsed time: 94795 ms
INFO: Levenshtein distance: 11; Elapsed time: 87214 ms
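
For context, timing output in this format can be produced with a harness along these lines (a sketch; the file names and the use of commons-lang3 here are assumptions):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LevenshteinTiming {
    public static void main(String[] args) throws Exception {
        // Snapshots of the document before and after the edit; paths are placeholders.
        String before = new String(Files.readAllBytes(Paths.get("before.txt")), StandardCharsets.UTF_8);
        String after = new String(Files.readAllBytes(Paths.get("after.txt")), StandardCharsets.UTF_8);

        long start = System.currentTimeMillis();
        int distance = org.apache.commons.lang3.StringUtils.getLevenshteinDistance(before, after);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("INFO: Levenshtein distance: " + distance + "; Elapsed time: " + elapsed + " ms");
    }
}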

2) So, it seems like the Apache implementation is faster than the currently used one. Although the load on one core is 100%, that might be better than using all cores with varying load and many context switches, especially if you continue to use your laptop/PC during the calculation (as most machines nowadays have >1 core). Therefore, switching to the Apache implementation might also reduce the impact of the Levenshtein calculation on the usability of the IDE.

3) Finally, as we've discussed, we could also define a maximum value of the product N*M (N = #chars in starting doc, M = #chars in ending doc) up to which we calculate the Levenshtein distance. For the other cases, we can fall back to a simpler, less precise metric for the difference between the two documents, e.g. |N - M|, as sketched below.
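
A naive version of this guard might look as follows (a sketch; MAX_PRODUCT, the method names and the threshold value are hypothetical, and the threshold would have to be determined empirically):

// Guard around the expensive calculation; the threshold is a placeholder.
static final long MAX_PRODUCT = 10_000_000L;

static long documentDiff(String start, String end) {
    long n = start.length();
    long m = end.length();
    if (n * m > MAX_PRODUCT) {
        // Too expensive: fall back to the coarse size-difference metric.
        return Math.abs(n - m);
    }
    return levenshtein(start, end); // exact distance, e.g. via the library call
}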

Inventitech commented 8 years ago

Good! I took the liberty of updating your comment and introducing a numbering scheme to make it easier for me to reply.

It would be great if you could do 1a), 2) and 3). Regarding 3), I'd suggest always computing this rough metric anyway (so that we can correlate it with the Levenshtein distance). I think N*M is probably too simplistic, because I suspect the calculation will also take long if either N=0 or M=0 and the other one is very large (but we'd have to test this).

nspruit commented 8 years ago

Alright, do you want me to do these three things in a single branch/PR?

Regarding your comment, we might also take just N or M when M=0 or N=0, respectively. Then we won't get a 0 value for the product while one of the files might still be extremely large. That would change the guard's size estimate to something like the line below.
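
// Sketch of that refinement: fall back to the non-empty document's length
// when one side is empty, so huge one-sided edits still hit the guard.
long product = (n == 0 || m == 0) ? Math.max(n, m) : n * m;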

Inventitech commented 8 years ago

That's up to you.

nspruit commented 8 years ago

Ok, then I'll do it in a single branch with multiple commits (at least one per suggestion), so we can revert the changes for a particular suggestion if we have to.

Inventitech commented 8 years ago

Hi Niels,

Can you measure the time a bit with different LOCs? So for example 10, 50, 100, 250, 500, 1000 and 2000? Thx!

Moritz

nspruit commented 8 years ago

I've created a program (outside of the project) to do exactly that as well as to report the product N*M, so we can define a practical threshold for deciding whether or not to calculate the Levenshtein distance based on this product. However, I still have to generate several random test files for the experiment. I'll let you know when that is done and I have the results.
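
For what it's worth, generating the random test files could be as simple as the following (a sketch; the line lengths, character set and file names are arbitrary choices):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

public class TestFileGenerator {
    public static void main(String[] args) throws IOException {
        Random random = new Random(42); // fixed seed for reproducibility
        int[] locs = {10, 50, 100, 250, 500, 1000, 2000};
        for (int loc : locs) {
            StringBuilder sb = new StringBuilder();
            for (int line = 0; line < loc; line++) {
                int length = 20 + random.nextInt(60); // 20-79 chars per line
                for (int c = 0; c < length; c++) {
                    sb.append((char) ('a' + random.nextInt(26)));
                }
                sb.append('\n');
            }
            Files.write(Paths.get("test_" + loc + ".txt"),
                        sb.toString().getBytes(StandardCharsets.UTF_8));
        }
    }
}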

Inventitech commented 8 years ago

:+1: