Diff performance has regressed as of late. The changes made in #356 were too greedy: they resulted in calculating the Levenshtein distance and creating reports for files we don't care to diff.
I was able to diff Spark and generate a profile for one of the earlier iterations of the code in this PR; the profile shows the levenshtein.Calculate method consuming most of the CPU time.
Additionally, the concurrency wasn't implemented correctly, which caused significant performance problems of its own.
This PR re-adds the filtering of paths, reworks the concurrency used when creating the flattened reports, and adds concurrency to each of the scans (part of the long execution time was the serial creation of the file reports, which will now happen concurrently).
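The filter-then-fan-out shape described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: the `keep` predicate, the hypothetical `before`/`after` maps, and the inlined `levenshtein` function (standing in for `levenshtein.Calculate`) are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// levenshtein is a stand-in for the levenshtein.Calculate call:
// a two-row dynamic-programming edit distance.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	// Hypothetical file contents keyed by path, standing in for two scans.
	before := map[string]string{
		"a/main.go":       "package main",
		"a/vendor/dep.go": "vendored",
		"a/util.go":       "package util",
	}
	after := map[string]string{
		"a/main.go":       "package main // changed",
		"a/vendor/dep.go": "vendored v2",
		"a/util.go":       "package util",
	}

	// Filter paths up front so we never pay for distances we don't need.
	keep := func(p string) bool { return !strings.Contains(p, "/vendor/") }

	type result struct {
		path string
		dist int
	}
	paths := make(chan string)
	results := make(chan result)

	// Bounded worker pool: distances are computed concurrently.
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				results <- result{p, levenshtein(before[p], after[p])}
			}
		}()
	}
	go func() {
		for p := range before {
			if keep(p) {
				paths <- p
			}
		}
		close(paths)
	}()
	go func() { wg.Wait(); close(results) }()

	total, changed := 0, 0
	for r := range results {
		total++
		if r.dist > 0 {
			changed++
		}
	}
	fmt.Printf("compared %d paths, %d changed\n", total, changed)
}
```

With the vendored path filtered out, only two paths are compared and one differs, so this prints `compared 2 paths, 1 changed`.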
Prior to this change, running a diff on the fully extracted Spark files would not finish; now the same scan takes ~15 minutes (with two maps consisting of 759,999 and 759,993 elements, respectively):
```
________________________________________________________
Executed in  898.07 secs    fish           external
   usr time   39.35 mins    0.10 millis   39.35 mins
   sys time   39.94 mins    3.66 millis   39.94 mins
```
Closes: #426