Open boyter opened 6 years ago
Benchmark #1: lc -pbl .git,vendor,licenses -f tabular .
Time (mean ± σ): 3.425 s ± 0.053 s [User: 3.617 s, System: 0.122 s]
Range (min … max): 3.375 s … 3.521 s
Example of performance as it currently stands.
Initial tests did not look promising. Simply adding fan-out to the work produced no speed improvement.
Checking against a single file shows that the initial pass alone takes ~600ms, so there is a lot to be gained there as well.
Probably need to consider changing how the whole pipeline works to achieve a speedup here.
(pprof) top10
Showing nodes accounting for 22200ms, 60.77% of 36530ms total
Dropped 205 nodes (cum <= 182.65ms)
Showing top 10 nodes out of 79
flat flat% sum% cum cum%
6930ms 18.97% 18.97% 6930ms 18.97% runtime.indexbytebody
5490ms 15.03% 34.00% 16370ms 44.81% strings.Index
1630ms 4.46% 38.46% 1630ms 4.46% runtime.memeqbody
1500ms 4.11% 42.57% 2800ms 7.66% runtime.slicerunetostring
1280ms 3.50% 46.07% 1280ms 3.50% runtime.encoderune
1170ms 3.20% 49.27% 1560ms 4.27% runtime.mapiternext
1150ms 3.15% 52.42% 2240ms 6.13% runtime.mapaccess2_faststr
1070ms 2.93% 55.35% 1070ms 2.93% runtime.aeshashbody
1050ms 2.87% 58.23% 1920ms 5.26% runtime.scanobject
930ms 2.55% 60.77% 930ms 2.55% runtime.memequal
Most of the time is spent in contains and index comparisons: strings.Index accounts for 44.81% cumulative, with runtime.indexbytebody under it. Might be faster to move over to byte comparisons.
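A rough sketch of what moving to byte comparisons could look like. The runtime.slicerunetostring and runtime.encoderune entries in the profile suggest rune slices are being converted back to strings somewhere, and keeping everything as []byte would avoid that. The function name containsAll and the sample needles are hypothetical, not taken from lc.

```go
package main

import (
	"bytes"
	"fmt"
)

// containsAll is a hypothetical helper (not from lc). It reports whether
// every needle occurs somewhere in content. Both sides stay as []byte, so
// no string or []rune conversion is needed per comparison.
func containsAll(content []byte, needles [][]byte) bool {
	for _, n := range needles {
		if !bytes.Contains(content, n) {
			return false
		}
	}
	return true
}

func main() {
	// Sample text and needles are made up for illustration.
	content := []byte("Permission is hereby granted, free of charge, to any person obtaining a copy")
	needles := [][]byte{
		[]byte("hereby granted"),
		[]byte("free of charge"),
	}
	fmt.Println(containsAll(content, needles))
}
```

Whether this actually beats the string version would need to be confirmed with another benchmark and profile run, since bytes.Index and strings.Index share much of the same underlying machinery. The likely gain is from dropping the conversions rather than from the search itself.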
https://blog.sourced.tech/post/gld/
Called out publicly... oh, it's on now. Time to double down on performance.
Although one of their goals is to
"Favor false positives over false negatives (target data mining instead of compliance)."
Which I did not want to do.
http://web.archive.org/web/20180904032703/https://blog.sourced.tech/post/gld/
updated link because they went away
Although it seems it lives on somewhat here https://github.com/go-enry/go-license-detector
https://github.com/src-d/go-license-detector/blob/master/licensedb/dataset.zip link to file for testing
The performance could be a lot better through the use of fan-out. It might also be possible to speed up the matching by using byte comparisons rather than string comparisons. Need to investigate both, as the tool can be quite slow at times.
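For reference, a minimal sketch of the fan-out pattern in Go: a fixed pool of workers reads paths from one channel and writes results to another. The names fanOut, result and isCandidate are hypothetical and the per-file work is a stand-in, so this shows the shape of the approach rather than how lc is structured.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// result pairs an input path with the outcome of the per-file work.
// Hypothetical type for illustration only.
type result struct {
	path  string
	match bool
}

// fanOut distributes paths across n workers and collects every result.
// Result order is not guaranteed to match input order.
func fanOut(paths []string, n int, work func(string) bool) []result {
	jobs := make(chan string)
	out := make(chan result)

	// Start n workers that each drain the jobs channel.
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				out <- result{path: p, match: work(p)}
			}
		}()
	}

	// Feed the jobs, then signal that no more are coming.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()

	// Close the output channel once all workers have finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	paths := []string{"LICENSE", "README.md", "COPYING"}
	// Stand-in for the real per-file license matching.
	isCandidate := func(p string) bool { return p == "LICENSE" || p == "COPYING" }
	for _, r := range fanOut(paths, runtime.NumCPU(), isCandidate) {
		fmt.Println(r.path, r.match)
	}
}
```

Fan-out only helps if the per-file work dominates. If a single file already costs ~600ms in a serial initial pass, the matching itself probably needs to get cheaper too.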