boyter / lc

licensechecker (lc) is a command-line application that scans directories and identifies which software licenses files are under, producing reports as SPDX, CSV, JSON, XLSX or CLI tabular output. Dual-licensed under MIT or the UNLICENSE.
GNU Affero General Public License v3.0

Performance #34

Open boyter opened 6 years ago

boyter commented 6 years ago

Performance could be a lot better through the use of fan-out. It might also be possible to speed up the matching by using byte comparisons rather than string comparisons. Both need investigating, as the tool can be quite slow at times.
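As a rough illustration of the fan-out idea, here is a minimal Go sketch. It is not lc's actual pipeline: `countMatches`, the in-memory inputs, and the single `needle` check are all hypothetical stand-ins for reading files and running the real license matching.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countMatches fans the inputs out to a fixed number of worker goroutines
// and returns how many inputs contain needle. (Hypothetical helper; the
// strings.Contains call stands in for the real per-file license check.)
func countMatches(inputs []string, needle string, workers int) int {
	jobs := make(chan string)
	results := make(chan bool)

	// Fan out: each worker pulls work from the shared jobs channel.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for text := range jobs {
				results <- strings.Contains(text, needle)
			}
		}()
	}

	// Feed the workers, then signal that no more work is coming.
	go func() {
		for _, in := range inputs {
			jobs <- in
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Fan in: collect the results on the calling goroutine.
	count := 0
	for ok := range results {
		if ok {
			count++
		}
	}
	return count
}

func main() {
	files := []string{"MIT License text", "Apache License text", "no match here"}
	fmt.Println(countMatches(files, "License", 4))
}
```

Fan-out only helps if the per-item work is large enough to outweigh the channel overhead and is not bottlenecked elsewhere, so any gain would need to be measured.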

boyter commented 6 years ago
Benchmark #1: lc -pbl .git,vendor,licenses -f tabular .
  Time (mean ± σ):      3.425 s ±  0.053 s    [User: 3.617 s, System: 0.122 s]
  Range (min … max):    3.375 s …  3.521 s

Example of performance as it currently stands.

boyter commented 6 years ago

Initial tests did not look good. Just adding fan-out to the work produced no speed improvement.

Checking against a single file shows that the initial pass takes ~600ms, so there is a lot to be gained there as well.

Probably need to consider changing how the whole pipeline works to achieve a speedup here.

boyter commented 6 years ago
(pprof) top10
Showing nodes accounting for 22200ms, 60.77% of 36530ms total
Dropped 205 nodes (cum <= 182.65ms)
Showing top 10 nodes out of 79
      flat  flat%   sum%        cum   cum%
    6930ms 18.97% 18.97%     6930ms 18.97%  runtime.indexbytebody
    5490ms 15.03% 34.00%    16370ms 44.81%  strings.Index
    1630ms  4.46% 38.46%     1630ms  4.46%  runtime.memeqbody
    1500ms  4.11% 42.57%     2800ms  7.66%  runtime.slicerunetostring
    1280ms  3.50% 46.07%     1280ms  3.50%  runtime.encoderune
    1170ms  3.20% 49.27%     1560ms  4.27%  runtime.mapiternext
    1150ms  3.15% 52.42%     2240ms  6.13%  runtime.mapaccess2_faststr
    1070ms  2.93% 55.35%     1070ms  2.93%  runtime.aeshashbody
    1050ms  2.87% 58.23%     1920ms  5.26%  runtime.scanobject
     930ms  2.55% 60.77%      930ms  2.55%  runtime.memequal

Most of the time is spent in contains and index comparisons (strings.Index and runtime.indexbytebody at the top of the profile). It might be faster to move over to byte comparisons.
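A minimal Go sketch of what moving to byte comparisons could look like, assuming the file content is kept as a `[]byte` and the search terms are converted once up front. `keywordSet`, `newKeywordSet` and `countHits` are hypothetical names, not lc's code. Whether this beats `strings.Index` is not a given and would need profiling; the more likely saving is avoiding repeated conversions such as the `runtime.slicerunetostring` entries in the profile above.

```go
package main

import (
	"bytes"
	"fmt"
)

// keywordSet holds search terms pre-converted to []byte so that no
// string conversion happens inside the matching loop. (Hypothetical type.)
type keywordSet [][]byte

// newKeywordSet converts each keyword to bytes exactly once.
func newKeywordSet(words []string) keywordSet {
	ks := make(keywordSet, 0, len(words))
	for _, w := range words {
		ks = append(ks, []byte(w))
	}
	return ks
}

// countHits reports how many keywords occur in content, comparing
// bytes directly against the raw file content.
func (ks keywordSet) countHits(content []byte) int {
	hits := 0
	for _, k := range ks {
		if bytes.Contains(content, k) {
			hits++
		}
	}
	return hits
}

func main() {
	ks := newKeywordSet([]string{"permission is hereby granted", "without warranty"})
	content := []byte("permission is hereby granted, free of charge")
	fmt.Println(ks.countHits(content))
}
```

In a real run the content would come straight from something like `os.ReadFile`, which already returns a `[]byte`, so no conversion is needed on that side either.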

boyter commented 6 years ago

https://blog.sourced.tech/post/gld/

Called out publicly... oh, it's on now. Time to double down on performance.

boyter commented 6 years ago

Although one of their goals is to

"Favor false positives over false negatives (target data mining instead of compliance)."

Which I did not want to do.

boyter commented 3 years ago

http://web.archive.org/web/20180904032703/https://blog.sourced.tech/post/gld/

Updated link, because the original went away.

boyter commented 3 years ago

Although it seems it lives on somewhat here: https://github.com/go-enry/go-license-detector

boyter commented 3 years ago

Link to a file for testing: https://github.com/src-d/go-license-detector/blob/master/licensedb/dataset.zip