evaluating scancode - Githubissues

valeriocos commented 5 years ago

Hi @pombredanne,

I've embedded scancode (a really nice tool) in Graal and now I'm evaluating scancode against nomos (another popular tool for license analysis) wrt precision and performance.

In a nutshell, the evaluation consists of iterating over the commits of a set of git repositories. For each commit, Graal performs a checkout and launches scancode/nomos on each file present in the commit; finally, the results are persisted on disk. While nomos is pretty fast (it processed 5 repos of around 3000 commits each in 2 hours), scancode is still processing the first repo. I'm wondering if I'm missing some parameters (or if you have some suggestions) to make the analysis faster. Currently I'm using release 3.0.0 and I launch it with the following params: https://github.com/chaoss/grimoirelab-graal/blob/master/graal/backends/core/analyzers/scancode.py#L58

Thank you

pombredanne commented 5 years ago

@valeriocos Hi! It was nice to meet at FOSDEM, and thank you for the report. You want to use 3.0.2, but that is minor. You would likely want to run on multiple processes with --processes X, where X is the number of parallel processes. There could be some other flags too. Can you tell me some examples of what you scan? A whole checkout for each commit, or something else? And is this something you run on one file at a time?
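For illustration, here is a minimal Python sketch of how a wrapper might assemble a ScanCode command line that scans a whole directory with several processes. Only the --license, --processes and --json-pp options quoted in this thread are used; the helper name build_scancode_command, the default of one process per CPU, and the example paths are assumptions made for the example.

```python
import os


def build_scancode_command(scancode_path, target_dir, output_json, processes=None):
    """Build the argument list for one ScanCode license scan of a directory.

    Only options quoted in this thread are used: --license, --processes
    and --json-pp. The function name and defaults are illustrative.
    """
    if processes is None:
        # Assumption: one worker per CPU is a reasonable starting point.
        processes = os.cpu_count() or 1
    if processes < 1:
        raise ValueError("processes must be >= 1")
    return [
        scancode_path,
        "--license",
        "--processes", str(processes),
        "--json-pp", output_json,
        target_dir,
    ]


# Hypothetical paths, for demonstration only.
cmd = build_scancode_command("scancode", "vorbis", "sc.json", processes=8)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run once per checkout instead of once per file.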

valeriocos commented 5 years ago

Thank you for answering @pombredanne, and indeed it was nice to meet you too at FOSDEM. I'm going to try version 3.0.2 with the --processes param and let you know.

Scancode is currently being executed on https://github.com/xiph/vorbis, graal performs a whole checkout of the commit, and then scancode is launched for each file in the commit (one file at a time)

If you want to run it on your machine, you can install graal and then execute it from the command line:

graal colic 
https://github.com/xiph/vorbis (the URL of the repo)
--git-path /home/test-scancode-vorbis (where the repo is going to be downloaded)
--exec-path /home/scancode-toolkit-3.0.0/scancode (the exec path of scancode)
--category code_license_scancode (the category to activate scancode)

pombredanne commented 5 years ago

@valeriocos there are two issues:

there are some pathological files in vorbis (that's a bug tracked in #1404)
running a scan file-by-file is the worst-case scenario for ScanCode.

Now, here is a quick comparison where I scan a whole directory at once rather than file-by-file, with both tools using 8 processes:

$ time ~/w421/scancode-toolkit-master2/scancode --license vorbis-9eadeccdc4247127d91ac70555074239f5ce3529 -n8 --json-pp sc.json
Setup plugins...
Collect file inventory...
Scan files for: licenses with 8 process(es)...
[####################] 413                                          
Scanning done.
Summary:        licenses with 8 process(es)
Errors count:   0
Scan Speed:     13.29 files/sec. 
Initial counts: 453 resource(s): 413 file(s) and 40 directorie(s) 
Final counts:   453 resource(s): 413 file(s) and 40 directorie(s) 
Timings:
  scan_start: 2019-02-28T140649.087502
  scan_end:   2019-02-28T140722.627997
  setup_scan:licenses: 2.33s
  setup: 2.33s
  scan: 31.07s
  total: 33.62s
Removing temporary files...done.

real    0m34.809s
user    2m41.974s
sys 0m2.723s

and:

$ time ./nomossa -d ~/tmp/bnch/vorbis-9eadeccdc4247127d91ac70555074239f5ce3529 -n8  > out.txt

real    0m3.184s
user    0m20.022s
sys 0m0.076s

So we have 34.809s vs. 3.184s, which means ScanCode is about 11 times slower in this case (still too slow; #1400 would help). In practice, because of bugs, nomossa should only be run on a single thread to avoid munging the output. In that case its elapsed time is 6.1s, which means nomossa is still about 5 times faster.

If you were to run file by file, it would take about 18s for nomossa vs. well ... ~39 minutes for ScanCode? There is a fixed startup time of about 2s to load the license index, plus some other overhead: everything in ScanCode is designed for batch scanning rather than one-file-at-a-time.
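A rough back-of-the-envelope sketch of why per-file invocation hurts, using only the two figures quoted above (413 files and roughly 2s of index loading per process). The function name is made up for the example, and the model deliberately ignores every other per-process overhead, so it accounts for only part of the ~39 minute estimate.

```python
def startup_overhead_seconds(invocations, startup_s=2.0):
    """Time spent only on loading the license index across all invocations."""
    return invocations * startup_s


files = 413       # file count reported in the scan summary above
startup_s = 2.0   # approximate index load time quoted above

per_file = startup_overhead_seconds(files, startup_s)  # one process per file
batch = startup_overhead_seconds(1, startup_s)         # one process per directory

print(f"file-by-file startup cost: {per_file:.0f}s (~{per_file / 60:.1f} min)")
print(f"batch startup cost:        {batch:.0f}s")
```

Under these assumptions the index load alone costs 826s (about 13.8 minutes) when scanning file by file, against 2s for a single batch run.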

Now if you run on Python 2, you could directly invoke the license detection as a function call on one file at a time. There, the overhead of the index load would happen only once. You would also bypass any JSON serialization/deserialization.
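The load-once idea can be sketched generically in Python as below. This is not ScanCode's API: get_index and detect_licenses are hypothetical stand-ins, and the one-entry "index" is placeholder data. The sketch only shows that caching the expensive load makes every later call pay nothing for it.

```python
import functools

LOAD_COUNT = 0  # counts how often the expensive load actually runs


@functools.lru_cache(maxsize=None)
def get_index():
    """Hypothetical stand-in for an expensive, one-time index load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Placeholder data: a real index would be built from license texts and rules.
    return {"gnu public license 2": "gpl-2.0"}


def detect_licenses(text):
    """Toy detection: report keys whose phrase occurs in the text."""
    index = get_index()  # cached after the first call
    lowered = text.lower()
    return sorted({key for phrase, key in index.items() if phrase in lowered})


for snippet in ["THE GNU PUBLIC LICENSE 2", "no notice here"]:
    print(detect_licenses(snippet))
print("index loads:", LOAD_COUNT)
```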

Of course, the bug I mentioned in #1404 is also making this more problematic, but I happen to have a work-in-progress fix in a local branch.

The other thing to consider is the accuracy of detection (it does not help to be fast if the result is incorrect). A quick comparison shows a few false-positive GPL detections and several imprecise detections in nomos. I will post a comparison table tomorrow.

pombredanne commented 5 years ago

> then scancode is launched for each file in the commit (one file at a time)

the short answer is that this is the worst case for ScanCode (one file at a time)

pombredanne commented 5 years ago

@valeriocos are you dealing with only the files changed in a commit, or are all the files rescanned at every commit? I ask because there is a new feature being worked on in #1399 to provide multiple paths as args to ScanCode, based on a report by @nicobucher, which seems quite related.

valeriocos commented 5 years ago

thank you @pombredanne for your detailed explanation. I'm dealing with only the files changed in a commit. I'll have a look at #1399 and try to use it.

> Now if you run on Python 2, you could directly invoke the license detection as a function call on one file at a time. There, the overhead of the index load would happen only once. You would also bypass any JSON serialization/deserialization.

Unfortunately I'm using Python 3, so scancode is executed from the command line.

> The other thing to consider is the accuracy of detection (it does not help to be fast if this is to be incorrect). A quick comparison shows a few false positive GPL and several imprecise detections in nomos. I will post a comparison table tomorrow

I attach the results for nomos and scancode obtained from the analysis of the vorbis repo. In a nutshell, each line in the files is a JSON representing the analysis for a given commit. In the attribute data.analysis you will find the output of scancode/nomos.

vorbis-analysis-nomos-vs-scancode.zip
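As a hedged illustration of the layout described above (one JSON document per line, with the tool output under data.analysis), such a file could be read as follows. The helper name iter_commit_analyses is made up, and the assumption that the commit hash sits next to analysis under data is inferred from the description, not checked against the attachment.

```python
import io
import json


def iter_commit_analyses(lines):
    """Yield (commit, analysis) pairs from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        item = json.loads(line)
        data = item.get("data", {})
        # Assumption: "commit" and "analysis" are both stored under "data".
        yield data.get("commit"), data.get("analysis")


# In-memory sample standing in for one line of the attached files.
sample = io.StringIO(
    '{"data": {"commit": "0695c7c", "analysis": [{"file_path": "lib/lpc.c", "licenses": []}]}}\n'
)
for commit, analysis in iter_commit_analyses(sample):
    print(commit, analysis)
```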

(other repos are under analysis, once they are done I can share them with you if you want)

pombredanne commented 5 years ago

ok... so there is something that @armijnhemel was asking me about a while back which would be a way to have a pre-fork daemon of sorts such that there is always a pre-loaded process ready to scan.

@armijnhemel how would you do this?

Alternatively, scanning a list of paths may amortize the startup costs too. And I have some fixes for the #1404 issues that cut the scan time in half in some special cases (code files with mostly numeric data).

armijnhemel commented 5 years ago

I would agree with what @pombredanne says: running scancode file by file is the worst case scenario. What I could imagine is something similar to clamd where you can send a path and a persistent process scans that path.
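A minimal sketch of the clamd-like idea, assuming a line-oriented protocol: the process loads its index once, then answers one JSON line per path it receives. Everything here is hypothetical (load_index, scan_path, serve, the toy index and the in-memory "files"); a real daemon would read from a socket and call the actual scanner.

```python
import io
import json


def load_index():
    """Hypothetical stand-in for the slow, one-time license index load."""
    return {"gnu public license 2": "gpl-2.0"}


def scan_path(index, path, read_text):
    """Toy scan: return license keys whose phrase occurs in the file text."""
    text = read_text(path).lower()
    return sorted({key for phrase, key in index.items() if phrase in text})


def serve(requests, responses, read_text):
    """Answer one JSON line per requested path until the input ends."""
    index = load_index()  # paid once for the lifetime of the process
    for line in requests:
        path = line.strip()
        if not path:
            continue
        try:
            result = {"path": path, "licenses": scan_path(index, path, read_text)}
        except OSError as exc:
            result = {"path": path, "error": str(exc)}
        responses.write(json.dumps(result) + "\n")


# Demo with in-memory "files" instead of a socket and a real file system.
fake_files = {"lib/lpc.c": "governed by THE GNU PUBLIC LICENSE 2"}


def read_fake(path):
    if path not in fake_files:
        raise FileNotFoundError(path)
    return fake_files[path]


out = io.StringIO()
serve(io.StringIO("lib/lpc.c\nmissing.c\n"), out, read_fake)
print(out.getvalue(), end="")
```

Keeping the request loop separate from the transport makes it possible to swap the in-memory streams for a socket file object later without touching the scanning code.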

jgbarah commented 5 years ago

Can scancode analyze lists of files (in a single invocation, I mean)? Maybe that could speed things up, since each commit may touch several files; in that case, the running time would be roughly linear in the number of commits instead of the number of files touched across all commits. (I think this is what you mention above as "list of paths".)

On a related note, do you have plans to support Python3 in the near future? That could make things easier too...

pombredanne commented 5 years ago

@jgbarah Hey!

> Can scancode analyze lists of files (in a single invocation, I mean). Maybe that could speed up the thing, since each commit may touch several files, and in that case, running would be sort of linear with the number of commits instead of number of files touched for all commits. (I think this is what you mention above as "list of paths").

Not yet, but I started a branch that can do that now at https://github.com/nexB/scancode-toolkit/pull/1399. It supports doing things such as git diff --name-only master | xargs scancode -i --json-pp - and passing multiple paths in one call. It should soon land in develop. See also #875 and #1397; this last one by @nicobucher is kinda timely with your evaluation.
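From a Python caller, the xargs-style batching could be sketched roughly as below. The options mirror the command quoted above (-i --json-pp - followed by paths); the helper names and the chunk size are assumptions, and passing several paths only works with a ScanCode build that includes the multiple-inputs work discussed here.

```python
def chunked(paths, size):
    """Split a list of paths into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def build_batch_commands(scancode_path, changed_paths, max_paths=100):
    """Build one ScanCode command per chunk of changed paths.

    Chunking plays the role of xargs: it keeps each command line bounded
    while still amortizing the startup cost over many files.
    """
    return [
        [scancode_path, "-i", "--json-pp", "-", *chunk]
        for chunk in chunked(changed_paths, max_paths)
    ]


# Hypothetical list of files changed in one commit.
changed = ["lib/lpc.c", "lib/lpc.h", "README"]
for cmd in build_batch_commands("scancode", changed, max_paths=2):
    print(" ".join(cmd))
```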

> On a related note, do you have plans to support Python3 in the near future? That could make things easier too...

Yes, see #295 ... this is going to be a GSoC project

I also have a simple remoting solution using execnet until then and this works nicely: I will push it later today for your enjoyment.

pombredanne commented 5 years ago

@valeriocos re

> I attach the results for nomos and scancode obtained from the analysis of the vorbis repo. In a nutshell, each line in the files is a JSON representing the analysis for a given commit. In the attribute data.analysis you will find the output of scancode/nomos.

Thank you! I will post my eval later.

pombredanne commented 5 years ago

@valeriocos do you mind trying this https://github.com/nexB/scancode-toolkit/commit/8afa686fb71b9540029234e5a40c0572c4457c28 in the 1397-multiple-inputs branch? There is a README

valeriocos commented 5 years ago

thank you @pombredanne for this, I'll use it and report.

pombredanne commented 5 years ago

@valeriocos sure, note that the code has now been merged (in the develop branch)

valeriocos commented 5 years ago

Thank you @pombredanne and sorry for the late reply. I have prepared a branch to test scancli.py (https://github.com/valeriocos/grimoirelab-graal/blob/test-scancli/graal/backends/core/analyzers/scancode.py#L48). I'll execute some tests and report on the performance.

pombredanne commented 5 years ago

@valeriocos ping. How are things working out for you?

valeriocos commented 5 years ago

Hi @pombredanne , sorry for the late reply. I tried the new version of scancode and it's far faster than the previous one. For instance, for the following repo: https://github.com/xiph/vorbis

scancode 3.0.0 took 20 hours, 52 minutes and 39 seconds
scancli took 3 hours, 36 minutes and 38 seconds
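To put the two durations above side by side, a small Python calculation (the variable names are only for this example):

```python
from datetime import timedelta

scancode_3_0_0 = timedelta(hours=20, minutes=52, seconds=39)
scancli = timedelta(hours=3, minutes=36, seconds=38)

# Dividing one timedelta by another yields a plain float ratio.
speedup = scancode_3_0_0 / scancli
print(f"{scancode_3_0_0.total_seconds():.0f}s vs {scancli.total_seconds():.0f}s")
print(f"speedup: {speedup:.1f}x")
```

That is 75159s against 12998s, so scancli was roughly 5.8 times faster on this repository.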

I inspected some of the results and in some cases I see some differences (I guess this is due to some improvements in scancode itself). For instance, for the repo https://github.com/xiph/vorbis and the commit git show 0695c7cbf5d766b7db3c664fa1bb82531c71fa38 I see that:

scancode 3.0.0 identified only one license TU Berlin License 1.0 (see below an excerpt of the data)

    "commit": "0695c7cbf5d766b7db3c664fa1bb82531c71fa38",
    "Author": "Monty <xiphmont@xiph.org>",
    "AuthorDate": "Wed Mar 29 03:49:29 2000 +0000",
    "Commit": "Monty <xiphmont@xiph.org>",
    "CommitDate": "Wed Mar 29 03:49:29 2000 +0000",
    "message": "Don't want to lose anything while I'm integrating (also don;t want to\ndisturb mainline till I'm done)\n\nMonty\n\nsvn path=\/branches\/unlabeled-1.18.2\/vorbis\/; revision=286",
    "analysis": [
      {
        "licenses": [
          {
            "key": "tu-berlin",
            "score": 98.9,
            "name": "Technische Universitaet Berlin Attribution License 1.0",
            "short_name": "TU Berlin License 1.0",
            "category": "Permissive",
            "is_exception": false,
            "owner": "Technische Universitaet Berlin",
            "homepage_url": "https:\/\/github.com\/swh\/ladspa\/blob\/7bf6f3799fdba70fda297c2d8fd9f526803d9680\/gsm\/COPYRIGHT",
            "text_url": "",
            "reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:tu-berlin",
            "spdx_license_key": "TU-Berlin-1.0",
            "spdx_url": "https:\/\/spdx.org\/licenses\/TU-Berlin-1.0",
            "start_line": 30,
            "end_line": 39,
            "matched_rule": {
              "identifier": "tu-berlin.LICENSE",
              "license_expression": "tu-berlin",
              "licenses": [
                "tu-berlin"
              ],
              "is_license_text": true,
              "is_license_notice": false,
              "is_license_reference": false,
              "is_license_tag": false
            }
          }
        ],
        "file_path": "lib\/lpc.c"
      }
    ],
    "analyzer": "scancode"

scancli identified two licenses GPL-2.0-only and TU Berlin License 1.0 (see below an excerpt of the data). A manual inspection points out that this result is more precise than the one above.

    "commit": "0695c7cbf5d766b7db3c664fa1bb82531c71fa38",
    "Author": "Monty <xiphmont@xiph.org>",
    "AuthorDate": "Wed Mar 29 03:49:29 2000 +0000",
    "Commit": "Monty <xiphmont@xiph.org>",
    "CommitDate": "Wed Mar 29 03:49:29 2000 +0000",
    "message": "Don't want to lose anything while I'm integrating (also don;t want to\ndisturb mainline till I'm done)\n\nMonty\n\nsvn path=\/branches\/unlabeled-1.18.2\/vorbis\/; revision=286",
    "analysis": {
      "licenses": [
        {
          "path": "lpc.c",
          "type": "file",
          "name": "lpc.c",
          "base_name": "lpc",
          "extension": ".c",
          "size": 11008,
          "date": "2019-03-15",
          "sha1": "71398429be51a79438400d0317dd6c4ab03e97d3",
          "md5": "e206cfa46afe1ff773767b934378b14d",
          "mime_type": "text\/x-c",
          "file_type": "C source, ASCII text",
          "programming_language": "C++",
          "is_binary": false,
          "is_text": true,
          "is_archive": false,
          "is_media": false,
          "is_source": true,
          "is_script": false,
          "licenses": [
            {
              "key": "gpl-2.0",
              "score": 94.74,
              "name": "GNU General Public License 2.0",
              "short_name": "GPL 2.0",
              "category": "Copyleft",
              "is_exception": false,
              "owner": "Free Software Foundation (FSF)",
              "homepage_url": "http:\/\/www.gnu.org\/licenses\/gpl-2.0.html",
              "text_url": "http:\/\/www.gnu.org\/licenses\/gpl-2.0.txt",
              "reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:gpl-2.0",
              "spdx_license_key": "GPL-2.0-only",
              "spdx_url": "https:\/\/spdx.org\/licenses\/GPL-2.0-only",
              "start_line": 3,
              "end_line": 6,
              "matched_rule": {
                "identifier": "gpl-2.0_613.RULE",
                "license_expression": "gpl-2.0",
                "licenses": [
                  "gpl-2.0"
                ],
                "is_license_text": false,
                "is_license_notice": true,
                "is_license_reference": false,
                "is_license_tag": false,
                "matcher": "2-aho",
                "rule_length": 36,
                "matched_length": 36,
                "match_coverage": 100,
                "rule_relevance": 100
              },
              "matched_text": "THIS FILE IS PART OF THE [Ogg] [Vorbis] SOFTWARE CODEC SOURCE CODE.  *\n * USE, DISTRIBUTION AND REPRODUCTION OF THIS SOURCE IS GOVERNED BY *\n * THE GNU PUBLIC LICENSE 2, WHICH IS INCLUDED WITH THIS SOURCE.    *\n * PLEASE READ THESE TERMS DISTRIBUTING.                            *"
            },
            {
              "key": "tu-berlin",
              "score": 98.9,
              "name": "Technische Universitaet Berlin Attribution License 1.0",
              "short_name": "TU Berlin License 1.0",
              "category": "Permissive",
              "is_exception": false,
              "owner": "Technische Universitaet Berlin",
              "homepage_url": "https:\/\/github.com\/swh\/ladspa\/blob\/7bf6f3799fdba70fda297c2d8fd9f526803d9680\/gsm\/COPYRIGHT",
              "text_url": "",
              "reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:tu-berlin",
              "spdx_license_key": "TU-Berlin-1.0",
              "spdx_url": "https:\/\/spdx.org\/licenses\/TU-Berlin-1.0",
              "start_line": 30,
              "end_line": 39,
              "matched_rule": {
                "identifier": "tu-berlin.LICENSE",
                "license_expression": "tu-berlin",
                "licenses": [
                  "tu-berlin"
                ],
                "is_license_text": true,
                "is_license_notice": false,
                "is_license_reference": false,
                "is_license_tag": false,
                "matcher": "3-seq",
                "rule_length": 91,
                "matched_length": 90,
                "match_coverage": 98.9,
                "rule_relevance": 100
              },
              "matched_text": "Any use of this software is permitted provided that this notice is not\nremoved and that neither the authors nor the Technische [Universita]\"[t]\nBerlin are deemed to have made any representations as to the\nsuitability of this software for any purpose nor are held responsible\nfor any defects of this software. THERE IS ABSOLUTELY NO WARRANTY FOR\nTHIS SOFTWARE.\n\nAs a matter of courtesy, the authors request to be informed about uses\nthis software has found, about bugs in this software, and about any\nimprovements that may be of general interest."
            }
          ],
          "license_expressions": [
            "gpl-2.0",
            "tu-berlin"
          ],
          "holders": [
            {
              "value": "Monty and The XIPHOPHORUS Company",
              "start_line": 8,
              "end_line": 10
            },
            {
              "value": "Preserved by Jutta Degener and Carsten Bormann, Technische",
              "start_line": 25,
              "end_line": 28
            }
          ],
          "copyrights": [
            {
              "value": "(c) COPYRIGHT 1994-2000 by Monty <monty@xiph.org> and The XIPHOPHORUS Company http:\/\/www.xiph.org",
              "start_line": 8,
              "end_line": 10
            },
            {
              "value": "Preserved Copyright 1992, 1993, 1994 by Jutta Degener and Carsten Bormann, Technische",
              "start_line": 25,
              "end_line": 28
            }
          ],
          "authors": [
            {
              "value": "Jutta Degener and Carsten Bormann",
              "start_line": 20,
              "end_line": 23
            },
            {
              "value": "J. Durbin",
              "start_line": 57,
              "end_line": 57
            }
          ],
          "files_count": 0,
          "dirs_count": 0,
          "size_count": 0,
          "scan_errors": [

          ],
          "file_path": "lib\/lpc.c"
        }
      ]
    },
    "analyzer": "scancode"

If you are interested, I can share with you the complete analysis obtained via Graal and scancli for https://github.com/xiph/vorbis.
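Since the two excerpts nest their results differently, comparing them needs a small normalization step. A minimal sketch, assuming only the two shapes shown above (a list of per-file entries for scancode 3.0.0, and a dict whose "licenses" list holds per-file entries for scancli); the function name license_keys and the trimmed sample payloads are made up for the example.

```python
def license_keys(analysis):
    """Return the sorted set of license keys found in an analysis payload.

    Handles the two shapes shown in the excerpts above:
    - a list of per-file entries, each with a "licenses" list of matches;
    - a dict whose "licenses" list holds per-file entries that in turn
      carry their own "licenses" list of matches.
    """
    if isinstance(analysis, dict):
        file_entries = analysis.get("licenses", [])
    else:
        file_entries = analysis or []
    keys = set()
    for entry in file_entries:
        for match in entry.get("licenses", []):
            keys.add(match["key"])
    return sorted(keys)


# Trimmed-down versions of the two excerpts.
old_style = [{"licenses": [{"key": "tu-berlin"}], "file_path": "lib/lpc.c"}]
new_style = {"licenses": [{"path": "lpc.c",
                           "licenses": [{"key": "gpl-2.0"}, {"key": "tu-berlin"}]}]}

old_keys, new_keys = license_keys(old_style), license_keys(new_style)
print(old_keys, new_keys)
print("only in new:", sorted(set(new_keys) - set(old_keys)))
```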

valeriocos commented 5 years ago

@pombredanne should we close this issue, or is there something left to discuss? If scancli is ready/tested/stable, I'll include it in Graal in the next few days.

Thanks!

pombredanne commented 5 years ago

@valeriocos I would like to make sure everything is OK before closing this. It also needs some doc for sure!

valeriocos commented 5 years ago

Thank you for answering @pombredanne , I'll keep an eye on the issue.

pombredanne commented 5 years ago

@armijnhemel that scanserv is something you wanted too BTW ;)

aboutcode-org / scancode-toolkit

evaluating scancode #1400