valeriocos opened 5 years ago
@valeriocos Hi! It was nice to meet you at FOSDEM, and thank you for the report.
You want to use 3.0.2, but that's a minor point. You would likely want to run with multiple processes using `--processes X`,
where X is the number of parallel processes. There could be some other helpful flags too.
Can you give me some examples of what you scan, too? A whole checkout for each commit, or something else? And is this something you run on one file at a time?
Thank you for answering, @pombredanne, and indeed it was nice to meet you too at FOSDEM.
I'm going to try version 3.0.2 with the `--processes` param and let you know.
Scancode is currently being executed on https://github.com/xiph/vorbis: graal performs a whole checkout of the commit, and then scancode is launched for each file in the commit (one file at a time).
If you want to execute it on your machine, you can install graal and then run from the command line:

```
graal colic https://github.com/xiph/vorbis \
    --git-path /home/test-scancode-vorbis \
    --exec-path /home/scancode-toolkit-3.0.0/scancode \
    --category code_license_scancode
```

where the positional argument is the URL of the repo, `--git-path` is where the repo is going to be downloaded, `--exec-path` is the exec path of scancode, and `--category` is the category to activate scancode.
@valeriocos there are two issues:
Now, here is a quick comparison where I scan not file-by-file but a directory at a time, both using 8 processes:
```
$ time ~/w421/scancode-toolkit-master2/scancode --license vorbis-9eadeccdc4247127d91ac70555074239f5ce3529 -n8 --json-pp sc.json
Setup plugins...
Collect file inventory...
Scan files for: licenses with 8 process(es)...
[####################] 413
Scanning done.
Summary: licenses with 8 process(es)
Errors count: 0
Scan Speed: 13.29 files/sec.
Initial counts: 453 resource(s): 413 file(s) and 40 directorie(s)
Final counts: 453 resource(s): 413 file(s) and 40 directorie(s)
Timings:
scan_start: 2019-02-28T140649.087502
scan_end: 2019-02-28T140722.627997
setup_scan:licenses: 2.33s
setup: 2.33s
scan: 31.07s
total: 33.62s
Removing temporary files...done.

real 0m34.809s
user 2m41.974s
sys 0m2.723s
```
and:
```
$ time ./nomossa -d ~/tmp/bnch/vorbis-9eadeccdc4247127d91ac70555074239f5ce3529 -n8 > out.txt

real 0m3.184s
user 0m20.022s
sys 0m0.076s
```
So we have 34.809s vs. 3.184s, which is about 11 times slower in this case (still too slow; #1400 would help). In practice, because of bugs, nomossa should only run on a single thread to avoid munging its output; in that case its elapsed time is 6.1s, which means it is still about 5 times faster.
If you were to run file by file, it would take about 18s vs. well... ~39 minutes? There is a fixed startup time of about ~2s to load the license index, plus some overhead: everything in ScanCode is designed more for batch scanning than one-at-a-time.
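A back-of-the-envelope sketch of the arithmetic above (the 413-file count, the ~34.8s and ~3.2s wall times, and the ~2s index-load cost are the figures quoted in this thread; the model itself is an illustration, not ScanCode code):

```python
def total_seconds(n_files, startup_s, per_file_s, batch):
    """Estimate wall time: a batch run pays the fixed startup cost
    (the license index load) once; a file-by-file run pays it per file."""
    if batch:
        return startup_s + n_files * per_file_s
    return n_files * (startup_s + per_file_s)

# Batch-vs-batch ratio from the two timed runs above:
ratio = 34.809 / 3.184  # roughly 11x, as stated

# Even with a zero per-file cost, file-by-file startup alone for the
# 413 files costs 413 * 2s = 826s, i.e. nearly 14 minutes:
startup_only = total_seconds(413, 2.0, 0.0, batch=False)
```

This is why batching dominates: the fixed startup term is multiplied by the file count in the file-by-file case.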
Now if you run on Python 2, you could directly invoke the license detection as a function call on one file at a time. There, the overhead of the index load would happen only once. You would also bypass any JSON serialization/deserialization.
Of course, the bug I mentioned in #1404 is also making this more problematic, but I happen to have a work-in-progress fix in a local branch.
The other thing to consider is the accuracy of detection (it does not help to be fast if the results are incorrect). A quick comparison shows a few false-positive GPL detections and several imprecise detections in nomos. I will post a comparison table tomorrow.
then scancode is launched for each file in the commit (one file at a time)
The short answer is that this is the worst case for ScanCode (one file at a time).
@valeriocos are you dealing with only the files changed in a commit, or are all the files rescanned at every commit? Because there is a new feature being worked on to provide multiple paths as args to ScanCode in #1399, based on a report by @nicobucher, which seems quite related.
thank you @pombredanne for your detailed explanation. I'm dealing with only the files changed in a commit. I'll have a look at #1399 and try to use it.
Now if you run on Python 2, you could directly invoke the license detection as a function call on one file at a time. There, the overhead of the index load would happen only once. You would also bypass any JSON serialization/deserialization.
Unfortunately I'm using Python 3, so scancode is executed via the command line.
The other thing to consider is the accuracy of detection (it does not help to be fast if this is to be incorrect). A quick comparison shows a few false positive GPL and several imprecise detections in nomos. I will post a comparison table tomorrow
I attach the results for nomos and scancode obtained from the analysis of the vorbis repo. In a nutshell, each line in the files is a JSON representing the analysis for a given commit. In the attribute data.analysis you will find the output of scancode/nomos.
vorbis-analysis-nomos-vs-scancode.zip
(other repos are under analysis, once they are done I can share them with you if you want)
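For readers who want to process those attachments: each result file is JSON-lines as described, so a minimal reader might look like the sketch below (the `data.commit` and `data.analysis` field names are assumed from the description above; adjust them to the actual files):

```python
import json

def iter_commit_analyses(lines):
    """Yield (commit, analysis) pairs from a JSON-lines result file.

    Each line is assumed to be one JSON record per commit, with the
    scanner output under data.analysis, per the description above.
    """
    for line in lines:
        record = json.loads(line)
        data = record["data"]
        yield data["commit"], data["analysis"]
```

Feeding it an open file object (one JSON record per line) yields one pair per commit.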
OK... so there is something @armijnhemel was asking me about a while back, which would be a way to have a pre-fork daemon of sorts, such that there is always a pre-loaded process ready to scan.
@armijnhemel how would you do this?
Alternatively, scanning a list of paths may amortize the startup costs too. And I have some fixes for the #1404 issues that cut the scan time in half in some special cases (code files with mostly numeric data).
I would agree with what @pombredanne says: running scancode file by file is the worst case scenario. What I could imagine is something similar to clamd where you can send a path and a persistent process scans that path.
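A minimal sketch of that clamd-style idea: a persistent process listens on a socket, so the expensive engine setup happens once. Everything here is hypothetical (the line protocol, the helper names, and `stub_scan`, which stands in for a real, pre-loaded detection engine; this is not ScanCode's actual API):

```python
import json
import socket
import socketserver
import threading

def stub_scan(path):
    # Stand-in for a real license scan; a persistent daemon would call
    # its pre-loaded detection engine here instead of paying the
    # index-load cost on every invocation.
    return {"path": path, "licenses": []}

class ScanHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # clamd-style line protocol: one path in, one JSON result out.
        for raw in self.rfile:
            path = raw.decode("utf-8").strip()
            if not path:
                break
            reply = json.dumps(stub_scan(path)) + "\n"
            self.wfile.write(reply.encode("utf-8"))

def start_daemon():
    """Start the scanner daemon on an ephemeral localhost port."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), ScanHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def request_scan(host, port, path):
    """Client side: send one path, read back one JSON result."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((path + "\n").encode("utf-8"))
        return json.loads(sock.makefile("r").readline())
```

A client can then fire `request_scan(host, port, "lib/lpc.c")` per file without ever paying the startup cost again.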
Can scancode analyze lists of files (in a single invocation, I mean)? Maybe that could speed things up, since each commit may touch several files; in that case, running time would be roughly linear in the number of commits instead of the number of files touched across all commits. (I think this is what you mention above as "list of paths".)
On a related note, do you have plans to support Python3 in the near future? That could make things easier too...
@jgbarah Hey!
Can scancode analyze lists of files (in a single invocation, I mean)? Maybe that could speed things up, since each commit may touch several files; in that case, running time would be roughly linear in the number of commits instead of the number of files touched across all commits. (I think this is what you mention above as "list of paths".)
Not yet, but I started a branch that can do that now at https://github.com/nexB/scancode-toolkit/pull/1399, and it supports doing things such as:

```
git diff --name-only master | xargs scancode -i --json-pp -
```

as well as passing multiple paths in one call... it should soon land in develop.
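With that multiple-paths support, a caller like Graal could batch all files changed in a commit into one invocation; here is a sketch of building those argument lists (the helper name is hypothetical; the flags are the ones shown in this thread):

```python
def batched_commands(scancode_bin, commits):
    """Build one ScanCode invocation per commit instead of one per file.

    `commits` maps a commit id to the list of paths it changed; process
    launches drop from sum(len(paths)) to len(commits).
    """
    return {
        commit: [scancode_bin, "--license", "--json-pp", "-", *paths]
        for commit, paths in commits.items()
    }
```

Each argument list can then be handed to `subprocess.run`, paying the startup cost once per commit.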
See also #875 and #1397 ... this last one by @nicobucher is kinda timely with your evaluation.
On a related note, do you have plans to support Python3 in the near future? That could make things easier too...
Yes, see #295 ... this is going to be a GSoC project
I also have a simple remoting solution using execnet until then and this works nicely: I will push it later today for your enjoyment.
@valeriocos re
I attach the results for nomos and scancode obtained from the analysis of the vorbis repo. In a nutshell, each line in the files is a JSON representing the analysis for a given commit. In the attribute data.analysis you will find the output of scancode/nomos.
Thank you! I will post my eval later.
@valeriocos do you mind trying this https://github.com/nexB/scancode-toolkit/commit/8afa686fb71b9540029234e5a40c0572c4457c28 in the 1397-multiple-inputs branch? There is a README
thank you @pombredanne for this, I'll use it and report.
@valeriocos sure, note that the code has now been merged (in the develop branch)
Thank you @pombredanne and sorry for the late reply. I have prepared a branch to test scancli.py (https://github.com/valeriocos/grimoirelab-graal/blob/test-scancli/graal/backends/core/analyzers/scancode.py#L48). I'll execute some tests and report on the performance.
@valeriocos ping. How are things working out for you?
Hi @pombredanne , sorry for the late reply. I tried the new version of scancode and it's far faster than the previous one. For instance, for the following repo: https://github.com/xiph/vorbis
- scancode 3.0.0 took 20 hours, 52 minutes and 39 seconds
- scancli took 3 hours, 36 minutes and 38 seconds
I inspected some of the results, and in some cases I see differences (I guess this is due to improvements in scancode itself). For instance, for the repo https://github.com/xiph/vorbis and the commit 0695c7cbf5d766b7db3c664fa1bb82531c71fa38, I see that:
scancode 3.0.0 identified only one license, TU Berlin License 1.0 (see below an excerpt of the data):
"commit": "0695c7cbf5d766b7db3c664fa1bb82531c71fa38",
"Author": "Monty <xiphmont@xiph.org>",
"AuthorDate": "Wed Mar 29 03:49:29 2000 +0000",
"Commit": "Monty <xiphmont@xiph.org>",
"CommitDate": "Wed Mar 29 03:49:29 2000 +0000",
"message": "Don't want to lose anything while I'm integrating (also don;t want to\ndisturb mainline till I'm done)\n\nMonty\n\nsvn path=\/branches\/unlabeled-1.18.2\/vorbis\/; revision=286",
"analysis": [
{
"licenses": [
{
"key": "tu-berlin",
"score": 98.9,
"name": "Technische Universitaet Berlin Attribution License 1.0",
"short_name": "TU Berlin License 1.0",
"category": "Permissive",
"is_exception": false,
"owner": "Technische Universitaet Berlin",
"homepage_url": "https:\/\/github.com\/swh\/ladspa\/blob\/7bf6f3799fdba70fda297c2d8fd9f526803d9680\/gsm\/COPYRIGHT",
"text_url": "",
"reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:tu-berlin",
"spdx_license_key": "TU-Berlin-1.0",
"spdx_url": "https:\/\/spdx.org\/licenses\/TU-Berlin-1.0",
"start_line": 30,
"end_line": 39,
"matched_rule": {
"identifier": "tu-berlin.LICENSE",
"license_expression": "tu-berlin",
"licenses": [
"tu-berlin"
],
"is_license_text": true,
"is_license_notice": false,
"is_license_reference": false,
"is_license_tag": false
}
}
],
"file_path": "lib\/lpc.c"
}
],
"analyzer": "scancode"
scancli identified two licenses, GPL-2.0-only and TU Berlin License 1.0 (see below an excerpt of the data). A manual inspection points out that this result is more precise than the one above:
"commit": "0695c7cbf5d766b7db3c664fa1bb82531c71fa38",
"Author": "Monty <xiphmont@xiph.org>",
"AuthorDate": "Wed Mar 29 03:49:29 2000 +0000",
"Commit": "Monty <xiphmont@xiph.org>",
"CommitDate": "Wed Mar 29 03:49:29 2000 +0000",
"message": "Don't want to lose anything while I'm integrating (also don;t want to\ndisturb mainline till I'm done)\n\nMonty\n\nsvn path=\/branches\/unlabeled-1.18.2\/vorbis\/; revision=286",
"analysis": {
"licenses": [
{
"path": "lpc.c",
"type": "file",
"name": "lpc.c",
"base_name": "lpc",
"extension": ".c",
"size": 11008,
"date": "2019-03-15",
"sha1": "71398429be51a79438400d0317dd6c4ab03e97d3",
"md5": "e206cfa46afe1ff773767b934378b14d",
"mime_type": "text\/x-c",
"file_type": "C source, ASCII text",
"programming_language": "C++",
"is_binary": false,
"is_text": true,
"is_archive": false,
"is_media": false,
"is_source": true,
"is_script": false,
"licenses": [
{
"key": "gpl-2.0",
"score": 94.74,
"name": "GNU General Public License 2.0",
"short_name": "GPL 2.0",
"category": "Copyleft",
"is_exception": false,
"owner": "Free Software Foundation (FSF)",
"homepage_url": "http:\/\/www.gnu.org\/licenses\/gpl-2.0.html",
"text_url": "http:\/\/www.gnu.org\/licenses\/gpl-2.0.txt",
"reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:gpl-2.0",
"spdx_license_key": "GPL-2.0-only",
"spdx_url": "https:\/\/spdx.org\/licenses\/GPL-2.0-only",
"start_line": 3,
"end_line": 6,
"matched_rule": {
"identifier": "gpl-2.0_613.RULE",
"license_expression": "gpl-2.0",
"licenses": [
"gpl-2.0"
],
"is_license_text": false,
"is_license_notice": true,
"is_license_reference": false,
"is_license_tag": false,
"matcher": "2-aho",
"rule_length": 36,
"matched_length": 36,
"match_coverage": 100,
"rule_relevance": 100
},
"matched_text": "THIS FILE IS PART OF THE [Ogg] [Vorbis] SOFTWARE CODEC SOURCE CODE. *\n * USE, DISTRIBUTION AND REPRODUCTION OF THIS SOURCE IS GOVERNED BY *\n * THE GNU PUBLIC LICENSE 2, WHICH IS INCLUDED WITH THIS SOURCE. *\n * PLEASE READ THESE TERMS DISTRIBUTING. *"
},
{
"key": "tu-berlin",
"score": 98.9,
"name": "Technische Universitaet Berlin Attribution License 1.0",
"short_name": "TU Berlin License 1.0",
"category": "Permissive",
"is_exception": false,
"owner": "Technische Universitaet Berlin",
"homepage_url": "https:\/\/github.com\/swh\/ladspa\/blob\/7bf6f3799fdba70fda297c2d8fd9f526803d9680\/gsm\/COPYRIGHT",
"text_url": "",
"reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:tu-berlin",
"spdx_license_key": "TU-Berlin-1.0",
"spdx_url": "https:\/\/spdx.org\/licenses\/TU-Berlin-1.0",
"start_line": 30,
"end_line": 39,
"matched_rule": {
"identifier": "tu-berlin.LICENSE",
"license_expression": "tu-berlin",
"licenses": [
"tu-berlin"
],
"is_license_text": true,
"is_license_notice": false,
"is_license_reference": false,
"is_license_tag": false,
"matcher": "3-seq",
"rule_length": 91,
"matched_length": 90,
"match_coverage": 98.9,
"rule_relevance": 100
},
"matched_text": "Any use of this software is permitted provided that this notice is not\nremoved and that neither the authors nor the Technische [Universita]\"[t]\nBerlin are deemed to have made any representations as to the\nsuitability of this software for any purpose nor are held responsible\nfor any defects of this software. THERE IS ABSOLUTELY NO WARRANTY FOR\nTHIS SOFTWARE.\n\nAs a matter of courtesy, the authors request to be informed about uses\nthis software has found, about bugs in this software, and about any\nimprovements that may be of general interest."
}
],
"license_expressions": [
"gpl-2.0",
"tu-berlin"
],
"holders": [
{
"value": "Monty and The XIPHOPHORUS Company",
"start_line": 8,
"end_line": 10
},
{
"value": "Preserved by Jutta Degener and Carsten Bormann, Technische",
"start_line": 25,
"end_line": 28
}
],
"copyrights": [
{
"value": "(c) COPYRIGHT 1994-2000 by Monty <monty@xiph.org> and The XIPHOPHORUS Company http:\/\/www.xiph.org",
"start_line": 8,
"end_line": 10
},
{
"value": "Preserved Copyright 1992, 1993, 1994 by Jutta Degener and Carsten Bormann, Technische",
"start_line": 25,
"end_line": 28
}
],
"authors": [
{
"value": "Jutta Degener and Carsten Bormann",
"start_line": 20,
"end_line": 23
},
{
"value": "J. Durbin",
"start_line": 57,
"end_line": 57
}
],
"files_count": 0,
"dirs_count": 0,
"size_count": 0,
"scan_errors": [
],
"file_path": "lib\/lpc.c"
}
]
},
"analyzer": "scancode"
If you are interested, I can share with you the complete analysis of https://github.com/xiph/vorbis obtained via Graal and scancli.
@pombredanne should we close this issue or there is something left to discuss?
If scancli is ready/tested/stable, I'll include it in Graal in the next days.
Thanks!
@valeriocos I would like to make sure everything is OK before closing this. It also needs some docs for sure!
Thank you for answering @pombredanne , I'll keep an eye on the issue.
@armijnhemel that scanserv is something you wanted too BTW ;)
Hi @pombredanne,
I've embedded scancode (a really nice tool) in Graal, and now I'm evaluating scancode against nomos (another popular tool for license analysis) with respect to precision and performance.
In a nutshell, the evaluation consists of iterating over the commits of a set of git repositories; for each commit, graal performs a checkout and launches scancode/nomos on each file present in the commit; finally, the results are persisted on disk. While nomos is pretty fast (it processed 5 repos of around 3000 commits each in 2 hours), scancode is still processing the first repo. I'm wondering if I'm missing some parameters (or if you have some suggestions) to make the analysis faster. Currently I'm using release 3.0.0 and I launch it with the following params: https://github.com/chaoss/grimoirelab-graal/blob/master/graal/backends/core/analyzers/scancode.py#L58
Thank you
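Since most files are unchanged between consecutive commits, one way to cut the work in a loop like this is to memoize per-file results by content hash across the whole history; a sketch with a stub scanner (the helper is hypothetical, not Graal's actual code):

```python
import hashlib

def scan_repo_history(commits, scan_file, cache=None):
    """Scan every file of every commit, reusing results for unchanged content.

    `commits` maps a commit id to {path: file_bytes}; `scan_file` is any
    per-file scanner (e.g. one ScanCode invocation). Results are memoized
    by content SHA-1, so a file is only rescanned when its bytes change.
    """
    cache = {} if cache is None else cache
    results = {}
    for commit, files in commits.items():
        per_commit = {}
        for path, content in files.items():
            digest = hashlib.sha1(content).hexdigest()
            if digest not in cache:
                cache[digest] = scan_file(content)
            per_commit[path] = cache[digest]
        results[commit] = per_commit
    return results
```

With this, the number of scanner invocations is bounded by the number of distinct file versions rather than commits times files.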