aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Emails are only reported once #2444

Open tdruez opened 3 years ago

tdruez commented 3 years ago

When scanning the following text, daniel@haxx.se is reported only once in the results even though it appears multiple times in the file.

scancode -ce --json-pp -

Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: curl
Source: http://curl.haxx.se

Files: *
Copyright: 1996-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Files: lib/vtls/axtls.*
Copyright: 2010, DirecTV
 2010-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Files: lib/vtls/darwinssl.*
Copyright: 2012-2014, Nick Zitzmann <nickzman@gmail.com>
 2012-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Files: lib/curl_rtmp.*
Copyright: 2010, Howard Chu <hyc@highlandsun.com>
License: curl

Files: lib/vtls/schannel.*
Copyright: 2012-2014, Marc Hoersken <info@marc-hoersken.de>
 2012, Mark Salisbury <mark.salisbury@hp.com>
 2012-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Results:

"emails": [
        {
          "email": "daniel@haxx.se",
          "start_line": 6,
          "end_line": 6
        },
        {
          "email": "nickzman@gmail.com",
          "start_line": 15,
          "end_line": 15
        },
        {
          "email": "hyc@highlandsun.com",
          "start_line": 20,
          "end_line": 20
        },
        {
          "email": "info@marc-hoersken.de",
          "start_line": 24,
          "end_line": 24
        },
        {
          "email": "mark.salisbury@hp.com",
          "start_line": 25,
          "end_line": 25
        }
      ],

Ayushsunny commented 3 years ago

Hello @tdruez, I want to work on this issue. Can you explain a bit more about where this code is and what the exact issue is, if the result is right?

sritasngh commented 3 years ago

@tdruez General observation:

$ ./scancode -e --json-pp - input.txt

input.txt:

Files: lib/vtls/schannel.*
Copyright: 2012-2014, Marc Hoersken <info@marc-hoersken.de>
 2012, Mark Salisbury <mark.salisbury@hp.com>
 2012-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Files: lib/vtls/darwinssl.*
Copyright: 2012-2014, Nick Zitzmann <nickzman@gmail.com>
 2012-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Files: lib/vtls/darwinssl.*
Copyright: 2012-2014, Nick Zitzmann <nickzman@gmail.com>
 2012-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Result:

      "emails": [
        {
          "email": "info@marc-hoersken.de",
          "start_line": 2,
          "end_line": 2
        },
        {
          "email": "mark.salisbury@hp.com",
          "start_line": 3,
          "end_line": 3
        },
        {
          "email": "daniel@haxx.se",
          "start_line": 4,
          "end_line": 4
        },
        {
          "email": "nickzman@gmail.com",
          "start_line": 8,
          "end_line": 8
        }
      ],
      "scan_errors": []

So whenever an email is repeated, only its first occurrence is listed. @pombredanne Is this a bug or a feature?

sritasngh commented 3 years ago

Hello @tdruez, I want to work on this issue. Can you explain a bit more about where this code is and what the exact issue is, if the result is right?

@Ayushsunny You can find the code in /src/cluecode/

pombredanne commented 3 years ago

@itssingh re:

Is this bug or a feature?

A bit of both. There are two sides:

  1. There is a built-in limit on the number of emails or URLs reported: https://github.com/nexB/scancode-toolkit/blob/96c73a2761eee3c1d8ba57c47efaa475f7459409/src/cluecode/plugin_email.py#L37 and https://github.com/nexB/scancode-toolkit/blob/96c73a2761eee3c1d8ba57c47efaa475f7459409/src/cluecode/plugin_url.py#L38
  2. There is a unique flag in https://github.com/nexB/scancode-toolkit/blob/96c73a2761eee3c1d8ba57c47efaa475f7459409/src/cluecode/finder.py#L127 that should probably not be set by default, or that might need to be exposed as a CLI option (see https://github.com/nexB/scancode-toolkit/blob/96c73a2761eee3c1d8ba57c47efaa475f7459409/src/cluecode/finder.py#L200 for the URL equivalent).
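
To make the two sides concrete, here is a minimal standalone sketch of how a uniqueness flag and a result limit could interact in an email finder. It is not ScanCode's code: the function name find_emails, its unique and limit parameters, and the simplified regex are all assumptions made for illustration only.

```python
import re
from typing import Iterator, Tuple

# Deliberately simple pattern for illustration; ScanCode's real regex differs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str, unique: bool = True, limit: int = 0) -> Iterator[Tuple[str, int]]:
    """Yield (email, line_number) pairs found in text (hypothetical helper).

    unique: when True, report only the first occurrence of each address.
    limit: stop after this many results; 0 means no limit.
    """
    seen = set()
    count = 0
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in EMAIL_RE.finditer(line):
            email = match.group().lower()
            if unique:
                if email in seen:
                    continue  # repeated address: skipped when unique is on
                seen.add(email)
            yield email, line_number
            count += 1
            if limit and count >= limit:
                return


sample = """Files: lib/vtls/schannel.*
Copyright: 2012-2014, Marc Hoersken <info@marc-hoersken.de>
 2012, Mark Salisbury <mark.salisbury@hp.com>
 2012-2015, Daniel Stenberg <daniel@haxx.se>
License: curl

Files: lib/vtls/darwinssl.*
Copyright: 2012-2014, Nick Zitzmann <nickzman@gmail.com>
 2012-2015, Daniel Stenberg <daniel@haxx.se>
License: curl
"""

deduplicated = list(find_emails(sample, unique=True))
every_occurrence = list(find_emails(sample, unique=False))
```

With unique=True the sketch yields daniel@haxx.se only for line 4, mirroring the behaviour reported above; with unique=False it also yields the occurrence on line 9. Whether ScanCode should change its default or expose the flag as an option is the open question in this issue.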