aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://aboutcode.org/scancode/
2.13k stars 550 forks source link

Getting traceback calls & no report is generated. #2908

Open kiranravindran90 opened 2 years ago

kiranravindran90 commented 2 years ago

Getting traceback calls & no report is generated.

How To Reproduce

While scanning source code files, after encountering a .sh file, this error seemed to be popping out (image below). MicrosoftTeams-image

Many Traceback calls were followed by a "Multiprocessing Context Timeout Error"

System configuration

pombredanne commented 2 years ago

@kiranravindran90 Thanks: do you have instructions so we can reproduce your setup? including a full trace?

kiranravindran90 commented 2 years ago

@pombredanne Find the full trace & instruction details.


Scan Instruction : scancode -clpi -n 2 --timeout 3600 --json .json

Package : (~15GB) Linux based Flashwares & source code.

File types : .bin, .bct, .dat, .dtb, .img, .cpio

Trace : Removing temporary files...done. Traceback (most recent call last): File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\Scripts\scancode-script.py", line 33, in sys.exit(load_entry_point('scancode-toolkit', 'console_scripts', 'scancode')()) File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\click\core.py", line 1137, in call return self.main(args, kwargs) File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\commoncode\cliutils.py", line 69, in main return click.Command.main( File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\click\core.py", line 1062, in main rv = self.invoke(ctx) File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\click\core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\click\core.py", line 763, in invoke return __callback(args, *kwargs) File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\click\decorators.py", line 26, in new_func return f(get_current_context(), args, **kwargs) File "c:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\src\scancode\cli.py", line 451, in scancode success, _results = run_scan( File "c:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\src\scancode\cli.py", line 887, in run_scan scan_success = run_scanners( File "c:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\src\scancode\cli.py", line 1125, in run_scanners scan_success = scan_codebase( File "c:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\src\scancode\cli.py", line 1236, in scan_codebase scan_timings) = next(scans) File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\click_termui_impl.py", line 116, in next return next(iter(self)) File "C:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\lib\site-packages\commoncode\cliutils.py", line 172, in generator for rv in self.iter: File "c:................\scancode-toolkit-30.1.0_py39-windows\scancode-toolkit-30.1.0\src\scancode\pool.py", line 52, in wrap return func(self, timeout=timeout or 3600) File "C:\Program Files\Python39\lib\multiprocessing\pool.py", line 865, in next raise TimeoutError from None multiprocessing.context.TimeoutError


Jeeppler commented 2 years ago

This is related to: #2160

Jeeppler commented 2 years ago

I am facing the same issue.

OS: Linux (Debian) Python: 3.9 Scancode version: 30.1.0 Installation: download (container Dockerfile is public)

Timeout option set to: --timeout 18000 (5 * 3600 = 18000; 3600 seconds are one hour)

However, the scan still stops after 3600 seconds (60 minutes). I looked at https://github.com/nexB/scancode-toolkit/blob/develop/src/scancode/pool.py#L52. However, I could not really understand how it works.

The first, what is this line actually doing:

func(self, timeout=timeout or 3600)

What exactly does the statement timeout or 3600 does?

Second, what is the purpose of the wrapped method/function?

pombredanne commented 2 years ago

@Jeeppler there is a new release of ScanCode and I would like to see if you are also experiencing the issue there. Can you try the latest 31.x? But there is no application archive for Python 3.9 ... only 3.8 for now. Let me review your Dockerfile at https://github.com/mercedes-benz/sechub/blob/develop/sechub-pds-solutions/scancode/docker/ScanCode-Debian.dockerfile

in https://github.com/nexB/scancode-toolkit/blob/6007301cca4eff424202abf197467ee2f004e139/src/scancode/pool.py#L52 The first, what is this line actually doing: func(self, timeout=timeout or 3600) What exactly does the statement timeout or 3600 does?

timeout=timeout or 3600 will assign the timeout argument with the value of the timeout variable if it has a value or with 3600 seconds (aka. 1 hour) otherwise if timeout is None or 0. It essentially provides a default timeout if none was provided.

Second, what is the purpose of the wrapped method/function? in https://github.com/nexB/scancode-toolkit/blob/6007301cca4eff424202abf197467ee2f004e139/src/scancode/pool.py#L39

I reckon it feels a bit arcane. This is a function that returns a function such that calling the inner function will always be subject to a timeout. The code is there because of historical multiprocessing bugs that have been existing in the past and a lot of trial and error to get things working correctly (obviously not enough!):

... long story made short: there are likely some changes and fixes in the code of Python that mean we may not need anymore some of this machinery. The difficult thing here is that we support Windows, macOS and Linux and they have subtly different bugs in their Python implementations at times and not-so-subtle differences on how they each handle forking and spawing processes. We support all of them :|

If you think you could help track down these issues, and have a reliably failing test case, I could may be try to simplify the code in a branch and let you test this?

Jeeppler commented 2 years ago

Can you try the latest 31.x? But there is no application archive for Python 3.9 ... only 3.8 for now.

Since Debian 11 comes with Python 3.9 it would be nice to have an archive for Python 3.9 otherwise I would have to install ScanCode via pip. Are there any problems installing ScanCode via pip rather than the archive?

If you think you could help track down these issues, and have a reliably failing test case, I could may be try to simplify the code in a branch and let you test this?

I would love to help fix the bug, it includes coding something as well.

However, I was unable so far to find public test cases/files to test it with. The files I use, I am not allowed to share. I tried it with ISO images of Linux Distributions, for example with the one from Rocky Linux. Those images are huge, but the scan takes only a few minutes.

In case you have an idea what type of binaries could work let me know. I am willing to test them.

Jeeppler commented 2 years ago

Related to: #2703

Jeeppler commented 2 years ago

Can I disable the timeout if I set it to 0? Or does it set the fallback in this case to 3600 seconds?

Jeeppler commented 2 years ago

I was able to reproduce the issue with the Armbian image of the Jetson Nano:

Armbian_22.08.1_Jetson-nano_jammy_current_5.18.19.img

The image takes more than an hour to scan and on Debian 11 with Scancode-Toolkit version: 30.1.0. I get the following output:

/tools/scancode-toolkit-30.1.0/src/cluecode/copyrights.py:3382: FutureWarning: Possible set difference at position 3
  remove_tags = re.compile(
 Setup plugin: scan:info...
 Setup plugin: scan:licenses...
 Setup plugin: scan:copyrights...
 Setup plugin: scan:packages...
 Setup plugin: scan:emails...
 Setup plugin: scan:urls...
 Setup plugin: output:spdx-tv...
Collect file inventory...
Scan files for: info, licenses, copyrights, packages, emails, urls with 2 process(es)...
Removing temporary files...done.
Traceback (most recent call last):
  File "/tools/scancode-toolkit-30.1.0/bin/scancode", line 33, in <module>
    sys.exit(load_entry_point('scancode-toolkit', 'console_scripts', 'scancode')())
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/commoncode/cliutils.py", line 69, in main
    return click.Command.main(
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/tools/scancode-toolkit-30.1.0/src/scancode/cli.py", line 451, in scancode
    success, _results = run_scan(
  File "/tools/scancode-toolkit-30.1.0/src/scancode/cli.py", line 887, in run_scan
    scan_success = run_scanners(
  File "/tools/scancode-toolkit-30.1.0/src/scancode/cli.py", line 1125, in run_scanners
    scan_success = scan_codebase(
  File "/tools/scancode-toolkit-30.1.0/src/scancode/cli.py", line 1236, in scan_codebase
    scan_timings) = next(scans)
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/click/_termui_impl.py", line 116, in __next__
    return next(iter(self))
  File "/tools/scancode-toolkit-30.1.0/lib/python3.9/site-packages/click/_termui_impl.py", line 328, in generator
    for rv in self.iter:
  File "/tools/scancode-toolkit-30.1.0/src/scancode/pool.py", line 52, in wrap
    return func(self, timeout=timeout or 3600)
  File "/usr/lib/python3.9/multiprocessing/pool.py", line 865, in next
    raise TimeoutError from None
multiprocessing.context.TimeoutError

I set --timeout 18000 (5 hours). However, the scan only takes a hour and then I get the multiprocessing.context.TimeoutError.

pombredanne commented 2 years ago

@Jeeppler Thanks for digging further. https://redirect.armbian.com/jetson-nano/Jammy_current is 350MB of xzipped data is massive. So I guess that we should rather skipp scanning such an image for anything but file information. We can do it just based on the size, with a default and a possible new CLI option?

What should be the default max size above which it does not make sense to attempt scanning? 100MB? 200MB?

pombredanne commented 2 years ago

(FWIW, the uncompressed disk image above is 1.7GB ;) )

pombredanne commented 2 years ago

Can I disable the timeout if I set it to 0? Or does it set the fallback in this case to 3600 seconds?

you can but that may make all fail. You can make it super short as 0.001 second if you want

Jeeppler commented 2 years ago

So I guess that we should rather skipp scanning such an image for anything but file information

No, no skipping of the file. The requirement is to scan files in that size or bigger for license information. It takes hours, but Scancode can find license information in files of that size.

In case, you have more questions about the use case you can reach me via email using the email address located in this file: https://github.com/mercedes-benz/sechub/blob/develop/MAINTAINERS.md

Jeeppler commented 2 years ago

The problem is, that even if I set the --timeout above 3600 I get the multiprocessing.context.TimeoutError after exactly one hour (3600 seconds = 1 hour). My expectation, would be that --timeout 18000 will increase the timeout duration to 5 hours and the scan might work.

I encountered the problem under Linux. It apparently works under Windows.

Jeeppler commented 2 years ago

I ran the test again with the new version of scancode under Ubuntu 22.04 (Mint 21).

Setup:

$ python3 --version
Python 3.10.6

$ scancode --version
ScanCode version: 31.2.1
ScanCode Output Format version: 2.0.0
SPDX License list version: 3.17

I installed scancode via pip.

$ time scancode --timeout 3800 --verbose --strip-root --copyright --license --package --email --url --info --spdx-tv scan.txt Armbian_22.08.1_Jetson-nano_jammy_current_5.18.19.img
 Setup plugin: scan:info...
 Setup plugin: scan:licenses...
 Setup plugin: scan:copyrights...
 Setup plugin: scan:packages...
 Setup plugin: scan:emails...
 Setup plugin: scan:urls...
 Setup plugin: output:spdx-tv...
Collect file inventory...
Scan files for: info, licenses, copyrights, packages, emails, urls with 1 process(es)...
Removing temporary files...done.
Traceback (most recent call last):
  File "/home/user/.local/bin/scancode", line 8, in <module>
    sys.exit(scancode())
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/commoncode/cliutils.py", line 70, in main
    return click.Command.main(
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/scancode/cli.py", line 452, in scancode
    success, _results = run_scan(
  File "/home/user/.local/lib/python3.10/site-packages/scancode/cli.py", line 905, in run_scan
    scan_success = run_scanners(
  File "/home/user/.local/lib/python3.10/site-packages/scancode/cli.py", line 1144, in run_scanners
    scan_success = scan_codebase(
  File "/home/user/.local/lib/python3.10/site-packages/scancode/cli.py", line 1252, in scan_codebase
    scan_timings) = next(scans)
  File "/home/user/.local/lib/python3.10/site-packages/click/_termui_impl.py", line 116, in __next__
    return next(iter(self))
  File "/home/user/.local/lib/python3.10/site-packages/click/_termui_impl.py", line 328, in generator
    for rv in self.iter:
  File "/home/user/.local/lib/python3.10/site-packages/scancode/pool.py", line 52, in wrap
    return func(self, timeout=timeout or 3600)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 868, in next
    raise TimeoutError from None
multiprocessing.context.TimeoutError

real    60m43,960s
user    4m13,170s
sys 3m25,289s

Scancode was running 60 minutes, before raising the multiprocessing.context.TimeoutError. However, I specified a timeout of --timeout 3800, therefore scancode should have ran for 63 minutes.

pombredanne commented 2 years ago

@Jeeppler Thanks.... clearly we need to revisit the timeout story! Thank you for all these efforts.

pombredanne commented 2 years ago

@Jeeppler BTW https://github.com/mercedes-benz/sechub looks awesome! You may find https://github.com/nexb/vulnerablecode and https://github.com/nexB/vulnerablecode/pull/801 of some interest.

pombredanne commented 2 years ago

@Jeeppler

Since Debian 11 comes with Python 3.9 it would be nice to have an archive for Python 3.9

OK. Let's have a issue to track this. Do you mind to enter one? This should be easy enough now that we automated the whole release build. An alternative could be an appimage of sorts. See also #2836

otherwise I would have to install ScanCode via pip. Are there any problems installing ScanCode via pip rather than the archive?

This should work fine and is tested on each build using the latest version of all deps with https://github.com/nexB/scancode-toolkit/blob/ded56e9120f5fdfb9a1a0309130bb4305a66aacb/azure-pipelines.yml#L194 But I cannot gaurantee that all the versions of the dependencies that otherwise satisfy the constraints work, as we do retest previously released versions with newer deps: that would be practically difficult and resource hungry.

If you want to have the deps "same-as-the-app", then something such as this may be better as it will use the same versions as the app:

pip install --constraint https://raw.githubusercontent.com/nexB/scancode-toolkit/v31.2.1/requirements.txt scancode-toolkit==31.2.1

Jeeppler commented 2 years ago

@pombredanne

Thanks for explaining what the difference between the package archive and the pip installation is. I did not know about the --constraint flag, that is interesting.

Furthermore:

Jeeppler commented 2 years ago

I did another test with the Armbian image: Armbian_22.08.1_Jetson-nano_jammy_current_5.18.19.img. What I observed was that, first the scancode process has a waiting channel of 0, meaning it is running. The CPU usage is very high. One core will be at 100% of utilization. Indicating to me: scancode does it's work.

After a few minutes (approx. 3-5), all the memory and the entire swap partition are used. See attached screenshots.

scancode_memory_cpu_usage

Then suddenly, CPU utilization and memory drop. Looking at the processes. The scancode process waiting channel will be stuck at pipe_read. I am not sure what happened with scancode:

Scan command used:

time scancode --timeout 36000 --verbose --strip-root --license --spdx-tv Armbian.txt Armbian_22.08.1_Jetson-nano_jammy_current_5.18.19.img
Jeeppler commented 2 years ago

I see the same pattern with a 64 GB machine.

Is there are way to profile/investigate it?

Jeeppler commented 1 year ago

I was able to confirm: pool.py#L52 is the problem. We overwrote it with a fixed value: https://github.com/mercedes-benz/sechub/blob/develop/sechub-pds-solutions/scancode/docker/Scancode-Debian.dockerfile#L52. Now, it does work for us.

I am not sure what the issue is with Armbian_22.08.1_Jetson-nano_jammy_current_5.18.19.img and the memory usage. However, Scancode does need a lot of memory to scan a large file.

pombredanne commented 1 year ago

@Jeeppler Very nicely finding. Thank you. This may be related to https://github.com/nexB/scancode-toolkit/issues/3143 by @nnobelis @nnobelis Do you mind to mind this patch on your big scan? @Jeeppler Do you mind to submit a patch/PR?

pombredanne commented 1 year ago

I am not sure what the issue is with Armbian_22.08.1_Jetson-nano_jammy_current_5.18.19.img and the memory usage. However, Scancode does need a lot of memory to scan a large file.

Each whole file is loaded in memory indeed

Jeeppler commented 1 year ago

@pombredanne I am happy to submit a patch/PR and fix this issue.

However, my current solution is to overwrite pool.py with brute force. I would have to look at the code in depth to understand it better. I probably will have more question along the way about the Scancode code base.

My expectation is, that the fix allows the user to set the value using the --timeout parameter. I do not consider hard coding the timeout value to 10 hours as a long term solution.

nnobelis commented 1 year ago

@pombredanne No sorry I cannot test the changes unless there is a ScanCode release of it (I get ScanCode from PIP when building the Docker image).

Jeeppler commented 1 year ago

@nnobelis we install the Scancode using pip as well, but later overwrite the pool.py in the container:

nnobelis commented 1 year ago

Thanks @Jeeppler. I will have a look when I have capacity.