aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/
2.1k stars 544 forks source link

Install via pip tries to use cache directory in /usr/local/lib #685

Closed dengste closed 5 years ago

dengste commented 7 years ago

I installed 'scancode-toolkit' via 'pip' on a Debian 9 box. On startup I get the following backtrace because a normal user obviously does not have permission to create files in /usr/local/lib:

 Traceback (most recent call last):
   File "/usr/local/bin/scancode", line 11, in <module>
 sys.exit(scancode())
   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
 return self.main(*args, **kwargs)
   File "/usr/local/lib/python2.7/dist-packages/scancode/utils.py", line 76, in main
 standalone_mode=standalone_mode, **extra)
   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
 rv = self.invoke(ctx)
   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
 return ctx.invoke(self.callback, **ctx.params)
   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
 return callback(*args, **kwargs)
   File "/usr/local/lib/python2.7/dist-packages/click/decorators.py", line 17, in new_func
 return f(get_current_context(), *args, **kwargs)
   File "/usr/local/lib/python2.7/dist-packages/scancode/cli.py", line 386, in scancode
 scans_cache_class = get_scans_cache_class()
   File "/usr/local/lib/python2.7/dist-packages/scancode/cache.py", line 93, in get_scans_cache_class
 cache_dir = fileutils.get_temp_dir(unicode(cache_dir), prefix=unicode(timeutils.time2tstamp()) + u'-')
   File "/usr/local/lib/python2.7/dist-packages/commoncode/fileutils.py", line 128, in get_temp_dir
 return tempfile.mkdtemp(prefix=prefix, dir=base)
   File "/usr/lib/python2.7/tempfile.py", line 339, in mkdtemp
 _os.mkdir(file, 0700)
 OSError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/.cache/scan_results_caches/2017-07-13T121227
JonoYang commented 7 years ago

@dengste You can opt to install pip packages to your home directory using the --user option. If you wanted to install it systemwide, you'd need to run pip with root access.

pombredanne commented 7 years ago

@JonoYang I think this has been installed system wide with sudo, right @dengste ? I reckon this is a somewhat uncommon installation pattern that was never tried before! What I think happens here is that ScanCode creates a cache for scan results and the license index. This cache is by default always created in the root directory which might end up being (incorrectly I reckon) /usr/local/lib/python2.7/.cache in the case of a global, system-wide installation... hence the permission denied error that you see.

A short term work around might be to do what @JonoYang suggests, e.g. a local user install. Not sure we have tested this either.

Another way is to install with pip in a local virtualenv: this is tested alright

Yet another way is fetch a release from https://github.com/nexB/scancode-toolkit/releases

pombredanne commented 7 years ago

Now this is a great bug ... especially since #487 absolutely needs this fixed.

The cache consists of two elements for now: a license index cache that could be global to a given installation and a transient per scan cache that would be best as local to a user.

For the sake of simplification, this might be best written to ~/.cache when the root directory of the scancode installation is not writeable (which is the case for a system install)....

There is another twist which is the license cache cannot be shared between multiple installations of ScanCode unless they have the exact same code and data. The license cache invalidation is done by taking a checksums of the installation directory tree per https://github.com/nexB/scancode-toolkit/blob/23d835ff2ad0587eedef4f9eebef87ee1ff0fe48/src/licensedcode/cache.py#L63 ... which breaks entirely when we have a pip installation, either local or global.

So it needs some thoughts.... one possibility could be to get the version of scancode (which is unique even when building/running from a checkout) and get or create a cache directory in ~/.cache/scancode/<version>/ at startup.... this would end in each user directory. This cache would then be used for this user's runs both for the scan-level and and the license-level caches.

The whole license cache invalidation and check needs to be reworked too as in a pip install there is no root directory any more: the root ends up being two levels above a module directory hence it lands in the very ugly /usr/local/lib/python2.7/ directory.

@dengste Thanks ++ for this great find!

pombredanne commented 7 years ago

This is also related to #357

pombredanne commented 7 years ago

So it needs some thoughts.... one possibility could be to get the version of scancode (which is unique even when building/running from a checkout) and get or create a cache directory in ~/.cache/scancode// at startup.... this would end in each user directory. This cache would then be used for this user's runs both for the scan-level and and the license-level caches.

Another thing which is about convenience for development or for a tarball/zip release installation: ideally in this case the .cache would be best local to the installation or checkout directory

dengste commented 7 years ago

Quick followup: Yes, I ran pip as root and did a systemwide install so that all users on the system can use it. When running as normal user, I would indeed expect scancode to use a directory in the user's home, either ~/.cache, or following the xdg specification (.local/share/scancode or similar).

pombredanne commented 7 years ago

@dengste Thanks. I guess ScanCode can run on a non-desktop so may be using a ~/.cache might be cleaner than relying on the slightly more complex XDG spec and environment (especially since its environment variables may not be present if nothing of X/Freedesktop is installed) ... It also allows to have consistent approach across OSes and seems common enough and clear.

pombredanne commented 6 years ago

this is eventually a problem with Docker images too, such as this: https://github.com/heremaps/oss-review-toolkit/blob/7b6fdfab6a35e4c1ce1362fd62e31ed023ef109d/Dockerfile

pombredanne commented 6 years ago

@sschuberth ping, for reference. This may impact your code .

pombredanne commented 6 years ago

@haikoschol @sschuberth

let me restate the skinny of what we need there (and this is also for me to rehash where we are and a solution) ... this is while I am reviewing #878

There are three cases where ScanCode writes things to disk that are not under the direct control of the user:

Of course scan results are written to files too, but this is at the the user-selected path. This are not a problem.

There are three run modes for ScanCode:

Let's look first at the temporary files written to disk in abstract of the usage mode as they are today:

1. temporary files that are short-lived for the duration of a function call

This is under the control of two functions used globally:

get_temp_dir(base_dir, prefix='') is the function used everywhere to get a temp dir (with the exception of some rare test cases that are not a problem here and create temp files directly for a special test case setup ). By design, there is NOT other way that is used that would call the tempfile standard module. It uses the arg-less system_temp_dir() to get the root of temp files which is essentially a global.

system_temp_dir() is arg-less and returns a system-wide temporary root directory. As a global this is the essence of the problem to have configurable temporary files locations for 1. and 2. This resolves today to a root directory being either:

  1. defined by the SCANCODE_TMP env var is present
  2. or the tempfile.gettempdir() call which returns various values:
    Return the directory currently selected to create temporary files in. If tempdir is not None, this simply returns its contents; otherwise, the search described above is performed, and the result returned.

    and the tempfile.tempdir global is set using a search of environment variables and eventually the current working dir.

As a note there is a large number of temp test files created during test runs.

2. temporary files that are longer-lived for the duration of a scan run

This is mostly the on-disk cached scan results to a scanner run. This is saved today in globals defined in scancode.__init.py e.g. <root of scancode install>/.cache/scan_results_caches and then a per-sun directory created in the warty cache.py with get_scans_cache_class(). Each scan run gets a new unique temp dir under .cache which also contains a timestamp.

3. cached files that are long-lived across scan runs and are valid for a given version of the code and data

This is mostly the license index on-disk cache, which is a costly operation and does not need to be recreated unless the license detection data or the code has changed (ideally this cache should be invalidated only if certain code changes were made and certain data changes were made, but cache invalidation is a complex matter, so any code or data change in the whole ScanCode tree should invalidate the cache). This goes to <root of scancode install>/.cache/license_index as defined in yet more globals with a complex cache validation mechanism using hashes of file paths and timestamps defined in licensedcode.cache.py The cache location is something that can be passed as args when you licensedcode.cache.get_index() so the globals can be bypassed.

So all in all we have many layers of globals that make things complicated.

Now let's look at the different run modes

... and what we have today and where we should go IMHO

A. a regular user is running the CLI either from a system-wide installation, a release archive, a git checkout or a snapshot archive.

This is and should be the default run mode unless another run mode is configured or selected (see B. and C.).

Today all goes in <install dir>/.cache for 2. and 3. and basically to /tmp/scancode/... for 1.

Also the initial "configuration" is made by the ./configure script when calling the ./scancode script. All paths there are essentially soft-hardcoded to be based on the root of the ScanCode installation dir. (which is the crux of the problem here BTW)

Instead there should be:

Since the code+data is supposed and assumed to be stable for a given version, there should be no need to check/invalidate persistent caches like the license cache.

Note that on *nix/Linux/POSIX, the common way is to store cached data at ~/user/.cache/<app name>

B. a developer is running the CLI and running tests in development mode from a checkout.

Since here the code+data evolves and therefore each new run is potentially using a new ScanCode version.

The temp files for 1. go to the system-wide temp dir and the 2. and 3. to <code root/.cache>

Today there is a special tag file created when calling ./configure (which defaults to create a dev configuration, IMHO a bad idea) This is <code root/SCANCODE_DEV_MODE>. This file is then checked and if present the license cache in 3. is revalidated on each run. This does not happen otherwise in mode A. where the index cache is created only once at the first run (and this implies that the ./scancode script is exclusively used for this)

Generally it makes sense to keep the cache local to the checkout directory and therefore the SCANCODE_DEV_MODE tag file kind of make sense. Note also that a ./configure call with the dev conf also cleans the .cache and regen the license index which is a tad of a pain as this slow things down.

C. ScanCode is used as a library embedded in some other application and called programmatically

Basically today the same as A. happens. Which is mostly B. except for cache checks.

Based on all this here is what I suggest and want to implement:

At a high level, get rid of globals as much as possible and use them for the user mode A. only as a default. This would be also the default for mode B. And allow the configuration of the temporary and cache directories at various levels.

Putting the needs of users in mode A. and C. first, this would mean:

For the mode C. (library usage)

For the mode B. (development)

haikoschol commented 6 years ago

@pombredanne Thanks for the detailed writeup! I've made a few attempts at digesting it, but I'm still not sure how exactly this affect #878. Overall it seems very complex to me and I believe a few permutations of use cases and installation methods could be dropped. But that goes down a deep rabbit hole where I'm wondering why the configure script and all the custom setup logic exists.

I fully agree with your suggestions to minimize globals and have "dev mode" not be the default. Adding command line options to set paths for temp and cache files also makes sense.

pombredanne commented 6 years ago

@haikoschol

Thanks for the detailed writeup! I've made a few attempts at digesting it, but I'm still not sure how exactly this affect #878. Overall it seems very complex to me and I believe a few permutations of use cases and installation methods could be dropped.

Actually the writeup is complex, but the code ends up being rather simple after all.

But that goes down a deep rabbit hole where I'm wondering why the configure script and all the custom setup logic exists.

That's a fair question. Very simply, configure + thirdparty exist such that when scancode is used as an app, it is fully self-contained and working out of the box for anyone with only Python. The alternative could be to use a script which uses requirement files .... but that's exactly what configure is: a script that call pip and requirement files :P Of course none of this is strictly needed in development and not needed when used as library. But at least for dev+tests, it also ensure that there is a strict control of the deps installed and used: here everyone uses strictly the same ones (one or more pinned pip requirements file would achieve the same though it would not help with installing as a self-contained application)

I fully agree with your suggestions to minimize globals and have "dev mode" not be the default. Adding command line options to set paths for temp and cache files also makes sense.

These are not in the #885 branch.

pombredanne commented 6 years ago

@dengste Thank you for your patience! The code from #885 has been merged in develop and you should no longer have these issues. The default temp is /tmp and the default cache goes to ~/.cache/scancode-tk Closing!

sschuberth commented 6 years ago

@pombredanne, strictly speaking this should only be closed once there's a new release on PyPI, IMO.

pombredanne commented 6 years ago

@sschuberth good point, but I am generally trying to avoid keeping open yet fixed ticket :P I will make an exception!

sschuberth commented 6 years ago

IMO part of the "definition of done" should be: The original reporter is in a position to (easily) verify the fix. And here he / she cannot unless you publish a new package šŸ˜

pombredanne commented 6 years ago

I sometimes wonder what I would do without you keeping me honest and in check! šŸ‘

pombredanne commented 6 years ago

@dengste feedback welcomed. I would like to close this since this has been fixed in develop

pombredanne commented 5 years ago

This is merged in develop and all tags since 2.9.7 at least