Add support for caching of individual files scan results

RomanIakovlev commented 1 week ago

Short Description

Add support for a persistent scan cache at the individual-file level, to improve the performance of repeated scans as well as scans of similar packages.

Possible Labels

new feature

Select Category

[x] Enhancement
[ ] Add License/Copyright
[x] Scan Feature
[ ] Packaging
[ ] Documentation
[ ] Expand Support
[ ] Other

Describe the Update

When scanning subsequent releases of the same package, it would be great to avoid re-scanning unchanged files. This could increase performance of such scans dramatically, especially for large projects with frequent releases, because it's pretty typical that only a small subset of files changes between releases.

Concretely, I propose adding support for a scan cache. It might work as follows: before a scanner function runs on a file, it consults the scan cache to see whether this file, identified by its hash, has already been processed. If it finds a cached result, the scanner function returns that result immediately, without executing its main logic.

This cache should obviously be persistent across scancode runs to be useful, although cache invalidation and bypass mechanisms might be required.
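
To make the idea concrete, here is a minimal sketch of the lookup described above. It is only an illustration: `cached_scan`, `sha256_of_file`, the `scanner` callable and the one-JSON-file-per-hash layout are all hypothetical names and choices, not existing ScanCode APIs.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file's content, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_scan(path, scanner, cache_dir, bypass=False):
    """Run scanner(path) unless a result for identical content is cached.

    `scanner` is a stand-in for any callable that returns JSON-serializable
    data; it is an assumption of this sketch, not a ScanCode function.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache entry is keyed by content hash, so the file name is irrelevant.
    entry = cache_dir / (sha256_of_file(path) + ".json")

    if not bypass and entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))

    result = scanner(path)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because entries live on disk, they survive across runs, and the `bypass` flag forces a fresh scan that overwrites the stored entry.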

How This Feature will help you/your organization

The ClearlyDefined project would benefit a lot from this feature. Scanning releases of popular packages is the main use case for ClearlyDefined, and it spends a lot of compute resources re-scanning unchanged files.

Possible Solution/Implementation Details

I think this could potentially be implemented as a scan-time plugin, as described here.

Can you help with this Feature

I can help with this feature, if given some guidance.

cc @pombredanne

P.S. If something like this is already possible in ScanCode-Toolkit, could you please point me to the relevant documentation? I've seen mentions of the SCANCODE_CACHE environment variable, but as far as I understand, it's a different type of cache.

elrayle commented 5 days ago

@RomanIakovlev You may already envision this, but I want to state it explicitly. This would be even more powerful if there were a hash for each tree in the directory structure. That would allow skipping the scan of many files at once when a directory hasn't changed.

The value of this is that there are a lot of releases (e.g. bug fixes, patches, even some minor versions) that change only a small subset of the files. If the files are unchanged, then any licenses coming from those files are unchanged.
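
A per-tree hash could be computed Merkle-style, where each directory's digest is derived from the names and digests of its children. The sketch below is one possible way to do it; `file_digest` and `tree_digest` are hypothetical helpers, and it assumes a plain tree without broken or cyclic symlinks.

```python
import hashlib
import os


def file_digest(path):
    """Return the hex SHA-256 digest of a file's content."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digest(directory):
    """Return a digest covering the names and contents of everything under `directory`.

    Children are visited in sorted order so the result does not depend on
    the order in which the filesystem lists them. Symlinks are followed,
    which is a simplification of this sketch.
    """
    digest = hashlib.sha256()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            kind, child = "tree", tree_digest(path)
        else:
            kind, child = "file", file_digest(path)
        # Mix in the entry type, its name and its digest.
        digest.update(f"{kind}\0{name}\0{child}\n".encode("utf-8"))
    return digest.hexdigest()
```

If `tree_digest` of a directory matches the value recorded for a previous release, every file below it is unchanged and the whole subtree could reuse its cached results.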

@pombredanne What are your thoughts on this?

AyanSinhaMahapatra commented 5 days ago

@RomanIakovlev thank you for the issue and thanks for offering to help! This would be a great feature indeed. @elrayle thank you for your comment and suggestions too.

We have a subset of this functionality in deltacode, and some plans to move it into ScanCode and integrate it through purldb (a database of packages). With that in place, when you scan a new version of a package, we can check the previous version we have for it, compute a delta, scan only the changed files, and apply summaries/curations based on changes in key files.
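
As a rough illustration of the delta step (not how deltacode or purldb actually implement it), the set of files to rescan could be derived from two path-to-hash mappings. `files_to_rescan` and the mapping shape are assumptions made for this sketch.

```python
def files_to_rescan(previous, current):
    """Return paths in `current` whose content hash was not seen in `previous`.

    Both arguments map a relative path to a content hash. Comparing by hash
    rather than by path means renamed or moved files are not rescanned.
    """
    seen_hashes = set(previous.values())
    return sorted(
        path for path, digest in current.items() if digest not in seen_hashes
    )


# Hypothetical hashes for two releases of the same package.
previous = {"a.py": "h1", "LICENSE": "h2"}
current = {"a.py": "h1", "LICENSE": "h3", "src/b.py": "h2", "new.py": "h4"}
print(files_to_rescan(previous, current))  # ['LICENSE', 'new.py']
```

Here `src/b.py` is skipped because its content matches a file already scanned in the previous release, even though its path is new.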

This would be extremely beneficial, as we would not rescan unchanged files, and it would speed up package scans where we already have previous versions in the db. We would be very happy to help with the effort here as required; let us know if you have questions or need any clarification.

elrayle commented 4 days ago

I had two thoughts on a need to invalidate the cache, however it is implemented. In both cases, scancode may produce a different result than in previous scans.

updating scancode versions
updating the list of supported licenses in the scancode db

Invalidation in both cases would need to be controlled by the app using scancode-toolkit, since the app determines the scancode version in use and, at least for ClearlyDefined, we keep a translation map for scancode licenses that gets updated periodically.

AyanSinhaMahapatra commented 4 days ago

> In both cases, scancode may produce a different result than in previous scans.

Great point!

We now have a --todo option in scancode-toolkit that reports potential issues in license/package manifest detection (for example, a missing license or license rule, or missing support for a package manifest type; more issue types can be added here). If we run with this option and no issues are reported, we can safely assume that a new scancode/licensedb version will not significantly affect the scan results.

But there probably needs to be some cache invalidation for major scancode version updates, as these are usually much larger changes with lots of updates.
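
One way to get this kind of invalidation without explicitly purging anything is to fold the scanning context into the cache key, so entries from an older context are simply never hit. The sketch below is an assumption-laden illustration: `cache_key` and its parameters are hypothetical, and the calling app would decide what to pass as `license_data_version` (for example, an identifier for its license map).

```python
import hashlib


def cache_key(file_sha256, scancode_version, license_data_version):
    """Build a cache key that changes when the file or the scanning context changes.

    Only the major part of `scancode_version` is used, so minor and patch
    upgrades keep hitting existing entries while a major upgrade does not.
    """
    major = scancode_version.split(".")[0]
    context = f"scancode-major={major};licenses={license_data_version}"
    return hashlib.sha256(f"{context}:{file_sha256}".encode("utf-8")).hexdigest()
```

With this scheme, bumping either the major version or the caller-supplied license data identifier yields new keys for every file, which acts as a full invalidation controlled by the app.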

RomanIakovlev commented 4 days ago

Thanks @AyanSinhaMahapatra for your feedback, I'm glad you're open to adding this. It would be useful for me to figure out the general approach for implementing this.

In my original post above I suggested using a scan-time plugin for this feature. I'm not really sure it's a good option, because I've only spent a short time reading through the ScanCode docs, and I'm not familiar with its codebase at all. A good first step would be to settle high-level architectural decisions like this one. Do you have any suggestions on how this should be implemented?

pombredanne commented 4 days ago

@elrayle re:

> at least for ClearlyDefined, we keep a translation map for scancode licenses that gets updated periodically.

Can you elaborate on the purpose of this map?

elrayle commented 3 days ago

@pombredanne RE: scancode mapping - This map has been around since 2018. In its current state, it is generated by a GH workflow that reads https://scancode-licensedb.aboutcode.org/index.json and uses it to generate the map.

My understanding from discussions with others on the project is that some of the retrieved licenses are not valid SPDX. If scancode recognizes the license, we want to set it to the known SPDX license or to a scancode LicenseRef. If this becomes unnecessary in the future, we can reevaluate it.

elrayle commented 3 days ago

@pombredanne @AyanSinhaMahapatra How complex do you think it would be to add support for a cache? If you were going to do the work, how long do you think it would take? We are doing some planning. With your familiarity with the code, I'm hoping you have a sense of the effort involved.
