aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, and dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB, and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

stream JSON lines output #3619

Open armijnhemel opened 10 months ago

armijnhemel commented 10 months ago

Short Description

The current JSON lines feature doesn't write information to a file (or stdout) until scancode has completely finished. It might be useful to start writing results earlier. This should be possible (except for the aggregated information in license_detections) as all files are scanned independently (I suppose).

Describe the Update

I want a JSON lines feature that starts writing results to the output earlier. As every file is scanned independently, this should be possible. The aggregated license_detections information could be written last as an "end of stream" token.

How This Feature will help you/your organization

I want to be able to track progress. I might also want to pipe the output to another program that processes scancode information without having to wait until scancode has completely finished. This feature might also help reduce the memory usage of scancode, because data can be discarded as soon as it has been written (apart from the information needed for license_detections).
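A downstream consumer of such a stream could then process one record at a time with bounded memory. This is a minimal sketch assuming a hypothetical streamed JSON Lines format (one JSON object per line, per-file records carrying a "path" key); none of these field names are the actual scancode schema.

```python
import json
import sys


def consume_scan_stream(lines):
    """Consume a hypothetical streamed JSON Lines scan output one record
    at a time: each per-file record is handled and then discarded, so
    memory use stays flat regardless of codebase size. The record shapes
    used here ("path", "headers", "license_detections") are illustrative
    assumptions, not the real scancode output schema."""
    files_seen = 0
    for line in lines:
        record = json.loads(line)
        if "path" in record:
            # A per-file scan result: process it, then let it go.
            files_seen += 1
        # Header and end-of-stream records could be handled here too.
    return files_seen


if __name__ == "__main__":
    # e.g.: scancode ... | python consume.py   (assuming a streaming output)
    print(consume_scan_stream(sys.stdin))
```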

Possible Solution/Implementation Details

Line 1: start of stream/headers
Lines 2 to n: scan results for each file
Line n+1: end of stream/license_detections

Example/Links if Any

Can you help with this Feature

pombredanne commented 10 months ago

This would be useful indeed but there may be some gremlins along the way as this is the general processing flow:

  1. Run pre-scan plugins
  2. Run all scanners on one file. Cache the results on disk in a temp JSON file, one such file for each file scanned (these are stored on disk as soon as more than 10,000 files have been scanned)
  3. Run the "process_codebase" step of each scanner, one after the other; each iterates over the whole codebase and can amend the results from step 2.
  4. Run post-scan plugins, one after the other; each iterates over the whole codebase and can amend the results from steps 2 and 3.
  5. Run output plugins to create JSON/SBOM outputs
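The hazard for streaming can be seen in a toy model of steps 2 through 4 (this is not the real scancode internals, just a sketch of the shape of the flow): per-file results exist early, but later whole-codebase passes may still amend them, so lines emitted right after the per-file scan could turn out to be stale.

```python
def run_scan(paths, scanners, codebase_plugins):
    """Toy model of the flow above, not the actual scancode code.
    Step 2: scan each file independently, producing per-file results.
    Steps 3-4: whole-codebase passes that may amend those results,
    which is why per-file lines streamed out before these passes run
    would not be guaranteed final."""
    # Step 2: per-file scanning (results are complete per file, for now).
    results = {p: {s.__name__: s(p) for s in scanners} for p in paths}
    # Steps 3-4: plugins iterate the whole codebase and may rewrite entries.
    for plugin in codebase_plugins:
        plugin(results)
    return results
```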

We could technically output some JSON lines early (or some pseudo-JSON-like lines) and output the headers and summaries at the end, but there is no way to guarantee that the results are correct and will not be amended after the initial scanner step.

Now, what is your concern? Getting some details of the scans early for diagnostic purposes, or stream-processing large scans?

armijnhemel commented 10 months ago

> Now, what is your concern? Getting some details of the scans early for diagnostic purposes, or stream-processing large scans?

Mostly the latter. Of course, if there are other ways to speed up scancode that would also help a lot.

Another thing is that right now it isn't easy to extract the results for a single file from the larger scancode output file. So what I might actually be after is the temp JSON file from the cache (one per file).
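For comparison, extracting one file's entry from a finished report today means loading the whole thing. A minimal sketch, assuming the usual top-level "files" list whose entries carry a "path" key; note that it still reads the entire report into memory, which is exactly the cost a streamed or per-file output would avoid.

```python
import json


def result_for_path(report_path, target):
    """Pull a single file's entry out of a full scancode JSON report.
    Assumes a top-level "files" list with "path" keys on each entry;
    the whole report is loaded into memory first, so this does not
    scale the way a per-file cache or a streamed output would."""
    with open(report_path) as f:
        report = json.load(f)
    for entry in report.get("files", []):
        if entry.get("path") == target:
            return entry
    return None
```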