anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0
6.2k stars, 571 forks

Adding "Stats" on the scan inside the json #3157

Open TimBrown1611 opened 2 months ago

TimBrown1611 commented 2 months ago

What would you like to be added: I had an idea to add a "Stats" attribute to the schema, which would include metadata about the scan, for example how much time each cataloger took. I think this kind of information can make Syft better, since users could more easily analyze which tasks took longer and tune Syft to their needs (for example, by enabling or disabling catalogers).

Why is this needed: Running in debug mode is much slower. Moreover, logs are much harder to work with than JSON, which can be analyzed far more easily and without decreasing performance.

Additional context: This is related to this PR: https://github.com/anchore/syft/pull/3105

Do you think this kind of feature could be considered in the future, given that it involves adding information to the final schema?

wagoodman commented 2 months ago

The SBOM is really attempting to capture what was found and how it was found. We attempt to keep auxiliary / secondary metadata out of the SBOM, especially if that data would make the SBOM more difficult to reproduce (such as timestamps).

From your use case, it sounds like you're interested in tracking these performance metrics alongside the SBOM for future tuning of configuration -- is this true? If that's the case, we still want the configuration used to be captured in the SBOM (as it is today), but having the capability to configure syft to output structured logging would go a long way here. That way you'd need the SBOM and logs from a run to tune configuration for performance.

We're about 90% there code-wise. Here's what I mean:

log:
  # suppress all logging output (env: SYFT_LOG_QUIET)
  quiet: false

  # increase verbosity (-v = info, -vv = debug) (env: SYFT_LOG_VERBOSITY)
  verbosity: 0

  # explicitly set the logging level (available: [error warn info debug trace]) (env: SYFT_LOG_LEVEL)
  level: 'warn'

  # file path to write logs to (env: SYFT_LOG_FILE)
  file: ''

What's missing is the ability to specify structured logging from the config:

log:
  # output JSONL objects
  structured: true

or something like:

log:
  # in the future we could add support for more formats...
  format: jsonl

We would also need a code change in clio to respond to that configuration.

@TimBrown1611 would this be a possible path forward for you?

TimBrown1611 commented 2 months ago

So if I understand you correctly, you suggest adding another configuration mode that would create an attribute in the response JSON containing information about the scan, right? If so, I think it can be a good solution. By the way, another thing I think we could add there is the indexing time. In your last community meeting you discussed some solutions for indexing; this tool could help monitor performance without using verbose mode (which can decrease performance).

If this direction sounds good, I can try to create a PR for this. @wagoodman

kzantow commented 2 months ago

We don't want to put the stats in the Syft output, but adding log times, as your(?) PR does, would be a great idea. I left a comment about a problem on the PR; I can push a fix for it if you like.

tomersein commented 2 months ago

I addressed the comment, @kzantow. @wagoodman, I might not understand your comment; can you please explain?

kzantow commented 2 months ago

What @wagoodman is talking about is outputting Syft logs as JSON, not including stats in the SBOM itself.

Today, we have the ability to do "structured" logging -- the log.WithFields("name", value,...) thing. We should be using this to output variables in logs, as the task execution time now does. Currently we can only output plain-text logs, but we should be able to replace the logger with one that takes these structured log messages and outputs JSON instead, so you could run syft using something like:

SYFT_LOG_FORMAT=json SYFT_LOG_FILE=log.json syft ...

... and syft would output JSON, with the structured fields split out as JSON object fields instead of plain text. Presumably you would see log entries something like the one below, which would make it easier to filter for the right entries and get the info you need:

{
  "message": "task completed",
  "task": "javascript-cataloger",
  "elapsed": "904ms"
}
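
To make the idea concrete, here is a minimal sketch of a logger that can be switched between plain text and one-JSON-object-per-line output, and that records a task's elapsed time as structured fields. It is written in Python purely as an illustration; it is not syft or clio code. The names `JsonLineFormatter`, `build_logger`, `timed_task`, and the `fields` attribute are hypothetical, and `SYFT_LOG_FORMAT` remains a proposal in this thread, not an existing option.

```python
import io
import json
import logging
import time
from contextlib import contextmanager


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line (JSONL)."""

    def format(self, record):
        entry = {"level": record.levelname.lower(), "message": record.getMessage()}
        # Structured fields are carried on the record under a hypothetical
        # "fields" attribute, supplied through logging's `extra` argument.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(log_format, stream):
    """Return a logger writing plain text or JSONL, depending on log_format."""
    logger = logging.getLogger(f"sketch.{log_format}")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    if log_format == "json":
        handler.setFormatter(JsonLineFormatter())
    else:
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


@contextmanager
def timed_task(logger, task):
    """Log a 'task completed' entry with the task name and elapsed time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        logger.info(
            "task completed",
            extra={"fields": {"task": task, "elapsed": f"{elapsed_ms}ms"}},
        )


buffer = io.StringIO()
log = build_logger("json", buffer)
with timed_task(log, "javascript-cataloger"):
    time.sleep(0.01)  # stand-in for real cataloging work
print(buffer.getvalue(), end="")
```

The point of the design is that call sites only ever attach fields; whether they end up as plain text or JSON is decided once, where the logger is constructed.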

willmurphyscode commented 4 weeks ago

Notes to whoever picks this up: The scope here is to add structured logging to syft (if it doesn't already exist) so that log lines can be JSON documents, for example, and then to add some timing-related log messages that look similar enough to one another to be greppable, so that someone can configure syft with file system logging in JSON format and then easily parse the log to understand how long different parts of the cataloging took. Adding statistics about the scan's performance to the SBOM output is out of scope.
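
As a rough illustration of the "easily parse the log" step, the sketch below sums elapsed time per task from a JSONL log. It assumes entries shaped like the hypothetical example earlier in this thread (`message`, `task`, `elapsed`); the real field names and duration format would depend on how the feature is eventually implemented. The sample data is made up, and only durations ending in `ms` or `s` are handled.

```python
import json
from collections import defaultdict

# Made-up sample log in the hypothetical format discussed in this thread.
SAMPLE_LOG = """\
{"message": "task completed", "task": "javascript-cataloger", "elapsed": "904ms"}
{"message": "task completed", "task": "python-cataloger", "elapsed": "1.5s"}
{"message": "some other entry"}
{"message": "task completed", "task": "javascript-cataloger", "elapsed": "96ms"}
"""


def parse_elapsed_ms(text):
    """Convert a duration such as '904ms' or '1.5s' to milliseconds.

    Any other form (for example '1m2s') raises ValueError.
    """
    if text.endswith("ms"):  # must be checked before the plain "s" suffix
        return float(text[:-2])
    if text.endswith("s"):
        return float(text[:-1]) * 1000.0
    raise ValueError(f"unsupported duration: {text!r}")


def summarize(lines):
    """Return total elapsed milliseconds per task for 'task completed' entries."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("message") != "task completed" or "task" not in entry:
            continue  # skip unrelated log entries
        totals[entry["task"]] += parse_elapsed_ms(entry["elapsed"])
    return dict(totals)


totals = summarize(SAMPLE_LOG.splitlines())
# Print the slowest tasks first.
for task, ms in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task}: {ms:.0f}ms")
```

Against a real log file, the same function could be fed the open file object, for example `summarize(open("log.json"))`, where `log.json` is the file name used in the earlier example command.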