Open jelmer opened 4 years ago
Machine-readable output is not feasible in general nor in-scope for cats, since it runs a whole bunch of different tools that have no standardisation of output format nor even of exit code and cats is specifically designed to cater to this set of tools by focussing on the lowest common denominator of whether there is any tool output at all.
In addition cats checks files (like RPM or PNG) that don't have "lines" so that part isn't feasible in general either.
In addition most of the tools cats runs don't have anything like lintian's tags and associated descriptions so that part isn't feasible in general either.
The only way to achieve machine-readable output would be to duplicate the Fedora Firehose project, which defines the Firehose static analysis output format, adds output support for that format to some tools and then adds a bunch of parsers for non-standard output formats and a transformation from those parsers to the Firehose format. I don't think it is a useful to duplicate an existing project anyway. Unfortunately the Fedora project seems to have abandoned Firehose, so probably it needs to be revived.
https://github.com/fedora-static-analysis/firehose https://fedoraproject.org/wiki/StaticAnalysis
In addition there are various commercial semi-proprietary services like GitHub, Code Climate, Sonar Cube and so on that basically duplicate Fedora Firehose more successfully and present the output using a web interface. Outside of Debian no-one cares about using proprietary services to check code and prefers web output to the text output of cats so I don't think cats or anything like Fedora Firehose has much chance of competing with them, unless it is also a commercial hosted service that is gratis to open source projects and costs money for enterprises.
-- bye, pabs
Ah, I see - I guess check-all-the-things has slightly different goals from what I'm looking for in that case. I wasn't aware of firehose effort, that does actually look a lot like what I'm after.
Thanks!
Yeah, cats aims to expose the output of as many QA tools as possible to anyone using a terminal. Going the lowest common denominator of printing text output allows it to get broad coverage of tools but at the same time this also means that it is fairly obsolete because users of IDEs and web based cats equivalents have no reason to use the tool so the audience is restricted to those who code on terminals, which is slowly shrinking over time.
Another trend in static analysis is to communicate results between launguage implementations and IDEs using the Language Server Protocol. This also means that cats intended broad coverage of tools is made harder as language implementations don't output the same set of analysis over non-LSP formats.
So if the Firehose project gets revived, it might be interesting to integrate cats with it in some way to broaden the audience for cats, so I think perhaps we should keep this issue open.
BTW: if you want access to collab-qa as a place to fork Firehose or any other QA tools, I'd be happy to see that happen.
-- bye, pabs
So I think I have convinced myself that machine-readable output is actually a good idea afterall.
I think that the support should end up something like as follows.
The Firehose project needs to be been revived or forked.
Have the Firehose project split the set of output parsers from the code that runs the tests that the parsers know about so that different tools (like cats) can use these same parsers.
A modifier that can be used like this to turn on firehose output:
command = foo {fh}--output=firehose{/fh}
Some options that can be used to enable parsing non-standard output for a specific command:
command = foo
fh-parser = file-line-column
A command-line option to turn on machine-readable output in commands and turn on parsing of non-standard output and transformation into machine-readable output.
Some similar support for LSP as an alternative to Firehose for the audience that are IDE users.
-- bye, pabs
I e-mailed the author of Firehose, who mentioned he's no longer working on it but pointed me at SARIF (https://docs.oasis-open.org/sarif/sarif/v2.0/csprd01/sarif-v2.0-csprd01.html) as a possible alternative.
The spec is very long, but the output doesn't look that terrible; see e.g. https://docs.oasis-open.org/sarif/sarif/v2.0/csprd01/sarif-v2.0-csprd01.html#_Toc517436281
On Sat, 2020-03-07 at 01:54 -0800, Jelmer Vernooij wrote:
I e-mailed the author of Firehose, who mentioned he's no longer working on it but pointed me at SARIF as a possible alternative.
Interesting, I hadn't heard of that before.
The spec is very long, but the output doesn't look that terrible;
Indeed.
Looking at Microsoft's repos for SARIF:
https://github.com/search?q=org%3Amicrosoft+sarif
It seems very C# oriented, but there is at least Python code for the object model:
https://github.com/microsoft/sarif-python-om
As far as converters and formatters go they have only one of each, one in TypeScript and one in Python.
https://github.com/microsoft/axe-sarif-converter https://github.com/microsoft/bandit-sarif-formatter
PS: I discovered a related protocol SASP, for communication between static analysis servers and the post about it mentions a pylint SARIF converter and some work on adding SARIF output to the Clang Static Analyzer. They also have a (presumably private) way to capture valgrind output and gdb stack traces too:
https://blogs.grammatech.com/static-analysis-results-a-format-and-a-protocol-sarif-sasp https://github.com/GrammaTech/pylint-sarif
I think a probably useful thing to add would be to fork the Firehose Python code into a tool-output-to-sarif codebase that would simply parse the output and place results into the SARIF Python object model.
-- bye, pabs
What would be purpose of the firehose codebase be in this case, compared to individual formatters like e.g. https://github.com/microsoft/bandit-sarif-formatter ?
IIRC the existing Firehose codebase mostly consists of converters from the output of tools that don't support Firehose to the Firehose format.
Formatters (either separate like that one or preferably built-in) are the better options than converters because the conversion happens internally and can potentially provide more information.
The value of the existing Firehose converters is that they are a collection of output converters for tools that don't support Firehose, porting the codebase to SARIF would mean quickly adding converters for many tools at once. This would be a quick way to get more SARIF coverage before starting to add SARIF support to each tool, waiting for it to get merged/released and so on. I think this is why the Firehose folks went this way; it made it quicker to get more tool coverage, despite the downsides of converters vs formatters.
-- bye, pabs
Is it useful to have all converters in the same repo though, rather than separately? e.g. https://github.com/microsoft/axe-sarif-converter is a converter rather than a formatter as well.
As a side note, this is an interesting list of static analysis tools https://github.com/codeconditioner/awesome-static-analysis
Personally I tend to prefer lots of similar things to be maintained in one place, Linux kernel drivers, scanner drivers or even cats recipes. It is simpler for packaging and provides better maintenance.
That awesome list is a fork of a repo already mentioned in doc/TODO.
-- bye, pabs
That makes sense to me. I'm mostly skeptical that there is much to be salvaged from the firehose codebase; its data model is different from SARIF's, and the project itself is dead.
The SARIF SDK already appears to ship a number of converters for other things, see: https://github.com/microsoft/sarif-sdk/tree/master/src/Sarif.Converters, including for cppcheck, gcc and clanganalyzer (and there appear to be some references to SARIF in the clang source code, does it have native support now?).
frama_c now natively supports SARIF (https://frama-c.com/Changelog.html)
Of the parsers shipped by firehose (https://github.com/fedora-static-analysis/firehose/tree/master/firehose/parsers), that leaves:
OK, thanks for the extra research.
Some takeaways I've made from this:
-- bye, pabs
Hope you don't mind my 2-cents.
I'm looking for ways to get more static analysis in my CI pipelines, and often being on a GitLab server, I really like their "Code Quality" feature which runs Code Climate scanners. But those require Docker-in-Docker, which is annoying.
I've kept CATs in the back of my mind for a while as a potential tool to look into. And having machine readable output for consumption in GitLab, which I can then scroll and click through, would be great. I would be happy to setup my own Docker container with a handful of applicable scan tools and use CATs to execute them all.
So a long-winded +1 for machine readable output. Converting from something like SARIF to the Code Climate JSON report format should be pretty trivial.
Thanks for your interest in CATS and this potential feature.
The open question for machine-readable output is how to do it.
Lets explore the problem space a bit more.
Some commands will support SARIF natively, probably via an additional command-line option.
Some commands will have available converters from their native output format to SARIF.
Some commands will have a standard format like "file:line:col: msg" that might have a SARIF converter.
Some commands will have no way to convert their output to SARIF.
So I think we need in CATS these things:
In the command field, a way to enable a specific option when a particular output field is requested. Maybe {sarif}{/sarif} or maybe two checks and a flag, or maybe both will be needed.
A set of flags that indicate the output type, with enough options to cover all of the above cases. Or a separate output type field?
Code to generate SARIF output that says "tool foo output something, see plain text file bar for details".
A command-line option to select the output format, enabling the above things for each check when SARIF is enable.
Probably some sort of parsing and writing of SARIF to be able to combine multiple SARIF documents.
A list of converters between different formats, including from various formats to SARIF and maybe also converters to other formats, but really things like Code Climate should switch to reading SARIF.
The above is a design sketch. I'm unlikely to spend time on it other than reviewing/testing patches, unless someone has an ongoing funding stream that could be used for CATS work including the SARIF features. Hopefully we can find a way to achieve this. There are lots of other CATS features that could use help or funding to achieve too. So far I have failed to find funding to work on CATS and very few people seem to care about CATS and there are many many many CATS competitors (some use SARIF already) so I haven't had much motivation to do anything on CATS.
-- bye, pabs
One other benefit of the above design sketch is that we also get a way to handle commands that have output that isn't suitable for output to a terminal. Like the commands that produce XML and that CATS currently has to add an XML to plain text converter to the command field.
-- bye, pabs
It would be great if check-all-the-things supported a machine-readable output format.
In particular, it would be useful if there was a consistent way of reporting: