Quuxplusone / LLVMBugzillaTest


Analyser output is almost useless on large files #3154

Open Quuxplusone opened 16 years ago

Quuxplusone commented 16 years ago
Bugzilla Link PR2879
Status CONFIRMED
Importance P normal
Reported by Roger Binns (clang@rogerbinns.com)
Reported on 2008-10-10 16:23:05 -0700
Last modified on 2010-02-22 12:55:21 -0800
Version unspecified
Hardware All All
CC clattner@nondot.org, kremenek@apple.com, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
My project includes the SQLite amalgamation - see
http://sqlite.org/download.html and http://sqlite.org/amalgamation.html

The current amalgamation is 97k lines and 3.2MB.  The analyser finds about 20
issues in it, most of which are false positives due to the lack of
inter-procedural analysis (it can't tell that various error flags are set,
which prevents the later code paths it claims are a problem from executing).

The annotated source output always covers the whole file and ends up being 14MB
of HTML.  The browser takes almost two minutes to load the file and is mostly
unresponsive the whole time.  This makes it very tedious and time-consuming to
examine the reports.

One suggested improvement for large files is to include in the annotations only
the function(s) containing the issue, plus a few hundred lines of context before
and after, which would drastically shrink the HTML files.
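The trimming idea above could be sketched roughly as follows; the function name, the window size, and the stand-in source are all illustrative, not how scan-build actually works:

```python
# Hypothetical sketch: emit only a window of context lines around each
# reported issue, instead of annotating the whole 97k-line file.

def trim_to_context(lines, issue_lines, context=200):
    """Return only the lines (1-based numbering) within `context` lines
    of any reported issue, merging overlapping windows."""
    keep = set()
    for issue in issue_lines:
        lo = max(1, issue - context)
        hi = min(len(lines), issue + context)
        keep.update(range(lo, hi + 1))
    return [lines[i - 1] for i in sorted(keep)]

source = [f"line {i}" for i in range(1, 10001)]  # stand-in for sqlite3.c
trimmed = trim_to_context(source, issue_lines=[4321], context=200)
print(len(trimmed))  # 401 lines instead of 10000
```

With a few hundred lines of context per issue, even twenty reports against the amalgamation would stay far below the 14MB the full annotation currently produces.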

Secondly, it would be very helpful if the table in index.html also included the
function name.  That way I could tell whether a report is a new issue or an
existing one without having to click on the report itself and wait several minutes.

This also interacts with issue #2828.  To get a real world example of how
painful all this is in action, do the following:

$ svn checkout http://apsw.googlecode.com/svn/trunk/ apsw
$ cd apsw
$ python setup.py --fetch-sqlite   # gets the sqlite amalgamation
$ cp sqlite3/sqlite3.c .           # puts in cwd
$ ./clangme.sh                     # read and adjust contents as necessary
Quuxplusone commented 16 years ago
(In reply to comment #0)
> My project includes the SQLite amalgamation - see
> http://sqlite.org/download.html and http://sqlite.org/amalgamation.html
>
> The current amalgamation is 97k lines and 3.2MB.  The analyser finds about 20
> issues in it, most of which are false positives due to lack of inter-procedure
> analysis (it can't tell that various error flags are set preventing later code
> paths claimed to be a problem).

The lack of inter-procedural analysis is a serious problem for C code, as the
interfaces typically aren't as strong.  It's something that we will get to
eventually; I understand that it is a serious shortcoming that needs to be
addressed.

>
> The annotated source output is always of the whole file and ends up being 14MB
> of HTML.  The browser takes almost two minutes to load the file and is mostly
> unresponsive the whole time.  This makes it very tedious and time consuming to
> examine the reports.

This is definitely a problem with the current reporting mechanism.  Even if the
original source file is not that large, reporting a large number of issues
over a big project can be equally detrimental.

Having scan-view be a dynamic process (rather than just using static HTML
files) opens up some new possibilities.  I'll comment on those below.

>
> One suggested improvement for large files is to include the function(s) where
> the issue is and a few hundred lines before and after in the annotations which
> will drastically shrink the html files.

This solution will work well when we consider just intra-procedural checks.

We can do something like a transitive closure: include all the
definitions/declarations of functions, variables, and types referenced by a
given function (although we would limit ourselves to the source code of a given
file).
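The transitive-closure idea can be sketched with a hand-written reference map; a real implementation would build this map from the AST, and all the names below are invented for illustration:

```python
# Illustrative sketch: starting from the function containing an issue,
# collect every definition it references, then everything those
# reference, and so on, until the set stops growing.

def transitive_closure(refs, start):
    """Collect `start` plus everything reachable through `refs`."""
    seen = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(refs.get(name, ()))
    return seen

# Hand-written stand-in for an AST-derived reference map:
refs = {
    "sqlite3_step": ["sqlite3VdbeExec", "Vdbe"],
    "sqlite3VdbeExec": ["Mem", "Op"],
}
print(sorted(transitive_closure(refs, "sqlite3_step")))
# ['Mem', 'Op', 'Vdbe', 'sqlite3VdbeExec', 'sqlite3_step']
```

Only the definitions in this set would need to appear in the annotated output, which keeps the HTML proportional to what the issue actually touches rather than to the file size.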

Ultimately, the best solution is to make the generation of the HTML purely
dynamic.  This allows users to explore function bodies with a click of the
mouse, navigate to other files, etc.  This will be especially important when we
start to do inter-procedural checks.

To accomplish some of these goals, the output of scan-build should no longer be
generated HTML files, but instead two things that allow the HTML content to be
generated dynamically:

(a) A condensed representation of the bug report (source locations, message
bubble locations, etc) that is written to disk (a single file?)

(b) A condensed representation of the source files needed to represent the bug
reports when they are generated to HTML.  We need to make copies because the
original source can be modified by the user after doing an analysis run.  One
possibility is to serialize out the token stream using the LLVM bitcode
representation.  This is something we have been thinking about anyway for other
purposes.
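As a hedged sketch of what the condensed report in (a) might look like: one self-contained record per issue, written to disk so HTML can be generated on demand later.  Every field name here is invented for illustration; this is not an actual analyzer output format:

```python
# Hypothetical condensed bug-report record, serialized to disk so a
# viewer can render HTML for it later without re-running the analysis.
import json

report = {
    "file": "sqlite3.c",
    "function": "sqlite3VdbeExec",   # would let index.html show the function
    "category": "Dereference of null pointer",
    "path": [                        # message-bubble locations along the path
        {"line": 48210, "col": 7, "msg": "Assuming 'pOp' is null"},
        {"line": 48215, "col": 11, "msg": "Dereference of null pointer"},
    ],
}

with open("report-0001.json", "w") as f:
    json.dump(report, f, indent=2)
```

A record like this also addresses the index.html request below: the function name is stored per report, so the summary table can show it without loading any annotated source.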

>
> Secondly it would be very helpful if the table in index.html also included the
> function name.  That way I could tell if a report is a new issue or an existing
> issue without having to click on the report itself and wait several minutes.

Agreed.  Eventually the entire GUI needs to be reworked so that users can far
more nimbly go through the error reports.
Quuxplusone commented 16 years ago
(In reply to comment #1)
> Having scan-view be a dynamic process (rather than just using static HTML
> files) opens up some new possibilities.  I'll comment on those below.

I am very happy with the html output just so long as there is a lot less of it.
It is also possible to email them etc which makes life easier when reporting
issues upstream.

A dynamic scan-view would probably be useless to me.  The analyser takes a long
time to run on my single file (which #includes other files).  In fact I
discovered that if the analyser is compiled in debug mode it would take almost
until the heat death of the universe to complete :-)  Not quite, but long
enough that I have to compile it in Release/optimised mode.

> We can do something like a transitive closure: include all the
> definitions/declarations of functions, variables, and types referenced by a
> given function (although we would limit ourselves to the source code of a
> given file).

I don't need all that gunk in the output HTML, just the function and a small
amount of surrounding context along with the trail of the alleged problem.

> Ultimately, the best solution is to make the generation of the HTML purely
> dynamic.

That may be the best long-term solution, but in the short term I need to email
the issues to others, which requires static output.  The recipients and I can
do any declaration lookups ourselves.

> Agreed.  Eventually the entire GUI needs to be reworked so that users can far
> more nimbly go through the error reports.

Also remember the use case of sending information to others :-)
Quuxplusone commented 16 years ago

The bulk of the time spent in the analyzer is not in the HTML emitter. I believe Ted's proposal is to have the actual checker run once, emit info about the problems found into a "database", and then have a dynamic system that just formats the HTML according to your constraints. I believe that this will solve the problem...

Quuxplusone commented 16 years ago
(In reply to comment #3)
> I believe
> Ted's proposal is to have the actual checker run once, emit info about
> problems found into a "database" and then have a dynamic system that just
> formats the HTML according to your constraints.

That is pretty much how Coverity and similar tools work.  On the plus side,
their "database" is persistent, so you can mark items as false positives or
wontfix and it won't bug you about them again on the next run (it also copes
with the code moving around a bit during the course of normal development).
Each reported issue also has a persistent id, so you can link to it from your
bug-tracking system.
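One way such a persistent id could work is to hash the issue on content that survives code motion (file, function, checker, message) rather than on the line number.  This is purely illustrative; it is not how Coverity or the clang analyzer actually compute ids:

```python
# Hypothetical persistent issue id: stable across runs even when the
# surrounding code shifts, because no line numbers go into the hash.
import hashlib

def issue_id(file, function, checker, message):
    key = "\x00".join([file, function, checker, message])
    return hashlib.sha1(key.encode()).hexdigest()[:12]

a = issue_id("sqlite3.c", "sqlite3VdbeExec", "core.NullDereference",
             "Dereference of null pointer")
b = issue_id("sqlite3.c", "sqlite3VdbeExec", "core.NullDereference",
             "Dereference of null pointer")
print(a == b)  # True: the same issue gets the same id even if lines shift
```

An id like this could then serve both as the "don't show me this again" key and as a stable link target for an external bug tracker.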

However, that is all longer-term work.  I would love a short-term hack that
omits from the HTML output any lines more than 100 away from the actual issue,
perhaps triggered automatically when the file has more than n lines, with a
suggested value of 2,000 for n.
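The short-term hack above amounts to a two-parameter filter; the threshold and distance values come straight from the comment, while the function name and report shape are made up for the sketch:

```python
# Sketch of the suggested hack: for files over `threshold` lines, drop
# annotated output further than `distance` lines from the issue.

def should_emit(line_no, issue_line, total_lines, threshold=2000, distance=100):
    if total_lines <= threshold:
        return True          # small file: emit everything, as today
    return abs(line_no - issue_line) <= distance

# A 97k-line file keeps only a 201-line window around the issue:
kept = [n for n in range(1, 97001) if should_emit(n, 48210, 97000)]
print(len(kept))  # 201
```

That is roughly a 500x reduction for the sqlite3.c case, which would bring the 14MB annotated page down to something a browser can open instantly.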