aconrad / pycobertura

A code coverage diff tool for Cobertura reports
MIT License

Can we improve parsing performance? #149

Closed aconrad closed 1 year ago

aconrad commented 2 years ago

This archive contains 2 really large coverage files and it takes ~17+ minutes to parse them on an Apple M1 Max.

$ time pycobertura show coverage_before.xml
Filename      Stmts    Miss  Cover    Missing
< ... snip ... >
3810.cpp          8       8  0.00%    57-66
3811.cpp         31      31  0.00%    13-78
3812.cpp          5       5  0.00%    13-20
3813.cpp         30      30  0.00%    13-73
TOTAL        642138  624616  2.73%
pycobertura show coverage_before.xml  1017.96s user 36.62s system 99% cpu 17:38.20 total

https://github.com/aconrad/pycobertura/files/8494425/Archive.zip

gro1m commented 2 years ago

Is there a way to formulate bash commands for the individual statistics that would be much faster, e.g. Filename could be extracted via:

grep -o 'filename=[^ ]*' coverage_before.xml | cut -f2 -d=

? I think this would help in getting a more efficient implementation.

aconrad commented 2 years ago

Is there a way to formulate bash commands for the individual statistics that would be much faster, e.g. Filename could be extracted via:

grep -o 'filename=[^ ]*' coverage_before.xml | cut -f2 -d=

? I think this would help in getting a more efficient implementation.

Hey @gro1m, that's a good idea. However, I think the issue is that XML attributes aren't guaranteed to appear in a fixed order within a tag, so relying on a fixed field number like that could extract the wrong data.

I played around with it and loading the XML isn't too bad (about a second on my laptop). But I noticed that calling a def total_...() method takes forever. I didn't have time to look more closely into it. I also know that I have a FIXME/TODO comment somewhere in the code about making that code O(1). I'm on my phone now so it's inconvenient for me to look it up, but there are pointers if you are interested.

aconrad commented 2 years ago

Another approach I have in mind concerns the way we parse the XML data using XPath. While XPath has a convenient interface for finding what we care about in the document, calling it is not cheap. On the other hand, lxml has a SAX-like interface (they call it the target interface) for reading the data. It lets us iterate over the XML once while receiving events from the parser that tell us where it is in the document, so we can decide which bits of parsed data to keep or ignore. For example, we could build a dict with filenames as keys and coverage statistics as values. Then, when reports are generated, we access that dict for the stats instead of calling XPath. We'd have to write more code to handle that way of parsing, but it's documented here: https://lxml.de/parsing.html#the-target-parser-interface

Might be worth considering.
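
For reference, here is a minimal sketch of what that target-based parsing could look like. The names (CoverageTarget, collect_stats) and the shape of the stats dict are made up for illustration; this is not pycobertura's code, just the general pattern from the lxml docs linked above.

from lxml import etree

class CoverageTarget:
    """Collects per-file line hits during a single pass over the XML."""

    def __init__(self):
        self.stats = {}      # filename -> {"covered": set of ints, "missed": set of ints}
        self._current = None

    def start(self, tag, attrib):
        if tag == "class":
            self._current = self.stats.setdefault(
                attrib.get("filename"), {"covered": set(), "missed": set()}
            )
        elif tag == "line" and self._current is not None:
            lineno = int(attrib["number"])
            key = "missed" if attrib.get("hits", "0") == "0" else "covered"
            self._current[key].add(lineno)

    def end(self, tag):
        if tag == "class":
            self._current = None

    def data(self, text):
        pass  # coverage stats don't need any text content

    def close(self):
        return self.stats


def collect_stats(path, chunk_size=1024 * 1024):
    parser = etree.XMLParser(target=CoverageTarget())
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            parser.feed(chunk)
    # With a target, parser.close() returns whatever the target's close() returns.
    return parser.close()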

gro1m commented 2 years ago

The following code snippet, which I tried via python -I, seems rather quick (though it only extracts filenames):

>>> from xml.dom import minidom
>>> import os
>>> os.listdir()
['coverage_after.xml', 'coverage_before.xml']
>>> dom = minidom.parse('coverage_before.xml')
>>> classes = dom.getElementsByTagName('class')
>>> for c in classes:
...     print(c.getAttribute('filename'))

gro1m commented 2 years ago

Ah OK, I see: the problem is not loading the XML, but deriving the statistics from it...

gro1m commented 2 years ago

Still, we can get coverage via the line-rate attribute:

for c in classes:
    print(c.getAttribute('line-rate'))

and total statements via the number attribute:

lines = dom.getElementsByTagName('line')
for l in lines:
    print(l.getAttribute('number'))

print([l.getAttribute('number') for l in lines if l.getAttribute('hits') == '0'])  # uncovered lines
print([l.getAttribute('number') for l in lines if l.getAttribute('hits') != '0'])  # covered lines

These computations all seem to be very fast, and from there we should somehow be able to derive the statistics. The only thing needed is to allocate the numbers to the right key (e.g. filename)...
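
A rough sketch of that allocation step, using the same minidom objects (the shape of the stats dict is just an illustration, not what pycobertura actually produces):

from xml.dom import minidom

dom = minidom.parse('coverage_before.xml')

stats = {}
for c in dom.getElementsByTagName('class'):
    # Calling getElementsByTagName on the <class> node returns only the
    # <line> elements nested under this class, i.e. this file's lines.
    lines = c.getElementsByTagName('line')
    missed = [l.getAttribute('number') for l in lines if l.getAttribute('hits') == '0']
    stats[c.getAttribute('filename')] = {
        'statements': len(lines),
        'missed': len(missed),
        'missing': missed,
    }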

oev81 commented 2 years ago

Hi!

The hottest methods of the Cobertura class are

Just replaced them with a map.
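
To illustrate the kind of change (this is only a sketch of the idea, not the actual patch; ClassIndex is a hypothetical name): instead of re-running an XPath query for each filename on every call, build a filename-to-element map once and do dict lookups afterwards.

from lxml import etree

class ClassIndex:
    """Hypothetical helper: index <class> elements by filename once, then look them up in O(1)."""

    def __init__(self, xml_path):
        tree = etree.parse(xml_path)
        # Single pass over the document instead of one XPath query per lookup.
        self._by_filename = {el.get("filename"): el for el in tree.xpath("//class")}

    def get(self, filename):
        return self._by_filename[filename]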

gro1m commented 2 years ago

@oev81 Can you make a PR and show the performance increase you get? Thanks!

oev81 commented 2 years ago

@gro1m in my case, conversion now takes 4-6s instead of 155s.

oev81 commented 2 years ago

PR https://github.com/aconrad/pycobertura/pull/151

aconrad commented 1 year ago

pycobertura 3.0.0 was released.