Closed aconrad closed 1 year ago
Is there a way to formulate bash commands for the individual statistics that would be much faster? E.g., the filename could be extracted via:
grep -o 'filename=[^ ]*' coverage_before.xml | cut -f2 -d=
I think this would help in getting a more efficient implementation.
Hey @gro1m, that's a good idea. The issue, though, is that XML attributes aren't guaranteed to appear in a fixed order within a tag, so relying on a fixed field number like that could return the wrong data.
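To illustrate the attribute-order problem, here is a small sketch (the two tags below are made-up examples): cutting a fixed field out of the whole line breaks when the attributes are reordered, whereas matching the attribute by name is order-independent.

```python
import re

# Two hypothetical <class> tags with their attributes in different orders.
# Extracting a fixed field number would grab different attributes for each,
# but a named match works for both.
tags = [
    '<class filename="pkg/a.py" line-rate="0.9">',
    '<class line-rate="0.9" filename="pkg/b.py">',
]

for tag in tags:
    # Match the filename attribute by name, wherever it appears in the tag.
    print(re.search(r'filename="([^"]*)"', tag).group(1))
```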
I played around with it, and loading the XML isn't too bad (about a second on my laptop). But I noticed that calling a `total_...()` method takes forever. I didn't have time to look closer into it. I also know I have a FIXME/TODO comment somewhere in the code about making it O(1). I'm on my phone now, so it's inconvenient for me to look, but there are pointers if you are interested.
Another approach I have in mind concerns the way we parse the XML data using xpath. While xpath has a convenient interface for finding what we care about in the document, calling it is not cheap. On the other hand, lxml has a SAX-style interface (they call it the target interface) for reading the data. It allows us to iterate over the XML once, receiving events from the parser that tell us where the parser is, and deciding which bits of parsed data to keep or ignore. For example, we could hold a dict with filenames as keys and coverage statistics as values. Then, when reports are generated, we read the stats from that dict instead of calling xpath. We'd have to write more code to handle that way of parsing, but it's documented here: https://lxml.de/parsing.html#the-target-parser-interface
Might be worth considering.
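As a rough sketch of that target-interface idea: the stdlib `xml.etree.ElementTree.XMLParser` accepts the same kind of `target` object as lxml (lxml's interface is modeled on it), so the sketch below uses the stdlib for self-containedness. The `CoverageTarget` class name and the inline XML snippet are made up for illustration.

```python
from xml.etree.ElementTree import XMLParser

class CoverageTarget:
    """Collect per-file line stats in a single pass over the XML."""
    def __init__(self):
        self.stats = {}      # filename -> {"total": int, "hits": int}
        self._current = None

    def start(self, tag, attrib):
        # The parser calls this for every opening tag it encounters.
        if tag == "class":
            self._current = self.stats.setdefault(
                attrib["filename"], {"total": 0, "hits": 0})
        elif tag == "line" and self._current is not None:
            self._current["total"] += 1
            if attrib.get("hits", "0") != "0":
                self._current["hits"] += 1

    def end(self, tag):
        if tag == "class":
            self._current = None

    def close(self):
        # Whatever close() returns is returned by parser.close().
        return self.stats

# Hypothetical Cobertura-style fragment standing in for a real report.
xml = """<coverage>
  <class filename="pkg/a.py">
    <lines><line number="1" hits="1"/><line number="2" hits="0"/></lines>
  </class>
</coverage>"""

parser = XMLParser(target=CoverageTarget())
parser.feed(xml)
stats = parser.close()
print(stats)  # {'pkg/a.py': {'total': 2, 'hits': 1}}
```

Report generation could then look stats up in this dict in O(1) instead of issuing an xpath query per file.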
The following code snippet, which I tried via python -I, seems rather quick (though it only extracts the filename):
>>> from xml.dom import minidom
>>> import os
>>> os.listdir()
['coverage_after.xml', 'coverage_before.xml']
>>> dom = minidom.parse('coverage_before.xml')
>>> classes = dom.getElementsByTagName('class')
>>> for c in classes:
...     print(c.getAttribute('filename'))
Ah ok, I see — the problem is not loading the XML, but deriving the statistics from it...
Still, we can get coverage via line-rate:
for c in classes:
    print(c.getAttribute('line-rate'))
and total statements via numbers:
line = dom.getElementsByTagName('line')
for l in line:
    print(l.getAttribute('number'))
print([l.getAttribute('number') for l in line if l.getAttribute('hits')=='0']) # for uncovered lines
print([l.getAttribute('number') for l in line if l.getAttribute('hits')!='0']) # for covered lines
These computations all seem to be very fast, and from there we should somehow derive the statistics. The only thing needed is to allocate the numbers to the right key (e.g. filename)...
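One way to allocate the numbers to the right key is to query the `line` elements per `class` element rather than document-wide, and build a dict keyed by filename. A minimal sketch using the same minidom calls as above (the inline XML fragment stands in for coverage_before.xml):

```python
from xml.dom import minidom

# Hypothetical Cobertura-style snippet standing in for a real report file.
xml = """<coverage>
  <class filename="pkg/a.py" line-rate="0.5">
    <lines>
      <line number="1" hits="1"/>
      <line number="2" hits="0"/>
    </lines>
  </class>
</coverage>"""

dom = minidom.parseString(xml)
stats = {}
for c in dom.getElementsByTagName('class'):
    # getElementsByTagName on the element only searches its descendants,
    # so these are the lines belonging to this file.
    lines = c.getElementsByTagName('line')
    missed = [l.getAttribute('number') for l in lines
              if l.getAttribute('hits') == '0']
    stats[c.getAttribute('filename')] = {
        'line-rate': c.getAttribute('line-rate'),
        'total': len(lines),
        'missed': missed,
    }

print(stats)
# {'pkg/a.py': {'line-rate': '0.5', 'total': 2, 'missed': ['2']}}
```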
Hi!
The hottest methods of the Cobertura class are
I just replaced them with a map.
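The thread doesn't show the exact methods involved, so the following is only a generic sketch of the optimization being described: replacing a per-call linear scan with a dict (map) built once up front, so each subsequent lookup is O(1). The class and data names are made up.

```python
class SlowReport:
    """Per-call linear scan: O(n) for every lookup."""
    def __init__(self, classes):
        self._classes = classes  # list of (filename, line_rate) pairs

    def line_rate(self, filename):
        for name, rate in self._classes:
            if name == filename:
                return rate

class FastReport:
    """Build the map once; every later lookup is O(1)."""
    def __init__(self, classes):
        self._by_file = dict(classes)

    def line_rate(self, filename):
        return self._by_file.get(filename)

classes = [("pkg/a.py", 0.9), ("pkg/b.py", 0.5)]
print(FastReport(classes).line_rate("pkg/b.py"))  # -> 0.5
```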
@oev81 Can you open a PR and show the performance increase you get? Thanks!
@gro1m In my case, conversion now takes 4-6s instead of 155s.
pycobertura 3.0.0 was released
This archive contains 2 really large coverage files, and it takes ~17+ minutes to parse them on an Apple M1 Max.
https://github.com/aconrad/pycobertura/files/8494425/Archive.zip