ax3l opened 7 years ago
Hi, you might want to try changing the cache that Breathe uses. By default it just stuffs all the results of parsing files into a dictionary which is created here:
https://github.com/michaeljones/breathe/blob/master/breathe/parser/__init__.py#L99
The cache was introduced to speed things up but realistically it might get very big. If you replace that dictionary with an object that has a dictionary interface but never stores any data then you might have luck reducing the memory usage. It is possible that some kind of LRU cache might represent the best of both worlds but I've no idea what the cache hit/miss pattern would be so it is hard to say.
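As a sketch of that idea (not code from Breathe itself; `NullCache` and `parse_cached` are made-up names), a dictionary-interface object that never stores anything, next to an LRU-bounded alternative, might look like:

```python
import functools

class NullCache(dict):
    """Dict-interface cache that never retains anything: every store is
    dropped, so every lookup misses and the caller simply re-parses."""
    def __setitem__(self, key, value):
        pass  # deliberately discard the parsed result

# A bounded middle ground: keep only the N most recently used parse results.
@functools.lru_cache(maxsize=32)
def parse_cached(filename):
    return "parsed:" + filename  # stand-in for the real minidom parse call

cache = NullCache()
cache["compound.xml"] = "<huge parsed DOM>"
print("compound.xml" in cache)  # prints False: nothing was retained
```

Whether the LRU variant helps depends entirely on the hit/miss pattern, as noted above.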
Also, thanks for using Breathe and for raising this issue! :)
Do you have any results from this at all? Happy to introduce a basic cache/no-cache flag to the main configuration options to help people deal with this.
I did not try to modify the cache yet, but I am now at a point where even selecting a few dozen classes (out of many hundred; the number of doxygen XML files is 2791) in a page via doxygenclass::
becomes so memory-hungry that RTD's 1 GB virtual memory limit gets broken... :(
update: I tried skipping the cache build at all which makes the build slow as expected...
Digging a bit deeper, the underlying problem seems to be that single entries in the cache can easily get up to MBytes in size. The main consumers, from a first glimpse, seem to be compound.(member)RefTypeSub / compound.listofallmembersTypeSub.

The interesting thing is that the .xml files that doxygen creates, with all their well-known formatting overhead, only sum up to 84 MByte for the project I am reporting. I wonder where all the overhead is coming from?
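One way to quantify that overhead is to measure the allocations the parser keeps alive; this standalone sketch uses a fabricated index rather than real Doxygen output:

```python
import tracemalloc
import xml.dom.minidom as minidom

# Fabricated stand-in for a Doxygen index: ~2000 small <compound> entries.
xml_text = "<index>" + "".join(
    "<compound refid='c%d'><name>Class%d</name></compound>" % (i, i)
    for i in range(2000)
) + "</index>"

tracemalloc.start()
dom = minidom.parseString(xml_text)
retained, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("raw XML bytes:", len(xml_text))
print("bytes retained by the parsed DOM:", retained)
# On CPython the minidom tree is typically several times larger than the text,
# since every element, attribute, and text node becomes a full Python object.
```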
I think the main issue is the memory footprint of the XML parser minidom, whose parsed result object is cached.
I think we have two ways to speed things up while keeping a cache:

A) keep everything in place, but cache the raw file and parse it on every access in RAM
B) replace the XML parser with something that does not introduce a memory overhead of 20x while reading a text file
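Option A might be sketched like this (hypothetical `RawTextCache`, not Breathe's actual cache): store only the raw file text and re-parse on each lookup:

```python
import os
import tempfile
import xml.dom.minidom as minidom

class RawTextCache:
    """Option A sketched: remember only the raw file text (small) and
    re-parse it on every access, trading CPU time for memory."""
    def __init__(self):
        self._raw = {}

    def __setitem__(self, filename, parsed):
        # Ignore the parsed DOM; keep just the text on the Python side.
        with open(filename, encoding="utf-8") as f:
            self._raw[filename] = f.read()

    def __getitem__(self, filename):
        # Re-parse in RAM on each access; no big DOM stays cached.
        return minidom.parseString(self._raw[filename])

    def __contains__(self, filename):
        return filename in self._raw

# Tiny demo with a fabricated file.
path = os.path.join(tempfile.mkdtemp(), "classFoo.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write("<doxygen><compounddef id='classFoo'/></doxygen>")

cache = RawTextCache()
cache[path] = None  # the parsed value is discarded on purpose
dom = cache[path]
print(dom.documentElement.tagName)  # prints doxygen
```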
I will hack around a bit and report back.
Since all the reports on minidom I found on Stack Overflow are discouraging (huge overheads in general, supposedly also quite some memleaks), I started investigating B).

I started trying to replace minidom with ElementTree, which can potentially also be drop-in replaced later with lxml(2) for speed.
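For reference, the extraction style the two parsers offer looks roughly like this; the XML snippet is fabricated, and breathe's real generated parser is far more involved:

```python
import xml.etree.ElementTree as ET

xml_text = ("<doxygenindex>"
            "<compound refid='classFoo' kind='class'><name>Foo</name></compound>"
            "<compound refid='classBar' kind='class'><name>Bar</name></compound>"
            "</doxygenindex>")

# minidom style (roughly what the current generated parser wraps):
#   dom = xml.dom.minidom.parseString(xml_text)
#   names = [n.firstChild.data for n in dom.getElementsByTagName("name")]
# ElementTree style, which builds a much lighter in-memory tree:
root = ET.fromstring(xml_text)
names = [c.findtext("name") for c in root.findall("compound")]
print(names)  # prints ['Foo', 'Bar']
```

Because lxml.etree mirrors the ElementTree API, code written this way can later pick up lxml with little or no change.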
Did not succeed yet fully, but here is the branch so far: (progress logged in commit message) https://github.com/ax3l/breathe/tree/topic-xmlElementTree
@michaeljones although I don't have the time to work on this deeper (sorry!), I think this issue should become some priority for the project since it needs coordination and is quite limiting for deployments on e.g. readthedocs.
I think the overall project should migrate away from the memory-leaking and slow minidom (see comment + links above) to ElementTree/lxml as soon as anyone can work on that :)
@svenevs since you are working on exhale, did you see the same performance issues with large projects with breathe? :) My attempts to get rid of minidom in breathe are quite stalled, but I documented the progress above, and it's quite an issue on Read the Docs with limited resources. Just wanted to ping you.
Hehe. LOL well Exhale is reliably slower for different reasons.
Here's a braindump. It is probably more information than you want...
I'm finishing up some touches to upload it to pip (finally; the initial upload had a small bug, and apparently I did get templates working).
The "legacy" exhale was done in one file and (through dubious __dict__ crawling of breathe) explicitly crawled the breathe API to infer relationships. In the rewrite, I switched to BeautifulSoup with an lxml backend. The reason I bring this up is that I'm pretty sure the only reason "legacy" exhale even worked is that breathe loads the ENTIRE XML DOM into memory. Don't quote me on that because I really have no idea. I was never able to figure out how to extract the actual documentation through breathe compounds, so this could also indicate that the entire XML DOM is not loaded into memory.
- The parsing shouldn't need to change in breathe, because the generated API was constructed to match the Doxygen hierarchy.
- Rewriting breathe is an exercise in futility, as the generated super stuff is already in play.
- breathe already has the full XML specs represented.
- lxml is already installed. It does add some potential incompatibility (for Windows in particular) when doing pip install breathe, but you can also fall back on minidom if not present.
For breathe, setup.py basically would need to try-except importing lxml before specifying the requirements for installation. I don't think there is a way to try installing lxml and then fall back, though, which leaves people vulnerable to having to first pip install lxml and then pip install breathe. Presumably most people will overlook that, even if it's documented. In the except, just import the existing minidom or whatever.

In Exhale, I was very cautious to not build out the entire graph while loading in all XML documents. I know for a fact that I can get away with this, but I'm not sure that breathe can. The general workflow was:

- Open index.xml and start creating nodes to represent the graph.
- After index.xml is closed, individual (e.g., class or namespace or file) XML documents are opened, parsed, and closed.
- One exception is programlisting; that needs to go away at some point.

So basically, the question for breathe really is: does the ENTIRE XML DOM get loaded into memory at once? The above approach works because index.xml enumerates every possible type in the whole framework. Then, for things where <refid>.xml exists, you can use that to get more information such as base or derived classes, documentation of specific refids, etc.
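That two-phase workflow could be sketched like this (all file names and XML content are made up for illustration):

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Fabricated stand-in for a Doxygen XML output directory.
xmldir = tempfile.mkdtemp()
with open(os.path.join(xmldir, "index.xml"), "w") as f:
    f.write("<doxygenindex><compound refid='classFoo' kind='class'>"
            "<name>Foo</name></compound></doxygenindex>")
with open(os.path.join(xmldir, "classFoo.xml"), "w") as f:
    f.write("<doxygen><compounddef id='classFoo'>"
            "<briefdescription>A class.</briefdescription>"
            "</compounddef></doxygen>")

graph = {}
# Phase 1: open index.xml and create lightweight nodes for every compound.
index = ET.parse(os.path.join(xmldir, "index.xml")).getroot()
for compound in index.findall("compound"):
    graph[compound.get("refid")] = {"name": compound.findtext("name")}

# Phase 2: for each refid, open its own XML file, extract details, and let
# the parsed tree be garbage-collected before moving to the next one.
for refid, node in graph.items():
    tree = ET.parse(os.path.join(xmldir, refid + ".xml"))
    cdef = tree.getroot().find("compounddef")
    node["brief"] = cdef.findtext("briefdescription")
    del tree  # nothing but the extracted strings is retained

print(graph)  # prints {'classFoo': {'name': 'Foo', 'brief': 'A class.'}}
```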
I think I'm ready to re-upload to pip
tonight, I'll post back when I have because Exhale prints out the times it takes to parse and generate the rst docs, so it may be useful for you to compare that to breathe times. To test the breathe timing, though, you'd need to be working with a Development Install and inject some of your own benchmarks.
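For injecting such a benchmark into a development install, a minimal sketch might be a timing decorator; `parse_file` here is a stand-in, not an actual breathe function:

```python
import functools
import time

def timed(fn):
    """Decorator that reports the wall-clock time of each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("%s took %.3fs" % (fn.__name__, elapsed))
        return result
    return wrapper

@timed
def parse_file(filename):
    # Stand-in for whatever breathe parser method you want to measure.
    time.sleep(0.01)
    return filename

parse_file("classFoo.xml")
```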
Note: I don't actually have any memory benchmarks for Exhale, only timing. I only know how to do memory benchmarks in C++...
Thanks for the detailed answer!
does the ENTIRE XML DOM get loaded into memory at once?
The main problem is not that the entire XML DOM gets loaded at once; that is fine and efficient for access, and even large projects seldom generate more than 100 MB of plain XML (which is smaller in RAM when parsed efficiently). The main problem is that minidom is used for parsing, which is cluttered with memory leaks and blows it up more than ten-fold in RAM.
Ahhh OK I see. So you have two problems, with doxygenindex specifically being the culprit directive. If the API is stable enough, did you try using breathe-apidoc to generate the structure?
TBH though, I tried giving exhale a crack at it and I definitely don't think it'll work for you. (Note: I did something really dumb, which is why the requirements.txt in that is using the GitHub tag.) This was a QUICK test on my part; I hit one error and quit... your project is giant. Exhale also doesn't del stuff explicitly.
Exhale finishes its part, but the build itself fails afterwards. Could be a me-bug, but I'm pretty sure it's breathe and templates. The build failure for me was:
Exception occurred:
File "/usr/local/lib/python2.7/site-packages/docutils/nodes.py", line 582, in __setitem__
self.children[key] = item
IndexError: list assignment index out of range
Anywho, I guess the point is both seem to consume a LOT of memory when parsing the Doxygen XML for your project. For Exhale, I guess I have to lay blame on BeautifulSoup
(slash me not explicitly deleting things), and being generally wasteful about what I'm storing. For small and medium sized projects it seems ok, but yours is way too complex :cry:
Thx for giving it a try! Yes, I think the underlying problem is actually just 1., since each and every minidom XML read is memory-hungry/leaking. doxygenindex just happens to access all XML data, but using many directives of all other kinds leads to the same RAM expense (I already managed to cross the 1 GB RTD limit with just documenting a few hundred classes & files of ours).
My minidom replacements above toward ElementTree already reduced memory footprints significantly in my tests. I am just not deep enough into breathe's architecture to finish it so that all functionality works again :'(
@svenevs @michaeljones @SylvainCorlay did any of you, by coincidence, try to reduce the memory footprint of breathe? On RTD, my builds regularly get killed due to the leaks in minidom :'(

I tried above to get rid of minidom, but I don't understand the project deeply enough to get it refactored, unfortunately... any form of help / PRs to lighten the burden would be deeply appreciated :)
So if we're convinced that minidom is the problem, switching to lxml is probably the best course of action. Doing so will not be easy, though. The original code base was generated using generateDS; how it got set up originally is described here, though I'm not sure what files were used to generate it in the first place. Based on the current parser setup, I believe you only need index.xsd and compound.xsd. You can obtain those by generating Doxygen XML documentation once; they'll show up next to the other XML documents (e.g. right next to index.xml).
I think you may be able to regenerate using lxml, but then you'd need to crawl through the diffs and figure out what changed. The generateDS docs say it requires lxml; that may not have been the case when breathe was first created. It's not clear to me what the answer is, but I think there is a way to possibly have both available in the same code base. I'm looking here. But I also think that requires that breathe had been generated using export, which I don't think is true.
Myself, I would vote for replacing minidom with lxml entirely (if it actually works). If I remember correctly, minidom is pure Python, whereas lxml has a C backend (that is a belief, not a fact).
So to do it, I think the steps you would need to take are:

1. Re-run generateDS, making sure that the lxml stuff actually gets created. There's likely more than just setting export to etree; you'd have to read the docs there more carefully.
2. Then I think all that changes is this method to use lxml instead?

I don't know if the parser package can just be replaced though; there's likely a lot more to it :/
thanks for the information, that is super useful!
I gave it a little try and indeed in recent versions generateDS depends on lxml with a fallback to ElementTree :)
@michaeljones do you recall, or did you save, the original call to generateDS for the parser creation? Especially for the super-class generation, I don't understand how they got created in the first place. Did you use any specific Doxygen project for the generation? (Mine only contains a compound.xsd and an index.xsd...)
I tried so far:
generateDS -f -o compound.py -s compoundsuper.py --super="compoundsuper" ~/src/picongpu/docs/xml/compound.xsd
generateDS -f -o index.py -s indexsuper.py --super="indexsuper" ~/src/picongpu/docs/xml/index.xsd
which at least updates the four files in breathe/parser/ for me, but not the __init__.py etc.
But installing and trying it leads to errors of the form
File ".local/lib/python2.7/site-packages/breathe-4.7.3-py2.7.egg/breathe/finder/core.py", line 29, in create_finder
return self.finders[data_object.node_type](self.project_info, data_object, self)
AttributeError: 'DoxygenType' object has no attribute 'node_type'
I'm afraid I'm not up to date on this thread yet, though I will try to catch up soon. I don't believe I have the original generation command saved, as the generated files have long since been edited because parts of them felt like they were missing something. I think there are levels of indirection in the XML that weren't handled well, so manual adjustments seemed necessary, and at the same time the doxygen XML seemed stable enough that regenerating from a new schema was not going to be a win.
I believe that the generated files used to have the command at the top, or perhaps just a comment indicating that they were generated. I removed it when it became apparent that I wasn't going to regenerate them. It might still be available in the source history though.
Thanks for catching up! :)
Unfortunately, the old commands only have a `Generated Mon Feb 9 19:08:05 2009 by generateDS.py.` comment on top. Newer versions of generateDS add the full command line that was used for creation, including version information.
I gave it a little try and indeed in recent versions generateDS depends on lxml with a fallback to ElementTree :)
Sweet. I believe what this means is: if this is all successful, breathe would not need to depend on lxml (likely more Windows-friendly). But assuming it tries to use lxml first, it will definitely use that on RTD! If this all works out, we can add to the installation guide: try to pip install lxml first, but if you cannot get it working, breathe will still work. I can help test that down the road.
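The "prefer lxml, fall back to the stdlib" idea might be sketched as a conditional import; this is an illustration, not breathe's actual code:

```python
# Hypothetical backend selection: prefer lxml when available, fall back to
# the stdlib so that e.g. Windows users without lxml are not blocked.
try:
    from lxml import etree  # C-backed, faster and leaner
    BACKEND = "lxml"
except ImportError:
    import xml.etree.ElementTree as etree  # always available
    BACKEND = "ElementTree"

# Both backends expose the same fromstring/find/get calls used here.
root = etree.fromstring("<doxygen><compounddef id='classFoo'/></doxygen>")
print(BACKEND, root.find("compounddef").get("id"))
```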
File ".local/lib/python2.7/site-packages/breathe-4.7.3-py2.7.egg/breathe/finder/core.py", line 29, in create_finder
return self.finders[data_object.node_type](self.project_info, data_object, self)
AttributeError: 'DoxygenType' object has no attribute 'node_type'
That is part of "step 2"; sorry if that wasn't clear. You'll have to manually splice that all back into the super / sub classes :scream: E.g.

That guy is the only thing that actually tells the rest of breathe what this is (either for directives, signatures, or rendering things like #include <path>).
Also, I'm not sure if you are aware, but you can do "editable" installs with pip
which will make this much easier for you to develop than continually reinstalling.
# In directory
# /some/path/breathe
# - setup.py
# breathe/
# parser/ ...
$ pip install -e .
The egg link is generated in the local directory, and you can keep updating the files here without needing to re-install.
That is part of "step 2",
That was too late at night; reading your comment again, it makes sense: it's just another attribute in the generated files. Making small progress...

I think the main work now is, after replacing the old generated files with newer ones, to go through manually written files like the finder. There we have to make sure they do not use old "minidom"-type members such as node_type, but the member attributes exposed by the new (lxml/ElementTree) parser. Side note: node_type is just the tag name, as it looks so far.
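If node_type really is just the tag name, a new parser could derive it from the element itself rather than hard-coding it per generated class; a hypothetical sketch:

```python
import xml.etree.ElementTree as ET

class ParsedNode:
    """Hypothetical wrapper: node_type falls out of the element tag
    instead of being a hard-coded class attribute."""
    def __init__(self, element):
        self.element = element

    @property
    def node_type(self):
        # e.g. "doctitle" for a <doctitle>...</doctitle> element
        return self.element.tag

node = ParsedNode(ET.fromstring("<doctitle>My Class</doctitle>"))
print(node.node_type)  # prints doctitle
```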
Btw, I found the changelog of generateDS (probably something like 1.18 in 2009 to 2.29 today).
I have currently added all required node_type attributes and am now sweeping through side-effects. It does not fully build yet, but from a first test the memory consumption is still really bad... https://github.com/ax3l/breathe/tree/topic-elemTree
Are you able to quantify "really bad"? Is it still greater than 4GB?
Also, what version of Python are you using? In compound.py, the docTitleType has 270 parameters. Support for more than 255 arguments was only introduced in Python 3.7.
Thanks for the hint.
Yes, the memory consumption is still similarly high: at least ~1.9 GByte for parsing my ~130 MB of doxygen XML input (but it does not yet build fully with my changes). I think I never saw up to 4 GByte on my machine; ~2.3 GB is what I see with master.

I am testing with Python 2.7 and 3.5; interesting issue with the compound.py param count!
I am not sure if I posted it already, but this is how one can reproduce it with our data:
git clone https://github.com/ComputationalRadiationPhysics/picongpu.git
cd picongpu
git checkout e3cc934c31020ef19a387a9987fac1f5b7595742
cd docs
doxygen
du -hs xml/
# 132 MB
make html
# njom njom, RAM be gone
Are you working with an installed copy of your breathe/topic-elemTree branch? That's what I installed, and when I ran your above instructions, the 255-args thing arises:
sven:~/Desktop/picongpu/docs> make html
Running Sphinx v1.6.7
making output directory...
Exception occurred:
File "/usr/local/lib/python3.6/site-packages/breathe/parser/__init__.py", line 3, in <module>
from . import compound
File "/usr/local/lib/python3.6/site-packages/breathe/parser/compound.py", line 6809
, emsp=None, thinsp=None, zwnj=None, zwj=None, lrm=None, rlm=None, ndash=None, mdash=None, lsquo=None, rsquo=None, sbquo=None, ldquo=None, rdquo=None, bdquo=None, dagger=None, Dagger=None, permil=None, lsaquo=None, rsaquo=None, euro=None, tm=None, valueOf_=None, mixedclass_=None, content_=None):
^
SyntaxError: more than 255 arguments
The full traceback has been saved in /var/folders/wx/j7610hj10r3_h1xs4xr41bgw0000gn/T/sphinx-err-jai8fnd_.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [html] Error 1
Note: I'm not really sure that going through and fixing this will remedy the memory consumption errors. Did you ever look into the cache stuff? I feel like that may be where the memory overhead is coming from. I'm not familiar with the implementation, but presumably sphinx is trying to "pickle" everything that is cached. I would not be surprised if that is the real problem here. However, with "regular" breathe installed it fails to build at a different point, and du -hs build/doctrees only reports 1.9M. So that may not be the issue either :cry:
I was playing around with generateDS, hit that, and got really sad. It is solvable, but has to be done manually. Interestingly, though, the original parser does not have this problem:

Maybe you can just ignore it (and keep the original classes)? I think maybe breathe parses the documentation manually rather than relying on this to be represented by the XML? Having these >255 arguments means that later in the pipeline somewhere you have to check them all and turn them into markup (I think).
I lost motivation to pursue, but this is how you would solve it. I think there are only two or three classes this applies to, and it's mostly just because it's representing all of the possible markup options from Doxygen. The above error comes from something like this
class docTitleType(GeneratedsSuper):
node_type = "doctitle"
subclass = None
superclass = None
def __init__(self, ulink=None, bold=None, emphasis=None, MANY_OTHERS):
self.original_tagname_ = None
if ulink is None:
self.ulink = []
else:
self.ulink = ulink
if bold is None:
self.bold = []
else:
self.bold = bold
if emphasis is None:
self.emphasis = []
else:
self.emphasis = emphasis
For both compound.py and compoundsuper.py, you have to manually change this to be **kwargs instead of default args, but then I think the rest of it will be OK. By switching to **kwargs, though, you now have to worry about whether or not an argument was specified. Python already has a solution for this, using dict.get(key, defaultIfNotFound):
def __init__(self, **kwargs):
self.original_tagname_ = None
# I'm 95% sure you can still do kwargs.get(...)
self.ulink = kwargs.get("ulink", [])
self.bold = kwargs.get("bold", [])
self.emphasis = kwargs.get("emphasis", [])
In the sub-class, instead of
class docTitleTypeSub(supermod.docTitleType):
node_type = "doctitle"
def __init__(self, ulink=None, bold=None, emphasis=None, MANY_OTHERS):
arglist_ = (ulink, bold, emphasis, MANY_OTHERS)
super(docTitleTypeSub, self).__init__(*arglist_)
you just forward kwargs:
def __init__(self, **kwargs):
    super(docTitleTypeSub, self).__init__(**kwargs)
@ax3l (and anyone else interested) I gave a stab at fixing generateDS to use a **kwargs approach; initial work can be found here. The author has been very supportive so far, but he and I are both worried about making a change like this, as it could have unexpected consequences.

@michaeljones do you happen to recall whether, when you ran generateDS way back in the day, you ran into issues with >254 arguments? If so, did you just delete those classes? Or has doxygen updated / added things that make it so that the 254 problem exists now but did not back then?
Would using doxygen generated sqlite database instead of parsing xml help speed up the generation?
It would probably be quicker than the current method. A proper, efficient XML parser would be at least about as fast as using a sqlite database, though.
@melvinvermeeren Our codebase has around 5000 files; as of now the build takes around 6-7 hours to complete, so I am looking for ideas to speed up the process.
Note: https://github.com/mosra/m.css finishes in minutes, I should take a look to see if I can get some ideas from there.
Sorry to revive a very old thread, but I'm trying to debug some breathe performance issues. My sphinx build goes from 20sec to 3min when I add breathe (and one page referencing a few dozen .. doxygenstruct::
). I didn't fully follow this thread but here is my understanding:
There's quite a bit of discussion on replacing minidom, something about some generated code (which I'm not really following), and there was one comment suggesting that disabling the cache actually might help. @ax3l , @svenevs, @michaeljones and anyone else who might be following this thread: do you think 1-2 weeks of dedicated engineering time would be sufficient to solve this problem? Or is it even bigger than that?
I have some person hours I can donate to the project, but not for too long. Where would be the best place for them to start? (parser/__init__.py line 99 doesn't exist anymore; I didn't see anything obviously doing caching in there.)
@vermeeren are you the current maintainer? I saw that you commented on #544. Maybe you know?
Brief response from phone: the snag is that the way breathe represents its hierarchy internally relies on something (generateDS) that takes the Doxygen XML schema and turns it into Python classes.

For as close as I got, I don't think it's a great path forward. I think there are enough test cases in place to know if things fully break, meaning changing something like the backend parsing could be achievable. But I'm not sure how much of the rest of breathe needs to change if the generateDS graph goes away (all the compound super stuff).
My vote at this point would be using lxml to parse, or BeautifulSoup + lxml, but BeautifulSoup may add noticeable overhead. FWIW, I heard a tale about lxml and opening too many files at once leading to slowdowns, so if implemented you'd want to be sure to close things as soon as you're done with them.
Changing the xml parsing would be hugely valuable to this project, but I'm not sure how substantive of an effort this will be (current maintainers could give a better idea). If it gets started but not finished I'm open to trying to slowly pick up the slack, this is something we've all desired for a long time but it's a big cookie.
@cheshirekow Although I am indeed the current (primary) maintainer since 2018-ish there are many technical parts of Breathe that I haven't worked with in-depth, which includes the XML parser. Generally speaking I have an idea of what and where but there are many specifics that I don't yet know much of. (Often this is enough to help with issues, contributions and questions, but for plenty of complex topics it isn't.)
So I'm afraid I likely know less than the other participants in this thread and have no idea how much time it would take to implement the changes desired or what you could run into. @michaeljones probably knows most, but considering the years passed since he wrote the current implementation it might be tricky to recall the specifics.
Yes, it has been a long time and the details are quite blurry to me. I've just re-read the thread but don't have anything specific to add.
I realise it is all long overdue. I would be happy enough to attempt it if properly funded. Though my preferences are not really for Python any more. When I think of tackling this, I dream of re-writing some of the parsing & data handling in Rust and expose just enough to the Python layer for it to fill out the structures that Sphinx needs. That would be silly if there was a more drop in alternative for minidom that would solve the issue more cleanly but I'm not active enough in the Python ecosystem to understand that route.
A step forward perhaps, would be for people to provide example code bases to work with so there is something to profile and inspect and then anyone that attempts it to report data and findings back here. I don't really know what the memory tracing and performance tooling is like in Python though.
Some time ago I looked into regenerating the parser with generateDS both because the Doxygen XML has changed with newer versions, but also because it seems to use lxml now. So I think that may be easiest first step.
However, the parser has been manually tweaked since the auto-generation (82 commits have changed the parser/ folder), and many of them add functionality which may or may not be auto-generated nowadays. So my investigation back then recursed into making tests for these things, but we don't really have a proper testing framework for the things in the examples/specific/ folder.

So, assuming we would have development resources, my suggestion for a "roadmap" would be: 1) Make a testing setup (see #733) where for each test we can specify
I took the liberty of profiling breathe, to the best of my ability, to help isolate what's causing such a significant slowdown.
I profiled a private project which takes about 30 seconds to generate doxygen documentation for, but to which the introduction of breathe adds an addition ~4 minutes of build time. Due to its nature, I can't share the project itself, but I can share the methodology by which I went about profiling.
After downloading both sphinx's and breathe's source code into my project's main folder, I introduced cProfile as a wrapper around Sphinx.build in sphinx's sphinx/application.py. Any actions which happen outside of that call stack would not have been captured by my profiling, but the profiler did run in the neighborhood of 4 minutes (which lines up with breathe's added overhead).
tottime represents time spent directly in a function, while cumtime is the sum of tottime plus any time spent in calls to sub-functions.
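A minimal version of that setup, with a stand-in workload in place of Sphinx.build, might look like:

```python
import cProfile
import io
import pstats

def build():
    # Stand-in workload for sphinx's Sphinx.build call.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
build()
profiler.disable()

# Sort by cumulative time (tottime plus time in sub-calls) to find hotspots.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```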
Based on my profiling, I'm not certain that caching is the culprit when it comes to runtime performance. It seems that minidom's parsing itself is under-performing. This supports the conclusion that replacing minidom may be the best path forward for breathe, though it does suggest that the issue goes deeper than memory mismanagement.
Thanks for sharing those results @DuTogira - definitely interesting to see.
@michaeljones / @svenevs based on this, what's the best path forward? It seems we have multiple avenues, including:
Are either of these feasible avenues forward which we could hope for in the near future?
I'm not sure anyone is in a good position to answer that. It would take some time to assess the current state of the code and figure out the best way forward. To my knowledge, no one active on the project has a full understanding of the code base due to either having not worked on all of it or having forgotten what they once knew.
There are a few of us who could be funded to work on it but otherwise it is for others to pick up. Speaking for myself, I'm afraid that even getting into a position where I could advise on the best way forward would take time that I don't currently have the motivation for.
I'm sorry that these statements are a bit of a bummer. I'm keen to see the project properly funded by the companies that benefit from its existence but I'm not sure how to go about making that a reality so I'm neither moving forward with funding nor with active work in the absence of funding. We do have an open collective if you (or your company should there be one) are in a position to offer funding.
We do have an open collective if you (or your company should there be one) are in a position to offer funding.
Can you tell me more about the funding model via this collective? Can we "buy a feature", or is this a general "support us if you use us" kind of thing? I can certainly request my employer to fund the performance improvements we are looking for, and I'm happy to push the buttons to get that started.
We can also fund the development with (some) eng hours to do the work, though that would go best with some partnership to guide the developer on what needs to be done where.
Thank you for asking. We don't have a well-established model yet, but at least two of our maintainers are able to work under a "buy a feature" approach, I think. I'm not experienced with how those discussions go or how to price the work, but it would definitely be something that we'd be happy to explore.
We'd also be happy to have "pay it if you use it" funding and put those funds towards specific feature work but I can understand if that seems like less of a direct return for some business interests.
If anyone is still interested in lxml, I started a PR for it: #891. It's not complete, but unit tests are passing.
I am trying to set up an automated build of our docs on RTD, which is limited to 1 GB memory usage (and a 15 min build time).

The sphinx step is at this point a rather large bottleneck. Is it possible to somehow limit its memory consumption? (Is this even a breathe issue or a sphinx issue?)