Identify what the critical scripts are
This is what we got from the feedback:
Collect test cases
I have test cases from SBS (which I unfortunately cannot share). @josteinaj has test cases for NLB. I suggest we collect some information about the books (names and sizes) and make an overview here of how long their conversions take, for different revisions.
I've shared NLB books here: https://github.com/snaekobbi/profiling/tree/57ba78c3740cd595eb76cc36887464d69f722a3d/resources/size-test
If you want to share SBS's books (in a test suite etc.) you can use this XSLT: https://github.com/snaekobbi/profiling/blob/57ba78c3740cd595eb76cc36887464d69f722a3d/resources/size-test/randomize-xml-text-content.xsl
The plan is now to try to identify the most expensive steps, similar to how Jostein has done it before, by inspecting the time differences between log/progress messages (from the Pipeline log and/or Calabash's profile file).
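As a rough illustration of the log-inspection approach (not an existing Pipeline tool), a small script like the one below could report the gaps between consecutive log messages so that the expensive steps stand out. The timestamp format and the `LogGaps` class name are assumptions and would need to be adjusted to the actual log layout:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: print the time gaps between consecutive log messages so that the
// most expensive steps stand out. Assumes each line starts with a timestamp
// like "2017-05-04 10:15:30,123"; adjust TS and FMT to the actual log format.
public class LogGaps {
    private static final Pattern TS =
            Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),\\d{3}");
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) throws IOException {
        LocalDateTime prev = null;
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = TS.matcher(line);
            if (!m.find())
                continue;
            LocalDateTime t = LocalDateTime.parse(m.group(1), FMT);
            if (prev != null) {
                long seconds = Duration.between(prev, t).getSeconds();
                if (seconds >= 10) // only report gaps of 10 seconds or more
                    System.out.println(seconds + " s before: " + line);
            }
            prev = t;
        }
    }
}
```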
The `css:split` step is clearly where the biggest room for improvement is right now (even though Jostein has already optimized this step). Maybe this step needs a Java implementation.
In a test book that I converted in 820 seconds, 724 seconds were spent in the `css:split` step.
- Is our middleware a bottleneck? Does Saxon EE help?
We haven't tried Saxon EE. Moving from Calabash to Morgana would be too much work at the moment for probably little gain.
- Is there some bad coding (or rather, code that is not being optimized by the runtime engine)?
- Is there a more fundamental problem with the used technologies (upper limit on performance)?
I think we can conclude that the poor performance is mainly due to unoptimized code, but that there is also a more fundamental problem with the technologies used, XProc and XSLT (the fact that we need to reimplement some steps in Java proves that). Reimplementing the css-to-obfl step as a whole in Java is a possibility I keep in the back of my mind, but I consider it only a last-resort optimization for situations where performance becomes really important. (Reimplementing would be much easier than starting from scratch because we already have a lot of tests and a good understanding of the CSS model, but it would be a big job nevertheless.)
The second performance problem is high memory usage.
Identified issues:
[x] Running big XProcSpec test suites (e.g. pipeline-mod-sbs, pipeline-mod-dedicon) is very slow and may lead to OutOfMemoryErrors. My guess is that the slowness is due to the increased CPU usage that goes with it, which could indicate excessive page swapping. Another possibility is garbage collection. The fact that the OutOfMemoryError or increased CPU activity only happens after running a number of tests suggests some kind of memory leak, because in theory no big state should be kept in memory between conversions. It is unclear whether this is an issue in the test framework or in the Pipeline itself. In the latter case it would mean a Pipeline server would run out of memory after running a number of jobs. The next issue seems to confirm this.
This problem seems to be unrelated to the problem in XProcSpec (which has already been fixed) that made tests very slow because the log file had to be loaded and stored many times (see https://github.com/daisy/pipeline-tasks/issues/86).
[x] In the past, when intensively testing dtbook-to-pef, the server would become very slow after a number of jobs and eventually come to a halt.
[x] https://github.com/snaekobbi/pipeline-mod-dedicon/issues/58
[ ] https://github.com/daisy/pipeline/issues/207
Getting the web service to hang is easy to reproduce in the sense that it happens fairly often, after just a few books have been sent through, but hard to reproduce in the sense that it may not happen in the same place or with the same book. One situation that appears to trigger this problem is sending through a large book that exhausts the heap.
Memory issues are more difficult to pinpoint because you can't just look at time stamps of log messages. I'm not sure how to approach this. Analyzing the memory usage with YourKit could be a first step: https://www.yourkit.com/features/#memory.
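As a lightweight complement to YourKit, here is a hedged sketch of what a heap probe could look like, using only the standard `MemoryMXBean`; the `HeapProbe` class is hypothetical and not part of the Pipeline code base. Calling it after every job and watching whether the reported usage keeps climbing even after a GC would help distinguish a leak from a single oversized job:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Illustrative heap probe (not part of the Pipeline): call report() after every
// conversion to see whether used heap keeps growing even after a GC, which
// would point to a leak rather than to a single big job.
public class HeapProbe {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    public static void report(String label) {
        MEMORY.gc(); // request a GC so that mostly live objects are counted
        MemoryUsage heap = MEMORY.getHeapMemoryUsage();
        System.err.printf("[%s] heap used: %d MB (max %d MB)%n",
                          label, heap.getUsed() >> 20, heap.getMax() >> 20);
    }
}
```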
re: Running big XProcSpec test suites
All scenarios inside a single xprocspec document are run (through `p:eval`) from the same Calabash instance, and the inputs/outputs/expectations are all collected in one big XML document which is used for evaluating the results and generating the report. This intermediate XML document can get pretty big if a lot of documents are used as inputs/outputs/expectations, so for xprocspec documents that take a long time to run it could be worth splitting them into smaller documents and comparing whether or not that speeds things up.
I think I tried that already and it didn't make a difference. The SBS test suite is already split up into multiple files, and the individual files and their inputs and expected outputs aren't even that big; not enough, I think, to explain the hundreds or thousands of MB that get consumed. The dotify-formatter test contains 100 scenarios and that one works fine, by the way. Also, the fact that the consumed memory keeps increasing over several files doesn't match your explanation.
I think I'm finally on to something: the problem might be that Calabash stores `XProcRuntime` instances in a class called `XProcRegistry`, and that they aren't always cleaned up properly.
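For context, this is the general shape of such a leak, as an illustrative sketch only and not Calabash's actual code: a long-lived static registry keeps strong references to per-run objects, so unless every code path that registers a runtime also unregisters it, the instances (and everything they reference) stay reachable forever and the heap keeps growing:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of the leak pattern described above (not Calabash's actual code):
// a static registry holds strong references to per-run objects. Unless every
// code path that registers a runtime also unregisters it, the garbage collector
// can never reclaim those instances.
class RuntimeRegistry {
    private static final Map<Object, Object> REGISTERED = new ConcurrentHashMap<>();

    static void register(Object runtime, Object owner) {
        REGISTERED.put(runtime, owner);
    }

    static void unregister(Object runtime) {
        // If a caller forgets this (e.g. on an error path), the runtime leaks.
        REGISTERED.remove(runtime);
    }
}
```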
If you look at the history you can see that there has been a lot of fiddling in this area, but things are still not right. In chronological order:
@rdeltour Wow, there appears to be a 6-year-old pull request by you that never got merged and that could fix an issue I'm facing after upgrading Calabash. I'm still trying to figure out how it could even work without your patch 😕.
The first memory issue (running big test suites) is solved! It turns out that the issue I had already mentioned in https://github.com/ndw/xmlcalabash1/issues/82#issuecomment-15068649 had been the cause all along. It's still not super fast, and the consumed memory still keeps increasing somewhat, but I can at least run the test suite again. This calls for a celebration!
I'm just going to make a PR of what I have done so far and defer the remaining issues to a later time. I'm also going to make a PR to Calabash upstream with my quick fix and ask whether it can be fixed properly, e.g. with the help of Romain's old PR.
In August we should test whether https://github.com/daisy/pipeline/issues/207 is still an issue.
There is also the issue of dtbook-to-odt jobs hanging when math is enabled. I haven't been able to reproduce it yet. (EDIT 11/9: this has been fixed in the meantime.)
After fixing the memory leak in Calabash, the braille scripts perform much better. However, speed is still not optimal; for example, the dtbook-to-pef script pales in comparison with Dotify. Whether a job takes a few seconds or a few minutes makes a big difference when you want to run it repeatedly after editing the input.
We should continue looking for possible optimizations. Identifying the pain points should become easier once progress messages have been implemented.
To do: create actionable items.
See also: