dmbaturin closed this issue 3 months ago.
Do you have a ready example of 13MB input, or instructions for generating it?
I'm not familiar with the output of either tool. What is being shown here? Is valgrind showing process heap usage, in terms of how much heap space has been requested from the kernel? The C heap usage? The OCaml heap usage? Would we see OCaml garbage collection in valgrind output?
The graph in the memtrace output has no label on the Y axis. Beneath, it says "allocations" in a sentence. Is it showing heap usage at each point, or allocations performed so far? I assume the former, since the curve sometimes goes down. This would suggest a memory leak.
The manual file is 174 KB, but it takes 10 MB of heap space to parse.
Have you confirmed that it takes 10 MB of space to parse? Try running the file through Markup.ml only and seeing if it takes that much space. It could simply be that the memory representation of the DOM in Lambda Soup takes 10 MB (which may still be a problem).
The reason for my second comment is that we did use spacetime to look for memory leaks in Markup.ml, and we could easily see the sawtooth pattern of GC collecting memory. This means that either you've found a pathological case with a memory leak, or the graphs above are showing the memory consumption of the Lambda Soup tree being built up.
Have you confirmed that it takes 10MB of space to parse?
You are right! Parsing as such (open_in Sys.argv.(1) |> Markup.channel |> Markup.parse_html |> Markup.signals) only takes about 4MB. But adding Soup.from_signals makes the memory consumption of the test program jump to ~600MB.
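For reference, a minimal sketch of the two variants being compared (the function names parse_only and parse_and_build are only illustrative):

(* Variant 1: construct the signal stream only, as quoted above. *)
let parse_only file =
  open_in file |> Markup.channel |> Markup.parse_html |> Markup.signals

(* Variant 2: additionally build the in-memory Lambda Soup DOM. *)
let parse_and_build file =
  open_in file |> Markup.channel |> Markup.parse_html |> Markup.signals |> Soup.from_signals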
Do you have a ready example of 13MB input, or instructions for generating it?
I've attached the file in question. 13M.html.gz
Is valgrind showing process heap usage, in terms of how much heap space has been requested from the kernel?
Yes, to my understanding, that's heap space requested from the kernel.
Is it showing heap usage at each point, or allocations performed so far?
That memtrace plot shows allocations that are live at a certain point.
I assume the former, since the curve sometimes goes down. This would suggest a memory leak.
From a real soupault run I could see that the memory usage goes down by orders of magnitude when it finishes processing the 13M page and goes on to smaller pages. So it's not a leak as such (the memory isn't becoming permanently allocated).
open_in Sys.argv.(1) |> Markup.channel |> Markup.parse_html |> Markup.signals
Could you more fully confirm this by draining the stream using Markup.drain?
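Spelled out as code, that suggestion would look roughly like this (a sketch; the same pipeline is confirmed later in the thread):

(* Force the whole lazy signal stream to be produced and consumed,
   without building the Lambda Soup DOM. *)
let () =
  open_in Sys.argv.(1)
  |> Markup.channel
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.drain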
From a real soupault run I could see that the memory usage goes down by orders of magnitude when it finishes processing the 13M page and goes on to smaller pages.
Is that after you drop the reference to the DOM produced for the first, large page? There might still be effectively a memory leak within a big object graph (i.e. a bunch of unnecessary memory somehow retained).
What does Obj.reachable_words return for the DOM you get?
Eyeballing the file and the current Lambda Soup element representation, I expect roughly a 10x blowup in size for each <span> tag in the DOM version relative to its HTML form. The text nodes have less of a blowup. This can be optimized, but it wouldn't explain a 50x blowup.
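To see where such a blowup comes from, here is a purely hypothetical element representation (not Lambda Soup's actual types): every field, list cell, and string is a separate heap block with its own header, so a six-byte <span> in the file easily becomes dozens of words in memory.

(* Hypothetical illustration only, not Lambda Soup's real node type. *)
type node = {
  mutable parent : node option;          (* back-pointer: a field plus an option block *)
  mutable children : node list;          (* one three-word cons cell per child *)
  name : string;                         (* "span": a header word plus padded data *)
  attributes : (string * string) list;   (* a block per pair, per string, per list cell *)
}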
Could you more fully confirm this by draining the stream using Markup.drain?
Interesting! It looks like calling Markup.drain reclaims all that memory and solves the problem. I'm going to run more tests and get back to you.
I recommend also passing the ~report argument to parse_html and printing any errors:
~report:(fun location error -> Markup.Error.to_string ~location error |> print_endline)
just in case the issue is that Markup.ml is struggling in some error-recovery state from ill-formed HTML somewhere (unlikely, given you tried two different kinds of inputs).
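In context, that suggestion would look roughly like this (a sketch; only the ~report argument is the part being added):

(* Print each parse error with its location as it is reported. *)
let report location error =
  Markup.Error.to_string ~location error |> print_endline

let () =
  open_in Sys.argv.(1)
  |> Markup.channel
  |> Markup.parse_html ~report
  |> Markup.signals
  |> Markup.drain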
Wait, I completely misunderstood the purpose of Markup.drain. The point stands: Markup.channel |> Markup.drain doesn't consume 600MB of RAM while processing that file. But changing drain to Soup.from_signals consumes all that RAM.
This is the latest version of the test program.
let _ =
  let chan = open_in Sys.argv.(1) |> Markup.channel in
  let signals = Markup.parse_html chan |> Markup.signals in
  (* Build the Lambda Soup DOM from the signal stream. *)
  let s = Soup.from_signals signals in
  (* Force a full collection so only live data remains before measuring. *)
  let () = Gc.full_major () in
  Printf.printf "Reachable words: %d" (Obj.reachable_words @@ Obj.repr s)
The output is: Reachable words: 24033120.
The output is: Reachable words: 24033120.
That implies that the Lambda Soup DOM is about 190MB in size (24,033,120 words at 8 bytes per word on a 64-bit system is roughly 192MB). Are there close to 24M live words at this point?
I also just realized that for single-character text nodes, the size blowup is about 80x relative to their in-file representation, due to all the overhead. The 13M file has a lot of text nodes containing only "/". This ratio might partially explain the 50x overall blowup.
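(As a rough, hypothetical account of the 80x figure: a one-byte "/" text node in the file needs, in memory, at least a two-word string block plus a node record, a parent pointer, and the list cells linking it into its parent, which is on the order of ten words, i.e. about 80 bytes on a 64-bit system.)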
Markup.channel |> Markup.drain
Were you literally draining the character stream returned by Markup.channel? Or did you actually do
Markup.channel |> Markup.parse_html |> Markup.signals |> Markup.drain
i.e. actually run the parser and drain the signal stream?
Or did you actually do
Markup.channel |> Markup.parse_html |> Markup.signals |> Markup.drain
Sorry, I think the first time I did it wrong indeed! I've just done Markup.channel |> Markup.parse_html |> Markup.signals |> Markup.drain, and that took about 350MB of RAM. 350MB for the parsing and 190MB for the DOM seems to add up to the original RAM usage figure (more or less).
Also, that page seems nearly impossible to open in a browser in practice, so I think the issue is mostly academic. Usable pages consume sane amounts of RAM, and pages that cause trouble don't seem usable. But I'm still curious whether it could use less.
It's definitely possible to reduce the memory consumption of the DOM by at least a constant factor, say 1.5x or 2x. With more extensive and invasive optimizations, it should be possible to reduce usage much further.
I am concerned about the 350MB for parsing. The parser should be able to "retire" most blocks from memory. Is the GC simply not running during the parse? If so, that figure seems plausible. But then I'm concerned that on your user's VM the OCaml runtime didn't force a garbage collection (or did it, and was it only the DOM part of the memory usage, which isn't eligible for collection, that caused the OOM?).
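One way to check whether the GC actually runs during the parse, sketched with the standard Gc module (an assumption about how one might measure it, not something done in this thread):

let () =
  let before = (Gc.stat ()).Gc.major_collections in
  (open_in Sys.argv.(1)
   |> Markup.channel
   |> Markup.parse_html
   |> Markup.signals
   |> Markup.drain);
  let after = (Gc.stat ()).Gc.major_collections in
  (* If this prints 0, no major collection happened while parsing. *)
  Printf.printf "Major collections during parse: %d\n" (after - before)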
Also, that page seems nearly impossible to open in browser in practice
It's good to hear. I've generally found that Markup.ml (and Lambda Soup) have the same behavior as browsers, including error recovery, etc., so it's nice to hear that huge-page behavior is, if not the same, at least similarly unacceptable in both :) But as noted above, it's likely possible to improve Markup.ml and/or Lambda Soup with relative ease to do a bit better.
As of now, I think:
- … the self field. A slightly trickier change is inlining the text field and using it for both text node content and element names. That should save some block headers. The node returned from the surface Lambda Soup API would then have to be, if it is a text node, a pair of a parent node pointer and the pointer to the text node in the DOM.
- We could hash-cons the string values of text nodes and element names (also known as string pooling) to reuse many blocks; a sketch follows this list.
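A minimal sketch of what that hash-consing could look like (the pool table and intern function are illustrative names, not existing Lambda Soup internals):

(* Keep one shared copy of each distinct string so that repeated element
   names ("span") and repeated text nodes ("/") reuse the same heap block
   instead of allocating a fresh string every time. *)
let pool : (string, string) Hashtbl.t = Hashtbl.create 4096

let intern s =
  match Hashtbl.find_opt pool s with
  | Some shared -> shared               (* already pooled: reuse it *)
  | None -> Hashtbl.add pool s s; s     (* first occurrence: remember it *)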
I'm not clear whether you mean, or have fully checked, that the other 2/3 that arose in Markup.ml during parsing was eligible for GC, or whether it actually triggered the user's OOM
I'll try to do more detailed tests to confirm that.
From your comments above, it appears to me that your user has abandoned this input
The user has abandoned it, I think. I also confirmed that GitHub refuses to display that file in the web UI and only offers an option to download it raw.
So, while it would definitely be cool to make Lambda Soup/Markup.ml handle very large inputs faster than browsers do, it will not help anyone actually view absurdly large files on the web, and it may indeed be wasted effort.
I'm going to close this issue since we looked into it, but it doesn't look like there is a need to solve it. We can return and do some optimizations upon future demand. Thank you!
I discovered that lambdasoup/markup.ml takes many times the size of a file in heap space to parse it. Consider a trivial test program:
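A minimal sketch of such a program, based on the pipeline discussed in the rest of the thread (not necessarily the exact original):

(* Parse the HTML file given on the command line, build the Lambda Soup
   DOM from the parser's signal stream, then discard it. *)
let () =
  open_in Sys.argv.(1)
  |> Markup.channel
  |> Markup.parse_html
  |> Markup.signals
  |> Soup.from_signals
  |> ignore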
This is a valgrind/massif report for parsing soupault's reference manual. The manual file is 174 KB, but it takes 10 MB of heap space to parse. That's tolerable.
With larger files it gets progressively worse. Parsing a 14MB file requires almost 600MB of memory allocations (45 times the size of the file).
That is enough to cause a small CI worker to run out of memory and fail the build. That's how I discovered the issue: a soupault user reported an OOM on a VM with 1 GB of RAM when trying to process a 13 MB auto-generated file. It's certainly an extreme case, but not an invalid one.
From my memtrace profiling, it looks like the cause is that heavy use of CPS creates lots of unreachable allocations faster than the GC can clean them up.
Do you think anything can be done to keep the allocation rate lower?
I've attached the memtrace.
trace.ctf.gz