aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

Parsing a file of size X can take up to 50X heap space #47

Closed: dmbaturin closed this 3 months ago

dmbaturin commented 2 years ago

I discovered that lambdasoup/markup.ml takes many times the size of a file in heap space to parse it. Consider a trivial test program:

(* Trivial test program: read the file named on the command line and parse it. *)
let src = Soup.read_file Sys.argv.(1)

let soup = Soup.parse src

This is a valgrind/massif report for parsing soupault's reference manual. The manual file is 174 KB, but it takes 10 MB of heap space to parse. That's tolerable.

[valgrind/massif screenshot, 2022-02-04 18:45]

With larger files it gets progressively worse: parsing a 14 MB file requires almost 600 MB of allocations (45 times the size of the file).

[valgrind/massif screenshot, 2022-02-04 18:49]

That is enough to cause a small CI worker to run out of memory and fail the build. That's how I discovered the issue: a soupault user reported an OOM on a VM with 1 GB of RAM when trying to process a 13 MB auto-generated file. It's certainly an extreme case, but not an invalid one.

From my memtrace profiling, it looks like the cause is that heavy use of CPS creates lots of unreachable allocations faster than the GC can clean them up.

[memtrace screenshot, 2022-02-04 19:11]

Do you think anything can be done to keep the allocation rate lower?

I've attached the memtrace.

trace.ctf.gz

aantron commented 2 years ago

Do you have a ready example of 13MB input, or instructions for generating it?

I'm not familiar with the output of either tool. What is being shown here? Is valgrind showing process heap usage, in terms of how much heap space has been requested from the kernel? The C heap usage? The OCaml heap usage? Would we see OCaml garbage collection in valgrind output?

The graph in the memtrace output has no label on the Y axis. Beneath, it says "allocations" in a sentence. Is it showing heap usage at each point, or total allocations performed so far? I assume the former, since the curve sometimes goes down. If so, the overall growth would suggest a memory leak.

aantron commented 2 years ago

The manual file is 174Kbytes, but it takes 10Mbytes of heap space to parse.

Have you confirmed that it takes 10 MB of space to parse? Try running the file through Markup.ml only, and see if it takes that much space. It could simply be that the in-memory representation of the DOM in Lambda Soup takes 10 MB (which may still be a problem).

aantron commented 2 years ago

The reason for my second comment is that we did use spacetime to look for memory leaks in Markup.ml, and we could easily see the sawtooth pattern of GC collecting memory. This means that either you've found a pathological case with a memory leak, or the graphs above are showing the memory consumption of the Lambda Soup tree being built up.

dmbaturin commented 2 years ago

Have you confirmed that it takes 10MB of space to parse?

You are right! Parsing as such (open_in Sys.argv.(1) |> Markup.channel |> Markup.parse_html |> Markup.signals) only takes about 4MB. But adding Soup.from_signals makes the memory consumption of the test program jump to ~600MB.

Do you have a ready example of 13MB input, or instructions for generating it?

I've attached the file in question. 13M.html.gz

Is valgrind showing process heap usage, in terms of how much heap space has been requested from the kernel?

Yes, to my understanding, that's heap space requested from the kernel.

Is it showing heap usage at each point, or allocations performed so far?

That memtrace plot shows allocations that are live at a certain point.

I assume the former, since the curve sometimes goes down. This would suggest a memory leak.

From a real soupault run I could see that the memory usage goes down by orders of magnitude when it finishes processing the 13M page and goes on to smaller pages. So it's not a leak as such (the memory isn't becoming permanently allocated).

aantron commented 2 years ago

open_in Sys.argv.(1) |> Markup.channel |> Markup.parse_html |> Markup.signals

Could you more fully confirm this by draining the stream using Markup.drain?
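
For concreteness, a minimal sketch of that measurement (assuming the same Sys.argv.(1) input as the earlier test program; this is the pipeline that gets spelled out later in the thread):

(* Run the parser to completion by consuming every signal, without building a DOM. *)
let () =
  open_in Sys.argv.(1)
  |> Markup.channel
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.drain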

From a real soupault run I could see that the memory usage goes down by orders of magnitude when it finishes processing the 13M page and goes on to smaller pages.

Is that after you drop the reference to the DOM produced for the first, large page? There might still be effectively a memory leak within a big object graph (i.e. a bunch of unnecessary memory somehow retained).

aantron commented 2 years ago

What does Obj.reachable_words return for the DOM you get?

aantron commented 2 years ago

Eyeballing the file and the current Lambda Soup element representation, I expect roughly a 10x blowup in size for each <span> tag in the DOM version relative to its HTML form. The text nodes have less of a blowup. This can be optimized, but it wouldn't explain a 50x blowup.
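
A rough, hypothetical way to check an estimate like this (not code from the thread): compare reachable words for an empty parse and for one containing the tag, so that whatever wrapper nodes the parser synthesizes cancel out.

(* Back-of-envelope check: heap bytes for one parsed <span>, versus its 13 bytes of HTML. *)
let words s = Obj.reachable_words (Obj.repr (Soup.parse s))

let () =
  let baseline = words "" in
  let with_span = words "<span></span>" in
  (* Each word is 8 bytes on a 64-bit system. *)
  Printf.printf "extra bytes per <span>: %d\n" ((with_span - baseline) * 8)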

dmbaturin commented 2 years ago

Could you more fully confirm this by draining the stream using Markup.drain?

Interesting! It looks like calling Markup.drain reclaims all that memory and solves the problem. I'm going to run more tests and get back to you.

aantron commented 2 years ago

I recommend also passing the ~report argument to parse_html and printing any errors:

~report:(fun location error -> Markup.Error.to_string ~location error |> print_endline)

just in case the issue is that Markup.ml is struggling in some error-recovery state from ill-formed HTML somewhere (unlikely, given you tried two different kinds of inputs).
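
In full, that might look like this (a sketch reusing the drain pipeline from above):

(* The ~report callback prints each parse error with its location. *)
let () =
  open_in Sys.argv.(1)
  |> Markup.channel
  |> Markup.parse_html
       ~report:(fun location error ->
         Markup.Error.to_string ~location error |> print_endline)
  |> Markup.signals
  |> Markup.drain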

dmbaturin commented 2 years ago

Wait, I completely misunderstood the purpose of Markup.drain. The point stands: Markup.channel |> Markup.drain doesn't consume 600MB of RAM while processing that file.

But changing drain to Soup.from_signals consumes all that RAM.

This is the latest version of the test program.

let _ =
  let chan = open_in Sys.argv.(1) |> Markup.channel in
  let signals = Markup.parse_html chan |> Markup.signals in
  let s = Soup.from_signals signals in
  (* Force a full major collection so only live memory is counted. *)
  let () = Gc.full_major () in
  Printf.printf "Reachable words: %d\n" (Obj.reachable_words @@ Obj.repr s)

The output is: Reachable words: 24033120.

aantron commented 2 years ago

The output is: Reachable words: 24033120.

That implies that the Lambda Soup DOM is about 190 MB in size (24,033,120 words × 8 bytes per word ≈ 192 MB on a 64-bit system). Are there close to 24M live words at this point?

aantron commented 2 years ago

I also just realized that for single-character text nodes, the size blowup is about 80x relative to their in-file representation, due to all the overhead: a one-byte character in the file becomes a boxed OCaml string plus a node record with its links into the tree. The 13M file has a lot of text nodes containing only /. This ratio might partially explain the 50x overall blowup.

aantron commented 2 years ago

Markup.channel |> Markup.drain

Were you literally draining the character stream returned by Markup.channel? Or did you actually do

Markup.channel |> Markup.parse_html |> Markup.signals |> Markup.drain

i.e. actually run the parser and drain the signal stream?

dmbaturin commented 2 years ago

Or did you actually do Markup.channel |> Markup.parse_html |> Markup.signals |> Markup.drain

Sorry, I indeed did it wrong the first time! I've just run Markup.channel |> Markup.parse_html |> Markup.signals |> Markup.drain, and that took about 350 MB of RAM. 350 MB for the parsing plus 190 MB for the DOM seems to add up to the original RAM usage figure (more or less).

Also, that page seems nearly impossible to open in a browser in practice, so I think the issue is mostly academic. Usable pages consume sane amounts of RAM, and pages that cause trouble don't seem usable anyway. But I'm still curious whether it could use less.

aantron commented 2 years ago

It's definitely possible to reduce the memory consumption of the DOM by at least a constant factor, like 1.5x or 2x. With more extensive and invasive optimizations, it should be possible to reduce usage much further.

I am concerned about the 350 MB for parsing. The parser should be able to "retire" most blocks from memory as it goes. I assume the GC simply isn't running during the parse; is that correct? If so, the figure seems plausible. But then I am concerned that on your user's VM the OCaml runtime didn't force a garbage collection (or did it, and was it only the DOM part of the memory usage, which isn't eligible for collection, that caused the OOM?).
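
One hypothetical way to answer the "is the GC running during the parse" question (not something tried in this thread) is to compare the runtime's collection counters around the parse:

(* Count major collections performed while parsing and draining the file. *)
let () =
  let before = (Gc.quick_stat ()).Gc.major_collections in
  open_in Sys.argv.(1)
  |> Markup.channel |> Markup.parse_html |> Markup.signals |> Markup.drain;
  let after = (Gc.quick_stat ()).Gc.major_collections in
  Printf.printf "major collections during parse: %d\n" (after - before)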

aantron commented 2 years ago

Also, that page seems nearly impossible to open in browser in practice

That's good to hear. I've generally found that Markup.ml (and Lambda Soup) have the same behavior as browsers, including error recovery, so it's nice to hear that huge-page behavior is, if not the same, at least similarly unacceptable in both :) But as noted above, it's likely possible to improve Markup.ml and/or Lambda Soup to do a bit better without much effort.

aantron commented 2 years ago

As of now, I think:

  • We could hash-cons the string values of text nodes and element names to reuse many blocks.

aantron commented 2 years ago

(also known as string pooling)
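
For illustration, a minimal sketch of what hash-consing the strings could look like (hypothetical code, not Lambda Soup's actual implementation):

(* Intern strings in a table so identical text-node and element-name strings
   share one heap block instead of each occurrence getting its own copy. *)
let pool : (string, string) Hashtbl.t = Hashtbl.create 4096

let intern s =
  match Hashtbl.find_opt pool s with
  | Some shared -> shared (* reuse the canonical copy *)
  | None -> Hashtbl.add pool s s; s (* first occurrence becomes canonical *)

A file full of text nodes containing only / would then allocate the one-character string once and share it across every such node.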

dmbaturin commented 2 years ago

I'm not clear if you mean or fully checked whether the other 2/3 that arose in Markup.ml during parsing were eligible for GC or actually triggered the user's OOM

I'll try to do more detailed tests to confirm that.

From your comments above, it appears to me that your user has abandoned this input

The user has abandoned it, I think. I also confirmed that GitHub refuses to display that file in the web UI and only offers an option to download it raw.

So, while it would definitely be cool to make Lambda Soup/Markup.ml handle very large inputs faster than browsers, it will not help anyone actually view absurdly large files on the web, and it may indeed be a wasted effort.

aantron commented 3 months ago

I'm going to close this issue, since we looked into it and it doesn't look like there is a need to solve it. We can return and do some optimizations if future demand arises. Thank you!