chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit
https://resiliparse.chatnoir.eu
Apache License 2.0

Steady memory growth while working on web pages #40

Closed: prnake closed this issue 3 months ago

prnake commented 6 months ago

I'm trying to use Resiliparse to process Common Crawl data, but I find that single-process memory usage grows slowly and processing each page gets slower over time. I can't pinpoint the source of the problem with memory analysis tools; it feels like some sort of memory leak.

from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType
from fastwarc.warc import is_http

from resiliparse.parse.html import HTMLTree

import os
import psutil
import time

file = open("../CC-MAIN-20231129041834-20231129071834-00864.warc.gz", "rb")

archive_iterator = FastWarcArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
idx = 0
s = time.time()
process = psutil.Process(os.getpid())
for record in archive_iterator:
    raw = record.reader.read()
    try:
        str(HTMLTree.parse(raw.decode("utf-8")))
    except Exception:
        pass

    idx += 1
    if idx % 1000 == 0:
        print(process.memory_info().rss / 1024**2, time.time() - s)
        s = time.time()

And here is the output:

91.0625 0.5237786769866943
108.921875 0.5350120067596436
132.015625 0.6991021633148193
132.828125 0.6604471206665039
144.25 0.6309971809387207
149.703125 0.9664597511291504
163.6875 1.0105879306793213
168.65625 0.9966611862182617
175.578125 1.1011531352996826
219.890625 0.9911088943481445
233.671875 1.1710970401763916
238.421875 1.0304219722747803
238.453125 0.9967031478881836
242.5625 1.016160011291504
246.40625 1.0688650608062744
246.40625 1.0952439308166504
246.40625 1.100167989730835
248.84375 1.077517032623291
248.84375 1.0870559215545654
248.84375 1.0127251148223877
248.84375 1.1437797546386719
253.125 1.0540661811828613
255.75 1.1424810886383057
255.75 1.0953960418701172
261.203125 1.1094980239868164
262.96875 1.293619155883789
265.6875 1.2223970890045166
265.6875 1.2639102935791016
265.6875 1.2537767887115479
265.6875 1.1750257015228271
266.203125 1.2749638557434082
266.203125 1.2159233093261719
272.484375 1.2846689224243164
275.515625 1.2536969184875488
275.515625 1.1752068996429443
275.515625 1.156343936920166
275.515625 1.249042272567749
276.765625 1.1668882369995117
phoerious commented 5 months ago

Is it the WARC reader or the HTML parser?

prnake commented 5 months ago

It's caused by the HTML parser.
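
A quick way to double-check this is to take the WARC reader out of the equation entirely and parse one fixed document in a loop while watching the RSS (a sketch; sample.html is a placeholder for any locally saved page):

import os
import psutil

from resiliparse.parse.html import HTMLTree

# Placeholder input: any HTML page saved to disk
html = open('sample.html', 'rb').read().decode('utf-8', errors='replace')

process = psutil.Process(os.getpid())
for i in range(100_000):
    # Parse the same document over and over; only the parser can grow memory here
    str(HTMLTree.parse(html))
    if (i + 1) % 10_000 == 0:
        print(f'{process.memory_info().rss / 1024 ** 2:.1f} MB')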

phoerious commented 5 months ago

FYI: I haven't forgotten you, but I will probably not have time for this before next week.

phoerious commented 5 months ago

I'm trying to get to the bottom of this and I can reproduce the behaviour, but I suspect the problem is not in Resiliparse, at least not in the HTML parser.

Memory usage does grow and the loop times go up as well, but not in a strictly monotonic fashion and it does go back to normal after the loop. You can see this with a modified version of your test code:

import os
import psutil
import time

from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType
from fastwarc.warc import is_http
from resiliparse.parse.html import HTMLTree

for r in range(3):
    print(f'-------\nRound {r + 1}\n-------')
    file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')

    archive_iterator = FastWarcArchiveIterator(
        file,
        record_types=WarcRecordType.response,
        parse_http=True,
        func_filter=is_http,
    )
    s = time.monotonic()
    process = psutil.Process(os.getpid())
    for idx, record in enumerate(archive_iterator):
        raw = record.reader.read()
        try:
            str(HTMLTree.parse(raw.decode('utf-8')))
        except Exception:
            pass

        if (idx + 1) % 5000 == 0:
            print(f'In loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB, {round(time.monotonic() - s, 3)}s')
            s = time.monotonic()

    print(f'After loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB')

This prints stats only every 5k iterations and after the loop. The whole thing is run three times. These are the results:

-------
Round 1
-------
In loop 38.895MB, 3.704s
In loop 51.941MB, 6.022s
In loop 48.191MB, 5.294s
In loop 51.715MB, 5.336s
In loop 51.039MB, 5.36s
In loop 52.727MB, 6.726s
In loop 59.426MB, 6.596s
After loop 59.426MB
-------
Round 2
-------
In loop 59.426MB, 3.462s
In loop 56.117MB, 6.315s
In loop 56.305MB, 5.681s
In loop 55.559MB, 5.651s
In loop 59.508MB, 5.465s
In loop 57.93MB, 6.182s
In loop 66.922MB, 6.151s
After loop 66.922MB
-------
Round 3
-------
In loop 66.922MB, 3.314s
In loop 66.922MB, 5.8s
In loop 57.469MB, 5.459s
In loop 57.453MB, 5.344s
In loop 61.523MB, 5.485s
In loop 58.355MB, 6.403s
In loop 65.293MB, 6.268s
After loop 65.293MB

When I comment out the HTML parsing part, I get similar results, just with smaller numbers:

-------
Round 1
-------
In loop 29.07MB, 1.395s
In loop 28.703MB, 2.032s
In loop 28.996MB, 2.069s
In loop 26.387MB, 2.067s
In loop 28.926MB, 2.194s
In loop 28.824MB, 2.687s
In loop 28.512MB, 2.722s
After loop 28.215MB
-------
Round 2
-------
In loop 27.094MB, 1.339s
In loop 27.316MB, 2.067s
In loop 28.434MB, 2.096s
In loop 26.07MB, 2.1s
In loop 28.52MB, 2.174s
In loop 28.613MB, 2.717s
In loop 28.441MB, 2.727s
After loop 28.734MB
-------
Round 3
-------
In loop 26.672MB, 1.383s
In loop 28.172MB, 2.092s
In loop 28.465MB, 2.11s
In loop 26.027MB, 2.087s
In loop 28.547MB, 2.259s
In loop 28.727MB, 2.786s
In loop 28.379MB, 2.74s
After loop 28.105MB

The problem could still be in the WARC iterator, but it doesn't matter whether I do record.reader.read() or not.
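
For reference, the "HTML parsing commented out" variant mentioned above boils down to the following loop body (a sketch; the surrounding setup is identical to the script above):

for idx, record in enumerate(archive_iterator):
    raw = record.reader.read()  # skipping this call makes no difference either
    # str(HTMLTree.parse(raw.decode('utf-8')))  # HTML parsing disabled

    if (idx + 1) % 5000 == 0:
        print(f'In loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB, {round(time.monotonic() - s, 3)}s')
        s = time.monotonic()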

phoerious commented 5 months ago

I believe this comes down to two things:

  1. Slight inefficiencies in the buffer (re-)allocation in BufferedReader.read() for larger documents
  2. Measurement biases due to unequal document sizes

I've pushed some changes to address the first issue (see https://github.com/chatnoir-eu/chatnoir-resiliparse/commit/070c5bea7409348f3f74120031f6eb1b183bf6f6, should be up as version 0.14.7 soon).
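
Purely for intuition on the first point (illustrative only, not the actual BufferedReader code): growing a result buffer by repeated concatenation can reallocate and copy everything read so far on each chunk, whereas collecting the chunks and joining them once keeps the total copy cost roughly linear:

def read_all_naive(stream, chunk_size=8192):
    # Worst case: each += allocates a new buffer and copies all bytes read so far
    data = b''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data += chunk
    return data


def read_all_joined(stream, chunk_size=8192):
    # Collect chunks and concatenate once at the end: a single allocation and copy
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks)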

The second issue seems to stem from the peculiarity that later documents in the CC WARC are larger for some reason. Here's the updated test case (with proper encoding detection):

import os
import psutil
import time

from fastwarc.warc import ArchiveIterator, is_http, WarcRecordType
from resiliparse.parse.html import HTMLTree
from resiliparse.parse.encoding import detect_encoding

file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')

archive_iterator = ArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
s = time.monotonic()
doc_size = 0
process = psutil.Process(os.getpid())
for idx, record in enumerate(archive_iterator):
    raw = record.reader.read()
    doc_size += len(raw)
    str(HTMLTree.parse_from_bytes(raw, detect_encoding(raw)))

    if (idx + 1) % 5000 == 0:
        t = time.monotonic() - s
        print(f'- Mem: {process.memory_info().rss / (1024 ** 2):.3f}MB, Time: {t:.3f}s')
        print(f'  Avg doc size: {doc_size / 5000:,.0f} bytes ({doc_size / t:,.0f} bytes/s)')
        doc_size = 0
        s = time.monotonic()

and this is the output I get:

- Mem: 53.504MB, Time: 12.782s
  Avg doc size: 76,857 bytes (30,064,370 bytes/s)
- Mem: 62.117MB, Time: 17.384s
  Avg doc size: 132,829 bytes (38,204,150 bytes/s)
- Mem: 62.492MB, Time: 16.595s
  Avg doc size: 137,767 bytes (41,508,993 bytes/s)
- Mem: 62.867MB, Time: 16.600s
  Avg doc size: 136,393 bytes (41,083,262 bytes/s)
- Mem: 67.742MB, Time: 18.536s
  Avg doc size: 145,028 bytes (39,121,232 bytes/s)
- Mem: 71.680MB, Time: 21.965s
  Avg doc size: 179,860 bytes (40,942,038 bytes/s)
- Mem: 71.680MB, Time: 21.583s
  Avg doc size: 179,028 bytes (41,473,469 bytes/s)

The rate per byte actually processed seems quite constant to me.
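
If anyone wants to rule out retained allocations on the Python side, tracemalloc can be wrapped around a chunk of iterations (a sketch; tracemalloc only sees allocations made through Python's allocator, so growth inside native extension code would not show up here):

import tracemalloc

def diff_python_allocations(run_iteration, iterations=5000):
    # run_iteration: a callable executing one iteration of the processing loop above
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    for _ in range(iterations):
        run_iteration()
    snapshot = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Print the ten call sites whose Python-level allocations grew the most
    for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
        print(stat)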