Is it the WARC reader or the HTML parser?
It's caused by the HTML parser.
FYI: I haven't forgotten you, but I will probably not have time for this before next week.
I'm trying to get to the bottom of this and I can reproduce the behaviour, but I suspect the problem is not in Resiliparse, at least not in the HTML parser.
Memory usage does grow and the loop times go up as well, but not in a strictly monotonic fashion and it does go back to normal after the loop. You can see this with a modified version of your test code:
import os
import psutil
import time
from fastwarc.warc import ArchiveIterator as FastWarcArchiveIterator
from fastwarc.warc import WarcRecordType
from fastwarc.warc import is_http
from resiliparse.parse.html import HTMLTree

for r in range(3):
    print(f'-------\nRound {r + 1}\n-------')
    file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')
    archive_iterator = FastWarcArchiveIterator(
        file,
        record_types=WarcRecordType.response,
        parse_http=True,
        func_filter=is_http,
    )
    s = time.monotonic()
    process = psutil.Process(os.getpid())
    for idx, record in enumerate(archive_iterator):
        raw = record.reader.read()
        try:
            # Parse the document and serialise it back to a string.
            str(HTMLTree.parse(raw.decode('utf-8')))
        except Exception:
            pass  # skip documents that don't decode as UTF-8
        if (idx + 1) % 5000 == 0:
            # Report RSS and elapsed time every 5k records, then reset the timer.
            print(f'In loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB, {round(time.monotonic() - s, 3)}s')
            s = time.monotonic()
    print(f'After loop {round(process.memory_info().rss / (1024 ** 2), 3)}MB')
    file.close()
This prints stats only every 5k iterations and once after the loop. The whole thing is run three times. These are the results:
-------
Round 1
-------
In loop 38.895MB, 3.704s
In loop 51.941MB, 6.022s
In loop 48.191MB, 5.294s
In loop 51.715MB, 5.336s
In loop 51.039MB, 5.36s
In loop 52.727MB, 6.726s
In loop 59.426MB, 6.596s
After loop 59.426MB
-------
Round 2
-------
In loop 59.426MB, 3.462s
In loop 56.117MB, 6.315s
In loop 56.305MB, 5.681s
In loop 55.559MB, 5.651s
In loop 59.508MB, 5.465s
In loop 57.93MB, 6.182s
In loop 66.922MB, 6.151s
After loop 66.922MB
-------
Round 3
-------
In loop 66.922MB, 3.314s
In loop 66.922MB, 5.8s
In loop 57.469MB, 5.459s
In loop 57.453MB, 5.344s
In loop 61.523MB, 5.485s
In loop 58.355MB, 6.403s
In loop 65.293MB, 6.268s
After loop 65.293MB
When I comment out the HTML parsing part, I get similar results, just with smaller numbers:
-------
Round 1
-------
In loop 29.07MB, 1.395s
In loop 28.703MB, 2.032s
In loop 28.996MB, 2.069s
In loop 26.387MB, 2.067s
In loop 28.926MB, 2.194s
In loop 28.824MB, 2.687s
In loop 28.512MB, 2.722s
After loop 28.215MB
-------
Round 2
-------
In loop 27.094MB, 1.339s
In loop 27.316MB, 2.067s
In loop 28.434MB, 2.096s
In loop 26.07MB, 2.1s
In loop 28.52MB, 2.174s
In loop 28.613MB, 2.717s
In loop 28.441MB, 2.727s
After loop 28.734MB
-------
Round 3
-------
In loop 26.672MB, 1.383s
In loop 28.172MB, 2.092s
In loop 28.465MB, 2.11s
In loop 26.027MB, 2.087s
In loop 28.547MB, 2.259s
In loop 28.727MB, 2.786s
In loop 28.379MB, 2.74s
After loop 28.105MB
The problem could still be in the WARC iterator, but it makes no difference whether I call record.reader.read() or not; a sketch of both ablations is below.
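For reference, these are the two ablations I ran (a minimal sketch, one run per variant; FastWARC seeks past any unread record body on its own when the iterator advances, so dropping the read() call is safe):

import time
from fastwarc.warc import ArchiveIterator, WarcRecordType, is_http

def run(path, do_read):
    # Iterate all HTTP response records, optionally reading the payload.
    with open(path, 'rb') as f:
        it = ArchiveIterator(f, record_types=WarcRecordType.response,
                             parse_http=True, func_filter=is_http)
        s = time.monotonic()
        for record in it:
            if do_read:
                record.reader.read()
        return time.monotonic() - s

warc = 'CC-MAIN-20231005012006-20231005042006-00899.warc.gz'
print('with read():   ', round(run(warc, True), 3), 's')
print('without read():', round(run(warc, False), 3), 's')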
I believe this comes down to two things:

1. the behaviour of BufferedReader.read()
2. slower processing of larger documents

I've pushed some changes to address the first issue (see https://github.com/chatnoir-eu/chatnoir-resiliparse/commit/070c5bea7409348f3f74120031f6eb1b183bf6f6, should be up as version 0.14.7 soon).
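Until the new version is out, one possible stopgap for the first point is to read the payload in fixed-size chunks rather than in one call (a sketch, not what the commit does; BufferedReader.read() takes an optional size argument, and the 64 KiB chunk size is arbitrary):

# Drop-in for `raw = record.reader.read()`: read the payload in
# fixed-size chunks so no single call has to grow one large buffer.
chunks = []
while chunk := record.reader.read(65536):
    chunks.append(chunk)
raw = b''.join(chunks)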
The second issue seems to stem from the peculiarity that later documents in the CC WARC file are larger for some reason. Here's the updated test case (with proper encoding detection):
import os
import psutil
import time
from fastwarc.warc import ArchiveIterator, is_http, WarcRecordType
from resiliparse.parse.html import HTMLTree
from resiliparse.parse.encoding import detect_encoding

file = open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb')
archive_iterator = ArchiveIterator(
    file,
    record_types=WarcRecordType.response,
    parse_http=True,
    func_filter=is_http,
)
s = time.monotonic()
doc_size = 0
process = psutil.Process(os.getpid())
for idx, record in enumerate(archive_iterator):
    raw = record.reader.read()
    doc_size += len(raw)
    # Detect the encoding instead of assuming UTF-8.
    str(HTMLTree.parse_from_bytes(raw, detect_encoding(raw)))
    if (idx + 1) % 5000 == 0:
        t = time.monotonic() - s
        print(f'- Mem: {process.memory_info().rss / (1024 ** 2):.3f}MB, Time: {t:.3f}s')
        print(f'  Avg doc size: {doc_size / 5000:,.0f} bytes ({doc_size / t:,.0f} bytes/s)')
        doc_size = 0
        s = time.monotonic()
and this is the output I get:
- Mem: 53.504MB, Time: 12.782s
Avg doc size: 76,857 bytes (30,064,370 bytes/s)
- Mem: 62.117MB, Time: 17.384s
Avg doc size: 132,829 bytes (38,204,150 bytes/s)
- Mem: 62.492MB, Time: 16.595s
Avg doc size: 137,767 bytes (41,508,993 bytes/s)
- Mem: 62.867MB, Time: 16.600s
Avg doc size: 136,393 bytes (41,083,262 bytes/s)
- Mem: 67.742MB, Time: 18.536s
Avg doc size: 145,028 bytes (39,121,232 bytes/s)
- Mem: 71.680MB, Time: 21.965s
Avg doc size: 179,860 bytes (40,942,038 bytes/s)
- Mem: 71.680MB, Time: 21.583s
Avg doc size: 179,028 bytes (41,473,469 bytes/s)
The rate per actual processed byte seems quite constant to me.
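To double-check that document size, not parser state, drives this pattern, the payload sizes per 5k-record window can be summarised without any HTML parsing at all (a minimal sketch over the same file):

import statistics
from fastwarc.warc import ArchiveIterator, WarcRecordType, is_http

sizes = []
with open('CC-MAIN-20231005012006-20231005042006-00899.warc.gz', 'rb') as f:
    it = ArchiveIterator(f, record_types=WarcRecordType.response,
                         parse_http=True, func_filter=is_http)
    for idx, record in enumerate(it):
        sizes.append(len(record.reader.read()))
        if (idx + 1) % 5000 == 0:
            # Summarise payload sizes for the last 5k records.
            print(f'Window {(idx + 1) // 5000}: '
                  f'mean {statistics.mean(sizes):,.0f} B, '
                  f'median {statistics.median(sizes):,.0f} B, '
                  f'max {max(sizes):,.0f} B')
            sizes.clear()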
I'm trying to use Resiliparse to process CC data, but single-process memory usage grows slowly and handling web pages gets slower over time. I can't find the source of the problem with memory analysis tools; it feels like some sort of memory leak.
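(A note on memory analysis tools here: tracemalloc only sees allocations made through the Python allocator, so if process RSS climbs while the traced total stays flat, the growth is in native code such as the HTML parser. A minimal, self-contained sketch:)

import tracemalloc
from resiliparse.parse.html import HTMLTree

tracemalloc.start()
for _ in range(10000):
    # Parse a small document repeatedly. Watch process RSS separately;
    # tracemalloc below only reports Python-level allocations.
    HTMLTree.parse('<html><body><p>hello</p></body></html>')
current, peak = tracemalloc.get_traced_memory()
print(f'Python-level: current {current / 1024 ** 2:.2f} MB, peak {peak / 1024 ** 2:.2f} MB')
tracemalloc.stop()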