LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0
13.28k stars 884 forks source link

fix: Run extract_opengraph_data only on first 64kB of data and if Content-Type html #4957

Closed phiresky closed 3 months ago

phiresky commented 3 months ago

Currently, for metadata extraction the full data is fetched, converted lossily to a string and then parsed as HTML.

This is expensive: It takes 10-20s of 100% CPU to parse a 20MB response. 1+MB responses are common for image, gif, video URLs.

This changes that method, it tries to fetch only the first 64kB of data and then only runs the expensive HTML parsing if there is no null byte in the data. After this, the whole API request only takes ~0.1 s (including the external fetch)

Fixes #4956.

Nothing4You commented 3 months ago

what about validating the returned content type before reading any data?

phiresky commented 3 months ago

It seemed like the code was intentionally modified in 2021-2022 to not do that, maybe due to wrong server responses, there's a comment in the code linking https://github.com/LemmyNet/lemmy/issues/1964 . But yes, that reason might not be valid or no longer valid or not the same.

Nothing4You commented 3 months ago

the html parser seems to be doing pretty good even with heavily truncated data, so it's probably fine to just do it that way. if we already truncate, do we still need to worry about a binary check though? the library seems to be doing fine with null bytes in the input as well:

Evaluating "<html><head><title>Hello</title><hea"
HTML parsed ok: true
HTML title: Some("Hello")
Evaluating "<html><head><title>Hell\0o World"
HTML parsed ok: true
HTML title: Some("Hell�o World")
phiresky commented 3 months ago

I've looked at the linked issue(s) again and don't see a reason to not use the content-type response. I've updated the PR to use the content-type header instead of checking for null bytes.

dessalines commented 3 months ago

This is probably due to me stupidly removing the doctype check in https://github.com/LemmyNet/lemmy/commit/55f84dd38ade98b864f0d99371826be0a00a0712#diff-0cd13a4a101c05a54d9ab789fa06c9d48ab61c0e1f5c9b9a178440e9ba3a5ef5L145 . I think adding that back in would be enough.

dessalines commented 3 months ago

CI is messing up, I'll try to see what's going on.

ticoombs commented 3 months ago

image Might be too early to say, but this has reduced my CPU by at-least half. Thanks