Closed: phiresky closed this pull request 3 months ago
What about validating the returned content type before reading any data?
It seems the code was intentionally modified in 2021-2022 to not do that, possibly because of incorrect server responses; there's a comment in the code linking to https://github.com/LemmyNet/lemmy/issues/1964. But yes, that reason might not have been valid, might no longer be valid, or might not apply here.
The HTML parser seems to do pretty well even with heavily truncated data, so it's probably fine to just do it that way. If we already truncate, do we still need to worry about a binary check, though? The library also seems to handle null bytes in the input fine:
Evaluating "<html><head><title>Hello</title><hea"
HTML parsed ok: true
HTML title: Some("Hello")
Evaluating "<html><head><title>Hell\0o World"
HTML parsed ok: true
HTML title: Some("Hell�o World")
I've looked at the linked issue(s) again and don't see a reason not to use the content-type of the response. I've updated the PR to use the content-type header instead of checking for null bytes.
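A hedged sketch of what such a content-type gate could look like, assuming a reqwest response (the function name and the exact set of accepted types are illustrative, not the PR's actual code):

```rust
use reqwest::header::CONTENT_TYPE;

// Only run the HTML metadata parser when the response declares an
// HTML/XHTML content type. Illustrative only; the real check may also
// need to handle charset parameters or other MIME types.
fn looks_like_html(response: &reqwest::Response) -> bool {
    response
        .headers()
        .get(CONTENT_TYPE)
        .and_then(|value| value.to_str().ok())
        .map(|ct| {
            let ct = ct.to_ascii_lowercase();
            ct.starts_with("text/html") || ct.starts_with("application/xhtml+xml")
        })
        .unwrap_or(false)
}
```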
This is probably due to me stupidly removing the doctype check in https://github.com/LemmyNet/lemmy/commit/55f84dd38ade98b864f0d99371826be0a00a0712#diff-0cd13a4a101c05a54d9ab789fa06c9d48ab61c0e1f5c9b9a178440e9ba3a5ef5L145. I think adding that back in would be enough.
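For illustration, a minimal sketch of the kind of doctype check being referred to (hypothetical; the code removed in the linked commit may have looked different):

```rust
// Treat the body as HTML only if it starts with an HTML doctype or an
// <html> tag. Purely illustrative of the idea of a doctype check.
fn has_html_prelude(body: &str) -> bool {
    let start = body.trim_start().to_ascii_lowercase();
    start.starts_with("<!doctype html") || start.starts_with("<html")
}
```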
CI is messing up; I'll try to see what's going on.
Might be too early to say, but this has reduced my CPU usage by at least half. Thanks!
Currently, metadata extraction fetches the full response body, converts it lossily to a string, and then parses it as HTML.
This is expensive: parsing a 20 MB response takes 10-20 s at 100% CPU, and responses of 1 MB or more are common for image, GIF, and video URLs.
This changes that method: it fetches only the first 64 kB of data and then runs the expensive HTML parsing only if that data contains no null byte. After this change, the whole API request takes only ~0.1 s (including the external fetch).
Fixes #4956.
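A minimal sketch of the truncated-fetch approach described above, assuming reqwest with its streaming API enabled and the `scraper` crate for parsing (function and constant names are illustrative, not the PR's actual code):

```rust
use futures::StreamExt;

// Read at most the first 64 kB of the response body, then decide whether
// to run the expensive HTML parsing. Illustrative sketch only.
const MAX_METADATA_BYTES: usize = 64 * 1024;

async fn fetch_prefix(url: &str) -> Result<Vec<u8>, reqwest::Error> {
    let response = reqwest::get(url).await?;
    let mut stream = response.bytes_stream();
    let mut buf: Vec<u8> = Vec::with_capacity(MAX_METADATA_BYTES);
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        buf.extend_from_slice(&chunk);
        if buf.len() >= MAX_METADATA_BYTES {
            buf.truncate(MAX_METADATA_BYTES);
            break;
        }
    }
    Ok(buf)
}

async fn extract_title(url: &str) -> Result<Option<String>, reqwest::Error> {
    let prefix = fetch_prefix(url).await?;
    // Skip the expensive HTML parse entirely if the prefix looks binary.
    if prefix.contains(&0) {
        return Ok(None);
    }
    let text = String::from_utf8_lossy(&prefix);
    let doc = scraper::Html::parse_document(&text);
    let selector = scraper::Selector::parse("title").expect("valid selector");
    Ok(doc
        .select(&selector)
        .next()
        .map(|el| el.text().collect::<String>()))
}
```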