sallyFoster opened this issue 4 days ago
You should fill in the issue template.
Anyway, I can reproduce the problem. It happens because of https://github.com/LemmyNet/lemmy/pull/4957, which fetches only a limited amount of data for previews in order to reduce CPU usage. The problem is that YouTube in particular includes a huge amount of JavaScript in the HTML file, so the OpenGraph tags appear much later than 64 kB into the page. I was able to fetch YouTube metadata with bytes_to_fetch = 1000 * 1024,
but that makes the optimization almost useless.
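For anyone who wants to check this themselves, here is a rough standalone sketch (not Lemmy code; the binary name, CLI handling, and dependency setup are just illustrative) that fetches a page and reports how far into the document the first og: meta tag appears:

```rust
// Standalone check: fetch a page and report the byte offset of the first
// og: meta tag, to see whether it would fit inside a 64kB preview fetch.
// Assumes reqwest with the "blocking" feature enabled in Cargo.toml.
fn main() -> Result<(), Box<dyn std::error::Error>> {
  let url = std::env::args().nth(1).expect("usage: og-offset <url>");
  let body = reqwest::blocking::get(url.as_str())?.text()?;
  // Crude heuristic: look for the first property="og:..." attribute.
  match body.find("property=\"og:") {
    Some(offset) => println!("first og: tag at ~{} kB ({offset} bytes)", offset / 1024),
    None => println!("no og: tags found in {} bytes", body.len()),
  }
  Ok(())
}
```

Note that YouTube serves different markup depending on user agent and consent handling, so the exact offset will vary.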
cc @phiresky
I don't think that makes the optimization entirely useless, as it still avoids downloading multiple MB.
I'd be fine with increasing that, although it's frustrating that YouTube decides to insert a megabyte of JavaScript before the usable OpenGraph tags.
Checking a random YouTube video, the og:* tags appear at around 500 kB. Ridiculous how much they insert before them. A lot of the optimization in #4957 comes from looking at the Content-Type header; the byte limit is more of a fallback, because this request is a DoS opportunity: a malicious server can send a misleading Content-Type header and still return data that takes an HTML parser ages to process. One megabyte is maybe fine, but it's not ideal. If we change the limit, I think we should at least log a warning when the CPU time of this operation exceeds ~100ms or so, so we can tell when this is happening again (it was not a trivial thing to find).
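For context, the Content-Type fast path is roughly this idea (a sketch only, with a made-up function name, not the actual code from #4957): if the header already says the body is not HTML, the OpenGraph parser never needs to run, and the byte limit only matters for responses that claim to be HTML.

```rust
// Sketch of the idea: use the Content-Type header as a cheap first filter,
// but keep the byte limit as a fallback because the header can lie.
use reqwest::header::CONTENT_TYPE;

fn response_claims_html(response: &reqwest::Response) -> bool {
  response
    .headers()
    .get(CONTENT_TYPE)
    .and_then(|value| value.to_str().ok())
    .map(|ct| ct.starts_with("text/html") || ct.starts_with("application/xhtml+xml"))
    .unwrap_or(false)
}
```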
possible patch:
diff --git a/crates/api_common/src/request.rs b/crates/api_common/src/request.rs
index 36010f760..f2c2aeecc 100644
--- a/crates/api_common/src/request.rs
+++ b/crates/api_common/src/request.rs
@@ -32,10 +32,11 @@ use reqwest::{
};
use reqwest_middleware::ClientWithMiddleware;
use serde::{Deserialize, Serialize};
-use tracing::info;
+use tracing::{info, warn};
use url::Url;
use urlencoding::encode;
use webpage::HTML;
+use std::time::Instant;
pub fn client_builder(settings: &Settings) -> ClientBuilder {
let user_agent = format!("Lemmy/{VERSION}; +{}", settings.get_protocol_and_hostname());
@@ -51,9 +52,8 @@ pub fn client_builder(settings: &Settings) -> ClientBuilder {
#[tracing::instrument(skip_all)]
pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
info!("Fetching site metadata for url: {}", url);
- // We only fetch the first 64kB of data in order to not waste bandwidth especially for large
- // binary files
- let bytes_to_fetch = 64 * 1024;
+ // Fetch the first 1MB of data so sites that put their opengraph tags late in the page (like Youtube) still work
+ let bytes_to_fetch = 1024 * 1024;
let response = context
.client()
.get(url.as_str())
@@ -91,9 +91,24 @@ pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
// only take first bytes regardless of how many bytes the server returns
let html_bytes = collect_bytes_until_limit(response, bytes_to_fetch).await?;
- extract_opengraph_data(&html_bytes, url)
+
+ // Track parsing time
+ let parse_start = Instant::now();
+ let result = extract_opengraph_data(&html_bytes, url)
.map_err(|e| info!("{e}"))
- .unwrap_or_default()
+ .unwrap_or_default();
+ let parse_duration = parse_start.elapsed();
+
+ // Log warning if parsing took longer than 100ms
+ if parse_duration.as_millis() > 100 {
+ warn!(
+ "HTML parsing for {} took {}ms, which exceeds the 100ms threshold",
+ url,
+ parse_duration.as_millis()
+ );
+ }
+
+ result
}
};
Ok(LinkMetadata {
If I'm skimming over the Mastodon source correctly, it also uses a 1 MiB limit when fetching HTML for OpenGraph metadata.
Seems like this is a reasonable limit.
@phiresky feel free to do a PR w/ that patch when you get a chance.
Recently, YouTube URLs have stopped autofilling post titles. It would be great if you could bring this feature back.