LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

Metadata fetching fails on YouTube #5208

Open sallyFoster opened 4 days ago

sallyFoster commented 4 days ago

Recently, YouTube URLs have stopped autofilling post titles. It would be great if you could bring back this feature.

Nutomic commented 4 days ago

You should fill in the issue template.

Anyway, I can reproduce the problem. It happens because of https://github.com/LemmyNet/lemmy/pull/4957, which fetches only a limited amount of data for previews in order to reduce CPU usage. The problem is that YouTube in particular includes a huge amount of JavaScript in the HTML file, so the OpenGraph tags appear well past the 64 kB limit. I was able to fetch YouTube metadata with bytes_to_fetch = 1000 * 1024, but that makes the optimization almost useless.

cc @phiresky
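
(For context, the collect_bytes_until_limit helper referenced in the patch below does roughly the following. This is a minimal sketch assuming reqwest with its stream feature plus the futures crate; Lemmy's actual implementation may differ.)

use futures::StreamExt;
use reqwest::Response;

/// Collect at most `limit` bytes from a response body, then stop reading.
/// This avoids downloading multi-megabyte pages just to read <head> metadata.
async fn collect_bytes_until_limit(
  response: Response,
  limit: usize,
) -> Result<Vec<u8>, reqwest::Error> {
  let mut bytes = Vec::with_capacity(limit);
  let mut stream = response.bytes_stream();
  while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    let remaining = limit - bytes.len();
    if chunk.len() >= remaining {
      // Take only what still fits under the limit, then stop reading.
      bytes.extend_from_slice(&chunk[..remaining]);
      break;
    }
    bytes.extend_from_slice(&chunk);
  }
  Ok(bytes)
}

Reading stops once the limit is reached, so at most roughly `limit` bytes are pulled from the socket no matter how large the page is.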

Nothing4You commented 4 days ago

I don't think that makes the optimization entirely useless, as it still avoids downloading multiple MB.

dessalines commented 4 days ago

I'd be fine with increasing that, although it's frustrating that YouTube decides to insert a megabyte of JavaScript before the usable OpenGraph tags.

phiresky commented 3 days ago

Checking a random YouTube video, the og:* tags sit at around 500 kB into the page. Ridiculous how much crap they insert before them. A lot of the optimization in #4957 comes from looking at the Content-Type header; the byte limit is more of a fallback, because this request is a DoS opportunity: a server can of course set the Content-Type header to HTML and still send data that makes an HTML parser take ages to parse the result. One megabyte is maybe fine, I guess, but it's not ideal. If we change the limit, I think we should at least log whenever the CPU time of this operation exceeds ~100ms, so we can tell when this is happening again (it was not a trivial thing to find).
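
(As a rough illustration of that first line of defense, here is a sketch of the Content-Type check; the helper name claims_html is made up, and how Lemmy actually structures this check is not shown here.)

use reqwest::{header::CONTENT_TYPE, Response};

/// Returns true when the server claims to return HTML. Only then is the
/// body fetched and parsed; the byte limit remains as a fallback for
/// servers that lie about their Content-Type.
fn claims_html(response: &Response) -> bool {
  response
    .headers()
    .get(CONTENT_TYPE)
    .and_then(|value| value.to_str().ok())
    // Covers values like "text/html; charset=utf-8".
    .map(|value| value.starts_with("text/html"))
    .unwrap_or(false)
}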

phiresky commented 3 days ago

possible patch:

diff --git a/crates/api_common/src/request.rs b/crates/api_common/src/request.rs
index 36010f760..f2c2aeecc 100644
--- a/crates/api_common/src/request.rs
+++ b/crates/api_common/src/request.rs
@@ -32,10 +32,11 @@ use reqwest::{
 };
 use reqwest_middleware::ClientWithMiddleware;
 use serde::{Deserialize, Serialize};
-use tracing::info;
+use tracing::{info, warn};
 use url::Url;
 use urlencoding::encode;
 use webpage::HTML;
+use std::time::Instant;

 pub fn client_builder(settings: &Settings) -> ClientBuilder {
   let user_agent = format!("Lemmy/{VERSION}; +{}", settings.get_protocol_and_hostname());
@@ -51,9 +52,8 @@ pub fn client_builder(settings: &Settings) -> ClientBuilder {
 #[tracing::instrument(skip_all)]
 pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
   info!("Fetching site metadata for url: {}", url);
-  // We only fetch the first 64kB of data in order to not waste bandwidth especially for large
-  // binary files
-  let bytes_to_fetch = 64 * 1024;
+  // We fetch the first 1MB of data, which is usually enough to reach the metadata tags
+  let bytes_to_fetch = 1024 * 1024;
   let response = context
     .client()
     .get(url.as_str())
@@ -91,9 +91,24 @@ pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResu

       // only take first bytes regardless of how many bytes the server returns
       let html_bytes = collect_bytes_until_limit(response, bytes_to_fetch).await?;
-      extract_opengraph_data(&html_bytes, url)
+      
+      // Track parsing time
+      let parse_start = Instant::now();
+      let result = extract_opengraph_data(&html_bytes, url)
         .map_err(|e| info!("{e}"))
-        .unwrap_or_default()
+        .unwrap_or_default();
+      let parse_duration = parse_start.elapsed();
+      
+      // Log warning if parsing took longer than 100ms
+      if parse_duration.as_millis() > 100 {
+        warn!(
+          "HTML parsing for {} took {}ms, which exceeds the 100ms threshold",
+          url,
+          parse_duration.as_millis()
+        );
+      }
+      
+      result
     }
   };
   Ok(LinkMetadata {

Nothing4You commented 3 days ago

If I'm skimming the Mastodon source correctly, it also uses a 1 MiB limit when fetching HTML for OpenGraph metadata:

https://github.com/mastodon/mastodon/blob/295ad6f19a016b3f16e1201ffcbb1b3ad6b455a2/app/services/fetch_link_card_service.rb#L54-L66

https://github.com/mastodon/mastodon/blob/295ad6f19a016b3f16e1201ffcbb1b3ad6b455a2/app/lib/request.rb#L213

Seems like this is a reasonable limit.

dessalines commented 3 days ago

@phiresky feel free to do a PR w/ that patch when you get a chance.