Phuks-co / throat

Open Source link aggregator and discussion platform powering Phuks
https://phuks.co
MIT License

Truncate HTML for thumbnail parsing after last meta tag #267

Closed happy-river closed 3 years ago

happy-river commented 3 years ago

YouTube sometimes serves pages with the meta tags in the body of the HTML instead of the head. Since we only look in the head for the meta tag with the thumbnail, we miss it and use the favicon instead. We search only part of the HTML document for performance reasons: large web pages can take upwards of 100 ms to parse (which, being CPU-intensive, blocks the gevent loop), while parsing just the head can be done in less than 10 ms.

Change the thumbnail search to cover the HTML up to the end of the last meta tag, instead of the end of the <head> section. This makes YouTube thumbnails work and, since YouTube puts the meta tags at the start of the <body>, preserves the performance gain of doing a partial search.
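
For illustration, a minimal sketch of the truncation idea, not the code from this PR. The function name `truncate_after_last_meta` and the regex are hypothetical, and the fallback to `</head>` when no meta tag is found is my assumption. The regex is deliberately simple and would cut short on a meta tag whose attribute value contains a literal `>`.

```python
import re

# Matches one complete <meta ...> tag, case-insensitively (hypothetical helper).
_META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
_HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def truncate_after_last_meta(html: str) -> str:
    """Return html cut off just after its last <meta> tag.

    Assumed fallbacks: with no meta tag, cut at the end of </head>;
    with neither, return the document unchanged.
    """
    end = 0
    for match in _META_TAG.finditer(html):
        end = match.end()  # remember where the latest meta tag ends
    if end:
        return html[:end]
    head = _HEAD_CLOSE.search(html)
    if head:
        return html[:head.end()]
    return html
```

The truncated string can then be handed to the HTML parser, so a page with meta tags early in the `<body>` is still parsed only up to that point.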

bs4 provides a class called SoupStrainer to parse only the tags of interest, but I didn't find a noticeable performance improvement when using it.
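
For reference, a rough sketch of what the SoupStrainer approach looks like. The function name `find_og_image` is hypothetical, and looking up the `og:image` property is an assumption about which meta tag carries the thumbnail; the built-in `html.parser` backend is used here only to keep the example dependency-free beyond bs4.

```python
from bs4 import BeautifulSoup, SoupStrainer


def find_og_image(html):
    """Return the og:image URL from html, or None if it is absent."""
    # Only <meta> tags are turned into objects; everything else is skipped.
    only_meta = SoupStrainer("meta")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_meta)
    # Assumption: the thumbnail lives in <meta property="og:image">.
    tag = soup.find("meta", attrs={"property": "og:image"})
    return tag.get("content") if tag else None
```

The parser still has to tokenize the whole document, which may be why the strainer alone gave no noticeable speedup here.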

I also added a log message to record the time spent doing HTML parsing.
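
A sketch of how such a timing log could look, assuming the standard `logging` module; the wrapper name `timed_parse`, the log level, and the message format are hypothetical and not taken from the PR.

```python
import logging
import time

logger = logging.getLogger(__name__)


def timed_parse(parse, html):
    """Call parse(html) and log how long the call took, in milliseconds."""
    start = time.perf_counter()
    try:
        return parse(html)
    finally:
        # Logged even if parsing raises, so slow failures are visible too.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        logger.debug(
            "HTML parsing took %.1f ms for %d characters", elapsed_ms, len(html)
        )
```

Any callable that takes the HTML string can be passed as `parse`, for example a function that builds the soup and extracts the thumbnail URL.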