Phuks-co / throat

Open Source link aggregator and discussion platform powering Phuks
https://phuks.co
MIT License

Check whether html title is present before parsing in grab_title #135

Closed. happy-river closed this 4 years ago

happy-river commented 4 years ago

Do some cheap tests to determine whether a fetched page contains HTML and has a title before parsing it to look for the title. With this change, grab_title concludes that a 4MB jpg doesn't have a title much more quickly than before. I also cut the maximum page size that will be downloaded to check for a title from ~25M down to ~2M. I'm not sure what the right number is for that: 500K is too small and 25M seems way too large.
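
For readers following along, the kind of cheap pre-check described above could look roughly like the sketch below. It is not the code from this PR: the helper name `might_have_title`, its parameters, and the exact tests (a Content-Type check plus a byte search for a `<title` tag) are assumptions made for illustration.

```python
import re

# Matches an opening <title> tag in raw bytes, case-insensitively.
_TITLE_RE = re.compile(rb"<title[\s>]", re.IGNORECASE)


def might_have_title(content_type: str, body: bytes) -> bool:
    """Hypothetical helper: cheap checks to run before any HTML parsing."""
    # A non-HTML Content-Type (e.g. image/jpeg) can be rejected immediately.
    if content_type and "html" not in content_type.lower():
        return False
    # Otherwise require a literal <title tag somewhere in the raw bytes.
    return _TITLE_RE.search(body) is not None
```

Both checks are linear scans at worst, so a large binary file is rejected without ever building a parse tree.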

happy-river commented 4 years ago

@ferocious-ferret suggested I change this to only download the first 10k, and look in that for the title. I'll see what I can do about that today.

Polsaker commented 4 years ago

I'd leave it at around 100-500k. I've seen many pages that dump huge inline CSS/JS blocks in the head before the title.

happy-river commented 4 years ago

I changed it to read only the first 200K and to truncate the HTML before parsing. I tried it with the limit at 100K and it couldn't get YouTube titles.
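
A minimal sketch of the "read only the first 200K, then parse the truncated HTML" approach, using only the standard library. This is an illustration rather than the code merged here: `title_from_stream`, `_TitleParser`, and `MAX_BYTES` are hypothetical names, "200K" is assumed to mean 200 * 1024 bytes, and the input is assumed to be any object with a `read(n)` method.

```python
import io
from html.parser import HTMLParser

MAX_BYTES = 200 * 1024  # assumption: "200K" taken as 200 KiB


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_from_stream(stream, limit=MAX_BYTES):
    """Read at most `limit` bytes and parse only that truncated prefix."""
    head = stream.read(limit)
    parser = _TitleParser()
    # html.parser tolerates input that ends mid-document.
    parser.feed(head.decode("utf-8", errors="replace"))
    title = "".join(parser.parts).strip()
    return title or None


# Usage: a page with an early title followed by ~1 MB of filler.
page = b"<html><head><title>Hello</title></head>" + b"x" * 1_000_000
print(title_from_stream(io.BytesIO(page)))
```

The trade-off discussed in this thread shows up directly in `limit`: a title that starts after the cutoff is simply not found, which is why too small a limit misses pages with large inline blocks ahead of the title.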