getElementsByTagName doesn't work on some sites

sgehrman commented 4 years ago

I'm scraping a website to get the title, description, and other metadata, but it's not working on all sites.

  final List<Element> elements = document.head.getElementsByTagName('title');

`elements` comes back as `[]`.

But other sites work just fine, like https://apple.com

I'm also using:

  final List<Element> metas = document.head.getElementsByTagName('meta');

And on that site, I'm not seeing all the meta tags

TheYuriG commented 3 years ago

It won't work because all of that is rendered through JavaScript, which this library does not run.

Disable JavaScript before loading a page and then you can see what can be scraped and what cannot.

I installed a Chrome extension to do this (https://chrome.google.com/webstore/detail/toggle-javascript/cidlcjdalomndpeagkjpnefhljffbnlo), but you can also do it by pressing F12 to open the console and then pressing `Ctrl + Shift + P` to open the command line. Then just type "javascript" and the option is going to show up for you.

If you NEED JavaScript, I recommend running a library like Puppeteer first and then parsing the post-rendered HTML.

YouTube also has an API you can tap into, instead of scraping their site. See if that can fit your need somehow.
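
To illustrate the point about JavaScript-rendered tags, here is a minimal sketch, assuming the library in question is `package:html`. The helper `staticTitle` and the two sample pages are hypothetical names made up for this example. The parser only sees the markup the server sent, so a title that a script would set in the browser never shows up.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;

/// Returns the text of the first <title> found in the static markup, or null
/// when <head> contains no <title> (for example, because a script would add
/// it later in a real browser). Hypothetical helper for illustration.
String? staticTitle(String html) {
  final Document document = parse(html);
  final List<Element> titles =
      document.head?.getElementsByTagName('title') ?? <Element>[];
  return titles.isEmpty ? null : titles.first.text.trim();
}

// Hypothetical page that ships its title in the markup.
const String serverRendered =
    '<html><head><title>Hello</title></head><body></body></html>';

// Hypothetical page whose title is only set by a script at runtime.
// The parser keeps the script as text and never executes it.
const String scriptRendered = '<html><head>'
    '<script>document.title = "Hello";</script>'
    '</head><body></body></html>';
```

Calling `staticTitle(serverRendered)` finds the title, while `staticTitle(scriptRendered)` returns null, which matches the empty list seen in the original report.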

shawnlauzon commented 2 years ago

If you set the User-Agent to a bot when retrieving the document, then it will return all of the tags.
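
A rough sketch of that suggestion, assuming `package:http` for the request and `package:html` for parsing. The User-Agent string and the helper names `metaTags` and `fetchMetaTags` are placeholders invented for this example; which bot names a given site actually responds to is up to that site.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' show parse;
import 'package:http/http.dart' as http;

// Placeholder bot-style User-Agent; an assumption, not a value from this thread.
const String botUserAgent = 'ExamplePreviewBot/1.0 (+https://example.com/bot)';

/// Returns the <meta> elements found in the <head> of [html].
List<Element> metaTags(String html) =>
    parse(html).head?.getElementsByTagName('meta') ?? <Element>[];

/// Fetches [url] while sending the bot User-Agent, then extracts the <meta>
/// elements from the HTML that comes back.
Future<List<Element>> fetchMetaTags(Uri url) async {
  final http.Response response =
      await http.get(url, headers: {'User-Agent': botUserAgent});
  return metaTags(response.body);
}
```

For example, `await fetchMetaTags(Uri.parse('https://apple.com'))` returns the elements, and each tag's `attributes['name']` and `attributes['content']` can then be read. Comparing the result with and without the header shows whether a site varies its markup by User-Agent.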
