dart-lang / html

Dart port of html5lib. For parsing HTML/HTML5 with Dart. Works in the client and on the server.
https://pub.dev/packages/html

getElementsByTagName doesn't work on some sites #121

Open sgehrman opened 3 years ago

sgehrman commented 3 years ago

I'm scraping websites to get the title, description, and other metadata, but it doesn't work on all sites.

For example: https://www.youtube.com/watch?v=3AIZAGwMRg8

final List elements = document.head.getElementsByTagName('title');

elements returns []

But other sites work just fine, like https://apple.com

I'm also using:

  final List<Element> metas = document.head.getElementsByTagName('meta');

On that site, I'm also not seeing all of the meta tags.

TheYuriG commented 3 years ago

It won't work because that content is rendered by JavaScript, which this library does not run.

Disable JavaScript in your browser before loading a page, and you can see what can be scraped and what cannot.

I installed a Chrome extension to do this (https://chrome.google.com/webstore/detail/toggle-javascript/cidlcjdalomndpeagkjpnefhljffbnlo), but you can also do it by pressing F12 to open the console and then pressing `Cntr

If you NEED JavaScript, I recommend running a library like Puppeteer first and then parsing the post-rendered HTML.

YouTube also has an API you can use instead of scraping their site. See if that fits your needs.
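
To illustrate the point about JavaScript, here is a minimal sketch (not taken from the thread) that parses a hand-written page whose title is only ever set by a script. The HTML string and the helper names `countTitles` and `firstMetaContent` are made up for the example; the only library calls used are `parse` and `getElementsByTagName` from `package:html`.

```dart
import 'package:html/parser.dart' show parse;

// Hypothetical markup: the title is assigned by a script, the meta tag is
// present in the markup the server sent.
const scriptedPage = '''
<html><head>
<script>document.title = "Set by script";</script>
<meta name="description" content="Served in the markup">
</head><body></body></html>''';

// Number of <title> elements the parser finds in <head>.
int countTitles(String html) {
  final document = parse(html);
  return (document.head?.getElementsByTagName('title') ?? []).length;
}

// The content attribute of the first <meta> in <head>, or null if there is none.
String? firstMetaContent(String html) {
  final document = parse(html);
  final metas = document.head?.getElementsByTagName('meta') ?? [];
  return metas.isEmpty ? null : metas.first.attributes['content'];
}

void main() {
  // The script is kept as text and never executed, so no <title> exists.
  print(countTitles(scriptedPage)); // 0
  print(firstMetaContent(scriptedPage)); // Served in the markup
}
```

Only tags that are already in the downloaded markup are visible to the parser, which is why an empty list can come back even though the browser shows a title.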

shawnlauzon commented 1 year ago

If you set the User-Agent header to that of a bot when retrieving the document, the response will include all of the tags.
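
A rough sketch of that approach, assuming `package:http` is used to fetch the page. The `botUserAgent` string and the `extractMetadata` helper are assumptions made for illustration, not something specified in this thread, and the site may respond differently to other User-Agent values.

```dart
import 'package:html/parser.dart' show parse;
import 'package:http/http.dart' as http;

// Assumed crawler-style User-Agent; the thread does not say which value to use.
const botUserAgent =
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// Hypothetical helper: pulls the title and the description meta tag out of
// already-downloaded HTML. Missing tags come back as null.
Map<String, String?> extractMetadata(String html) {
  final document = parse(html);
  final title = document.querySelector('title')?.text;
  final description = document
      .querySelector('meta[name="description"]')
      ?.attributes['content'];
  return {'title': title, 'description': description};
}

Future<void> main() async {
  final url = Uri.parse('https://www.youtube.com/watch?v=3AIZAGwMRg8');
  // Send the request with the bot User-Agent instead of the default one.
  final response = await http.get(url, headers: {'User-Agent': botUserAgent});
  print(extractMetadata(response.body));
}
```

Keeping the parsing in its own function means it can be checked against a saved HTML file without making a network request.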