dart-lang / tools

This repository is home to tooling-related Dart packages.
BSD 3-Clause "New" or "Revised" License

getElementsByTagName doesn't work on some sites #1035

Open sgehrman opened 4 years ago

sgehrman commented 4 years ago

I'm scraping a website to get the title, description, and other metadata, but it's not working on all sites.

For example: https://www.youtube.com/watch?v=3AIZAGwMRg8

final List elements = document.head.getElementsByTagName('title');

elements is an empty list: []

But other sites work just fine, like https://apple.com

I'm also using:

  final List<Element> metas = document.head.getElementsByTagName('meta');

On that site, I'm also not seeing all of the meta tags.

TheYuriG commented 3 years ago

It won't work because all of that is rendered through JavaScript, which this library does not run.

Disable JavaScript before loading a page to see what can and cannot be scraped.
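
To illustrate the point, here is a minimal sketch, assuming `package:html` is a dependency. The page source and the helper `countTags` are made up for the example: the `<title>` is only created by a script, so the parser, which never executes scripts, does not find it, while the static `<meta>` tag is found.

```dart
import 'package:html/parser.dart' as html_parser;

// Hypothetical page source: the <title> element is only created by a script
// at runtime, while the <meta> tag is part of the static markup.
const source = '''
<html>
  <head>
    <meta name="description" content="static description">
    <script>
      const t = document.createElement('title');
      t.textContent = 'Set by script';
      document.head.appendChild(t);
    </script>
  </head>
  <body></body>
</html>
''';

/// Counts how many elements with [tag] the parser finds in the <head>.
/// The script above is never executed; it is kept as plain text.
int countTags(String html, String tag) {
  final head = html_parser.parse(html).head;
  return head?.getElementsByTagName(tag).length ?? 0;
}

void main() {
  print(countTags(source, 'title')); // 0: the title exists only after JS runs
  print(countTags(source, 'meta')); // 1: static tags are found
}
```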

I installed a Chrome extension to do this (https://chrome.google.com/webstore/detail/toggle-javascript/cidlcjdalomndpeagkjpnefhljffbnlo), but you can also do it by pressing F12 to open the console and then pressing `Cntr

If you NEED JavaScript, I recommend running a library like Puppeteer first and then parsing the post-rendered HTML.

YouTube also has an API you can tap into instead of scraping their site. See if that fits your needs.

shawnlauzon commented 2 years ago

If you set the User-Agent header to that of a bot when retrieving the document, the site will return all of the tags.
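
A rough sketch of that approach, assuming `package:http` and `package:html` are dependencies. The `botUserAgent` value and the helpers `extractMetadata` and `fetchMetadata` are illustrative names, not part of either package, and which User-Agent string a given site responds to is an assumption you would need to check yourself.

```dart
import 'package:html/parser.dart' as html_parser;
import 'package:http/http.dart' as http;

// Assumption: an example crawler-style User-Agent string. The value that a
// particular site treats as a bot may differ.
const botUserAgent =
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

/// Collects the <title> text and the name/property -> content pairs of the
/// <meta> tags found in the <head> of already-downloaded HTML.
Map<String, String> extractMetadata(String html) {
  final head = html_parser.parse(html).head;
  final result = <String, String>{};
  if (head == null) return result;

  final titles = head.getElementsByTagName('title');
  if (titles.isNotEmpty) result['title'] = titles.first.text.trim();

  for (final meta in head.getElementsByTagName('meta')) {
    // Open Graph tags use "property"; most other meta tags use "name".
    final key = meta.attributes['name'] ?? meta.attributes['property'];
    final content = meta.attributes['content'];
    if (key != null && content != null) result[key] = content;
  }
  return result;
}

/// Downloads [url] with the bot User-Agent header and parses the response.
Future<Map<String, String>> fetchMetadata(String url) async {
  final response = await http.get(
    Uri.parse(url),
    headers: {'User-Agent': botUserAgent},
  );
  return extractMetadata(response.body);
}

Future<void> main() async {
  final metadata =
      await fetchMetadata('https://www.youtube.com/watch?v=3AIZAGwMRg8');
  metadata.forEach((key, value) => print('$key: $value'));
}
```

Keeping `extractMetadata` separate from the network call makes it possible to check the parsing against saved HTML without hitting the site.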