Anonyfox / meteor-scrape

Scrape any Website or RSS/Atom-Feed with ease.
GNU Lesser General Public License v3.0

Title and description not returned properly in some cases #14

Open mcoenca opened 9 years ago

mcoenca commented 9 years ago

In particular, when you scrape a page whose title contains a dash "-" or a pipe "|", the title is cut down to only the first part.

Example: Scrape.website("https://www.youtube.com/watch?v=TvyWRevLG5I") returns 'Ethereal Dreams' as the title instead of the full 'Ethereal Dreams' - Chill Mix

Scrape.website("https://www.youtube.com/watch?v=RgLDHIUl4PA") only returns 13.Best of Chill Out as the title instead of 13. Best of Chill Out | Ambient | New Age | Lounge... [HD]

Also, description sometimes returns 'true', even when no meta description tag is present.

Example: Scrape.website("https://meteorhacks.com/introduction-to-latency-compensation.html")

    ...
    lang: 'en',
    I20150429-17:06:27.303(2)? description: 'true',
    I20150429-17:06:27.303(2)? favicon: 'https://meteorhacks.com/',
    I20150429-17:06:27.303(2)? references:
    I20150429-17:06:27.303(2)? [ 'https://bulletproofmeteor.com/?utm_source=meteorhacks&utm_medium=link&utm_term=meteorhacks&utm_content=homepage&utm_campaign=meteorhacks',
    I20150429-17:06:27.304(2)? 'https://kadira.io/?utm_source=meteorhacks&utm_medium=banner&utm_term=kadira&utm_content=toplink&utm_campaign=kadira',
    I20150429-17:06:27.304(2)? 'http://www.meteor.com/',
    ...

This seems inconsistent; when no description is found it should return 'undefined' or 'notFound' instead.

If I have some time I will try to submit a pull request; these issues should not be too hard to fix :)

Thanks anyway

Anonyfox commented 9 years ago

title: yes, the current behaviour tries to find the best "title" from the page, be it the <title> tag, parts of it, the first headline <h1>, and so on. I agree that this can be confusing; we'll have a look at it.
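To illustrate how this kind of heuristic can end up truncating a title, here is a minimal sketch. It is not the actual meteor-scrape code: `splitTitle`, `pickTitle`, and the `keepFull` option are hypothetical names, and the assumption is simply that the `<title>` text gets split on " - " or " | " and the first segment wins.

```javascript
// Hypothetical helpers, NOT part of meteor-scrape. They only illustrate how
// splitting a <title> on separators and keeping one segment shortens it.

// Split a title on a dash or pipe that is surrounded by whitespace.
function splitTitle(fullTitle) {
  return fullTitle
    .split(/\s+[-|]\s+/)
    .map(function (part) { return part.trim(); })
    .filter(function (part) { return part.length > 0; });
}

// Pick a title. With keepFull (an assumed option) the whole <title> text is
// returned; otherwise only the first segment is kept.
function pickTitle(fullTitle, options) {
  var cleaned = fullTitle.trim();
  if (options && options.keepFull) {
    return cleaned;
  }
  var parts = splitTitle(cleaned);
  return parts.length > 0 ? parts[0] : cleaned;
}
```

With the first title from the report, `pickTitle("'Ethereal Dreams' - Chill Mix")` keeps only the part before the dash, while passing `{ keepFull: true }` returns the whole string. An opt-in like this would be one way to let callers choose between the "best guess" and the raw title.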

description: this is a bug, thanks for reporting! I can reproduce it, and it shouldn't be too hard to fix.
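Until that is fixed, a guard on the calling side could look roughly like the sketch below. The root cause of the 'true' value is not established in this thread, so this only normalizes the output; `normalizeDescription` is a hypothetical helper, not part of meteor-scrape, and treating the literal strings 'true' and 'false' as "no description" is an assumption.

```javascript
// Hypothetical helper, NOT part of meteor-scrape. It maps values that cannot
// be a real description to undefined so callers can test for absence.
function normalizeDescription(raw) {
  // Booleans, null, undefined, numbers, etc. are not descriptions.
  if (typeof raw !== "string") {
    return undefined;
  }
  var text = raw.trim();
  // Assumption: a bare 'true'/'false' is a stringified boolean, not real text.
  if (text === "" || text === "true" || text === "false") {
    return undefined;
  }
  return text;
}
```

A caller would wrap the scraped field, for example `normalizeDescription(result.description)`, where `result` stands for whatever object Scrape.website returned.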