codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.13k stars 2.12k forks

Not extracting YouTube links #745

Open jaypinho opened 5 years ago

jaypinho commented 5 years ago

I'm trying the following:

    import newspaper

    article = newspaper.Article(url="https://www.nytimes.com/2019/10/14/dining/pasta-grannies-youtube-cookbook.html", memoize_articles=False)
    article.download()
    article.parse()
    print(article.movies)

And it's returning an empty array, even though there are multiple YouTube videos linked to in the article.

Is the movies function intentionally only for embeds and not for links? If so, is there another way to obtain the list of videos that the article links to?

Kerl1310 commented 4 years ago

@jaypinho Looking at the code, you're correct. newspaper looks for the following tags:

    VIDEOS_TAGS = ['iframe', 'embed', 'object', 'video']
    VIDEO_PROVIDERS = ['youtube', 'vimeo', 'dailymotion', 'kewego']

If @codelucas thinks it's a worthwhile change, we could potentially use something like the following RegEx to try and extract the relevant links:

    http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)([\w\-\_]*)(&(amp;)?[\w\?=]*)?

In the meantime, you could extract the HTML using newspaper and then use the RegEx yourself?
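A minimal sketch of that workaround, assuming the page HTML is already available as a string. The helper name `extract_youtube_ids` and the sample HTML are hypothetical, and the pattern is a simplified variant of the RegEx above that only captures the video ID:

```python
import re

# Simplified variant of the pattern suggested above (an assumption, not
# newspaper's own code). Group 1 captures the video ID.
YOUTUBE_LINK_RE = re.compile(
    r"https?://(?:www\.)?youtu(?:be\.com/watch\?v=|\.be/)([\w-]+)"
)


def extract_youtube_ids(html):
    """Return unique YouTube video IDs found in html, in order of first appearance."""
    seen = []
    for match in YOUTUBE_LINK_RE.finditer(html):
        video_id = match.group(1)
        if video_id not in seen:
            seen.append(video_id)
    return seen


# Hypothetical sample markup standing in for a downloaded article.
sample_html = (
    '<p><a href="https://www.youtube.com/watch?v=abc123_XY">one</a> '
    '<a href="https://youtu.be/def-456">two</a> '
    '<a href="https://www.youtube.com/watch?v=abc123_XY&amp;t=10">dup</a></p>'
)
print(extract_youtube_ids(sample_html))
```

With newspaper, the string to pass in would presumably be `article.html` after calling `article.download()`. Note that this only catches `youtube.com/watch?v=` and `youtu.be/` links; other URL shapes would need extra alternatives in the pattern.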

jaypinho commented 4 years ago

Thanks! I'll try that.