Not extracting YouTube links

jaypinho commented 5 years ago

I'm trying the following:

    import newspaper

    article = newspaper.Article(url="https://www.nytimes.com/2019/10/14/dining/pasta-grannies-youtube-cookbook.html", memoize_articles=False)
    article.download()
    article.parse()
    print(article.movies)

It returns an empty list, even though the article links to multiple YouTube videos.

Is the movies attribute intentionally limited to embeds and not links? If so, is there another way to obtain the list of videos that the article links to?

Kerl1310 commented 4 years ago

@jaypinho Looking at the code, you're correct. newspaper looks for the following tags:

    VIDEOS_TAGS = ['iframe', 'embed', 'object', 'video']
    VIDEO_PROVIDERS = ['youtube', 'vimeo', 'dailymotion', 'kewego']

If @codelucas thinks it's a worthwhile change, we could potentially use something like the following RegEx to try and extract the relevant links:

    http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)([\w\-\_]*)(&(amp;)?[\w\?=]*)?

In the meantime, you could extract the HTML using newspaper and then use the RegEx yourself?
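For anyone who wants to try the workaround, here is a rough sketch of running that RegEx over raw HTML. It is not part of newspaper: the names `YOUTUBE_LINK_RE`, `find_youtube_ids`, and `sample_html` are made up for illustration, and the pattern is the one above with the redundant escapes dropped. It only catches `youtube.com/watch?v=` and `youtu.be/` links, so other URL shapes would need more work.

```python
import re

# Same pattern as in the comment above, minus redundant escapes.
# Group 1 captures the video ID. (Hypothetical name, not a newspaper API.)
YOUTUBE_LINK_RE = re.compile(
    r"http(?:s?)://(?:www\.)?youtu(?:be\.com/watch\?v=|\.be/)([\w\-]*)(&(amp;)?[\w?=]*)?"
)


def find_youtube_ids(html):
    """Return the unique YouTube video IDs linked in html, in order of appearance."""
    ids = []
    for match in YOUTUBE_LINK_RE.finditer(html):
        video_id = match.group(1)
        # Skip empty IDs and IDs already collected.
        if video_id and video_id not in ids:
            ids.append(video_id)
    return ids


# Stand-in HTML for demonstration; the video IDs are invented.
sample_html = (
    '<p><a href="https://www.youtube.com/watch?v=abc123_XY">one</a> '
    '<a href="https://youtu.be/def-456">two</a> '
    '<a href="https://www.youtube.com/watch?v=abc123_XY&amp;t=10">repeat</a></p>'
)
print(find_youtube_ids(sample_html))
```

With newspaper, the raw page should be available as `article.html` after `article.download()`, so `find_youtube_ids(article.html)` would be the equivalent call. I have not checked this against the NYT article from the original post.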

jaypinho commented 4 years ago

Thanks! I'll try that.

codelucas / newspaper

Not extracting YouTube links #745