codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Does not fetch Arabic news #869

Open ghost opened 3 years ago

ghost commented 3 years ago

Hello, I tried it, but it did not fetch Arabic news from sites such as https://www.alarabiya.net/. I got zero articles.

My code:

import newspaper  # the newspaper3k package is imported under the name "newspaper"
news_paper = newspaper.build('https://www.alarabiya.net/', language='ar', memoize_articles=False)
johnbumgarner commented 3 years ago

Newspaper can obtain article information from the target website, but it needs additional code to get past the "accept all cookies" prompt, which has to be clicked first. Take a look at the examples in my Newspaper3k usage overview document.

ghost commented 3 years ago

I reviewed the examples but could not figure out how to bypass the cookie prompt. I appreciate your help.

johnbumgarner commented 3 years ago

The overview talks about using Selenium to bypass the "accept all cookies" prompt on websites that require you to click it before accessing content. I will look into writing an example for https://www.alarabiya.net, but it will take a couple of days before I can get to it and update the overview document.
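
For readers following along, the general approach might look something like the sketch below. It is not the example from the overview document. It assumes Selenium 4 with a local Chrome install, and it assumes the consent prompt is a `<button>` whose visible text contains a known label. The helper names (`consent_button_xpath`, `fetch_html_after_consent`, `parse_article`) and the label passed in are hypothetical; the real button on a given site has to be found by inspecting the page.

```python
def consent_button_xpath(labels):
    """Build an XPath matching a <button> whose text contains any of the labels.

    Assumption: labels contain no single quotes.
    """
    conditions = " or ".join(
        f"contains(normalize-space(.), '{label}')" for label in labels
    )
    return f"//button[{conditions}]"


def fetch_html_after_consent(url, consent_xpath, timeout=10):
    """Open url in Chrome, click the consent button if it shows up, return the HTML."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        try:
            button = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.XPATH, consent_xpath))
            )
            button.click()
        except TimeoutException:
            pass  # no matching prompt appeared within the timeout
        return driver.page_source
    finally:
        driver.quit()


def parse_article(url, html, language="ar"):
    """Hand already-downloaded HTML to Newspaper instead of letting it fetch the page."""
    from newspaper import Article

    article = Article(url, language=language)
    article.download(input_html=html)  # skips the network request
    article.parse()
    return article
```

Usage would be along the lines of `html = fetch_html_after_consent(url, consent_button_xpath(["Accept"]))` followed by `parse_article(url, html)`, where `"Accept"` stands in for whatever text the real button carries.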

ghost commented 3 years ago

Sounds great. I appreciate it.

johnbumgarner commented 3 years ago

I added a scraping example for the Al Arabiya website to my Newspaper overview document. Please note that I didn't build an entire solution for you. All the info needed to finish the code is in my overview document, and you can add it to the example yourself. Additionally, you will need to determine which URLs are important to you, because I don't read Arabic, so it's hard for me to pick the correct items. Good luck.
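
One way to narrow down which URLs to keep is to filter the collected links by host and path prefix. The sketch below is only an illustration: `filter_article_urls` is a hypothetical helper, and the path prefixes are placeholders, since the sections that matter depend on the site and on what you want to read.

```python
from urllib.parse import urlparse


def filter_article_urls(urls, allowed_host, path_prefixes):
    """Keep URLs on allowed_host whose path starts with one of path_prefixes.

    Order is preserved and exact duplicates are dropped.
    """
    kept = []
    seen = set()
    for url in urls:
        parts = urlparse(url)
        if parts.netloc != allowed_host:
            continue  # link points to another site
        if not any(parts.path.startswith(prefix) for prefix in path_prefixes):
            continue  # not in a section we care about
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

The input list could come from `news_paper.article_urls()` on a built source, or from links collected with Selenium.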

P.S. Don't forget to close this issue, because it has been solved.

johnbumgarner commented 3 years ago

@moh55m55 have you tested the code that I posted on 01-21-2021?