codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Should newspaper3k bypass a wall on ft.com or medium.com? #932

Open nwatab opened 2 years ago

nwatab commented 2 years ago

This issue asks about the intended behavior of newspaper3k. Some media sites (e.g. ft.com and medium.com) put their articles behind a paywall, and newspaper3k does not get past it. For example, when you parse https://www.ft.com/content/2f081189-01dd-4549-a6b0-ab4f04a103cd, you get:

title: Subscribe to read
text: Become an FT subscriber to read:

Leverage our market expertise

Expert insights, analysis and smart data help you cut through the noise to spot trends, risks and opportunities.

Join over 300,000 Finance professionals who already subscribe to the FT.

Something similar happens on medium.com.
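
For reference, the result quoted above can be reproduced with a few lines of newspaper3k. This is only a minimal sketch: `fetch_title_and_text` and `looks_like_paywall` are hypothetical helper names, and the marker phrases are assumptions taken from the FT output in this issue, not something newspaper3k provides.

```python
# Hypothetical marker phrases, copied from the FT output quoted in this issue.
PAYWALL_MARKERS = ("subscribe to read", "become an ft subscriber")


def fetch_title_and_text(url):
    """Download and parse one article with newspaper3k."""
    from newspaper import Article  # imported lazily; requires newspaper3k

    article = Article(url)
    article.download()  # fetches the page over HTTP
    article.parse()     # extracts the title, body text and metadata
    return article.title, article.text


def looks_like_paywall(title, text):
    """Heuristic: True if the parsed result reads like a subscription prompt."""
    haystack = f"{title}\n{text}".lower()
    return any(marker in haystack for marker in PAYWALL_MARKERS)
```

Calling `looks_like_paywall(*fetch_title_and_text(url))` on the FT link would be one way to notice that only the subscription prompt was extracted. The check is a simple substring match, so it would need different markers for other sites.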

Technically, there are ways to bypass such paywalls (e.g. https://github.com/iamadamdev/bypass-paywalls-chrome). Should newspaper3k support bypassing them?

johnbumgarner commented 2 years ago

This extension has to be used with a web browser, so it will not work with Newspaper, because Newspaper fetches pages with the Python requests library.
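
To make the distinction concrete: the extension runs inside a browser, whereas Newspaper downloads the page itself. If the HTML has already been obtained some other way (for example, saved from a browser session), `Article.download` accepts it through its `input_html` parameter and skips the network request. A minimal sketch, assuming newspaper3k is installed; `parse_saved_html` is a hypothetical helper name.

```python
def parse_saved_html(url, html):
    """Parse HTML that was obtained outside of newspaper3k.

    `url` is only used as the article's source URL; no request is sent
    because the markup is supplied through `input_html`.
    """
    from newspaper import Article  # imported lazily; requires newspaper3k

    article = Article(url)
    article.download(input_html=html)  # use the given markup, skip the fetch
    article.parse()                    # extract the title and body text
    return article.title, article.text
```

This does nothing about the paywall itself. It only shows that the parsing step does not depend on how the page was fetched.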

nwatab commented 2 years ago

Sorry for the confusion. I have no intention of parsing a physical paper.

pasenidis commented 2 years ago

Try parsing with 12ft.io

johnbumgarner commented 2 years ago

What do you mean by "try parsing with 12ft.io"? Can you provide a code example?