codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Parsing Incorrectly for Yahoo Finance #855

Open jtara1 opened 3 years ago

jtara1 commented 3 years ago

for article links like

| url | action required to view all |
| --- | --- |
| https://finance.yahoo.com/m/ca7cfbea-6f81-34b7-a482-1f7723376eff/ford-crushes-q3-views-with.html | Read More button |
| https://finance.yahoo.com/news/epa-data-shows-tesla-excels-200035538.html | |
| https://finance.yahoo.com/news/q3-gdp-gross-domestic-product-usa-coronavirus-pandemic-181533194.html | Story Continues button to view the remainder of the text |

use case

from newspaper import Article

# Alternative test URL: needs a redirect (clicking on Read More)
# url = 'https://finance.yahoo.com/m/ca7cfbea-6f81-34b7-a482-1f7723376eff/ford-crushes-q3-views-with.html'
url = 'https://finance.yahoo.com/news/epa-data-shows-tesla-excels-200035538.html'

a = Article(url)
a.download()  # fetch the page HTML
a.parse()     # extract title, text, and metadata
print(a.text[:300])  # set a breakpoint here to inspect the parsed article

The value of a.text is wrong: instead of the main article's text, it contains the text of a suggested article that appears further down the page.

sysmetic commented 3 years ago

Did you solve the issue? Could you tell me how you handled it?

jtara1 commented 3 years ago

I did solve it partially. I check whether the show more button leads to another site and, if it does, run article parsing on that page. Otherwise, I grab the text and title from the current page myself and assign those values to article.title and article.text.

I only care about text and title. I'll post the implementation in a day or two. I'm using pyquery to do the HTML parsing with CSS selectors.
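
For readers following along, here is a minimal sketch of the approach described above. It is not jtara1's actual implementation. The three CSS selectors and the function names are hypothetical placeholders, not Yahoo Finance's real markup, so they would need to be replaced after inspecting the page. It assumes pyquery and newspaper3k are installed.

```python
from typing import Tuple
from urllib.parse import urljoin, urlparse

# Hypothetical selectors: placeholders only, NOT taken from Yahoo Finance's markup.
READ_MORE_SELECTOR = "a.read-more"
TITLE_SELECTOR = "h1"
BODY_SELECTOR = "div.article-body p"


def leads_off_site(page_url: str, href: str) -> bool:
    """Return True if href (absolute or relative) points to a different host."""
    target_host = urlparse(urljoin(page_url, href)).netloc
    return target_host != "" and target_host != urlparse(page_url).netloc


def extract_title_and_text(page_url: str, html: str) -> Tuple[str, str]:
    """Return (title, text) for a page whose HTML has already been fetched."""
    # Third-party imports are local so leads_off_site works without them.
    from newspaper import Article
    from pyquery import PyQuery

    doc = PyQuery(html)
    href = doc(READ_MORE_SELECTOR).attr("href")

    if href and leads_off_site(page_url, href):
        # The button links to the original publisher: parse that page instead.
        article = Article(urljoin(page_url, href))
        article.download()
        article.parse()
        return article.title, article.text

    # Otherwise pull the title and body directly with CSS selectors.
    return doc(TITLE_SELECTOR).text(), doc(BODY_SELECTOR).text()
```

The returned values can then be assigned to article.title and article.text on the original Article object. The HTML for the first call can come from a.html after a.download().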

DrElyt commented 3 years ago

Hey mate, can you share the implementation for the fix? Thank you!

calpa commented 3 years ago

Any updates?