codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Redirect should follow meta refresh #260

Closed adamn closed 8 years ago

adamn commented 8 years ago

If newspaper goes to a page like this:

https://www.google.com/url?rct=j&sa=t&url=http://sfbay.craigslist.org/eby/cto/5617800926.html&ct=ga&cd=CAAYATIaYTc4ZTgzYjAwOTAwY2M4Yjpjb206ZW46VVM&usg=AFQjCNF7zAl6JPuEsV4PbEzBomJTUpX4Lg

It receives HTML like this:

<script>window.googleJavaScriptRedirect=1</script><script>var n={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};n.navigateTo(window.parent,window,"http://sfbay.craigslist.org/eby/cto/5617800926.html");
</script><noscript><META http-equiv="refresh" content="0;URL='http://sfbay.craigslist.org/eby/cto/5617800926.html'"></noscript>

Which I got from a Google Alert feed:

https://www.google.com/alerts/feeds/02224275995138650773/15887173320590421756

Then it does not follow the meta refresh link inside the HTML.

The underlying Requests library doesn't inspect the HTML, so I think it makes sense for Newspaper to handle this situation with a new flag (follow_meta_refresh ?) that would default to False because of the performance implications.

yprez commented 8 years ago

@adamn newspaper uses requests to fetch the HTML and doesn't execute any JavaScript on the page. That's what causes the redirect bugs and some issues with dynamic pages.

If you want to download dynamic pages, you can try fetching the HTML with a headless browser like phantom.js and then pass its output to newspaper. Executing JS is a big undertaking, and I don't see it being added to newspaper anytime soon.
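A rough sketch of that workflow, using Selenium with headless Chrome as a stand-in for phantom.js; the input_html keyword on download() is an assumption to verify against your newspaper version:

# Sketch: render a JS-heavy page externally, then hand the HTML to newspaper.
# Assumes Selenium with a headless Chrome driver is available; the
# input_html keyword on download() should be checked against your
# installed newspaper version.
from selenium import webdriver
from newspaper import Article

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

url = "http://example.com/some-dynamic-article"  # hypothetical URL
driver.get(url)
rendered_html = driver.page_source  # HTML after JS execution
driver.quit()

article = Article(url)
article.download(input_html=rendered_html)  # skip newspaper's own HTTP fetch
article.parse()
print(article.title)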

yprez commented 8 years ago

Maybe there's a way to plug in a different method of downloading articles than requests, without breaking the current API. That could be useful...

adamn commented 8 years ago

Sorry, I wasn't thinking when I wrote the ticket. What I meant is that this would have to be an option on download(), not parse() - and it should default to False since, afaik, download() doesn't parse the HTML currently.

However, for users who want to follow the meta refresh tag, it could really make life easier if Newspaper handled it rather than each dev writing custom code.

I would not suggest interpreting any JavaScript, just parsing the HTML, looking for <meta http-equiv="refresh" />, and parsing its content value.

A wrinkle here is that the refresh might be intended to reload the page regularly for advertising or updates, so the URL in the content value should only be followed if the delay is 0.
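A minimal sketch of that check, using lxml (which newspaper already depends on); the helper name and the regex are illustrative, not newspaper's actual code:

# Sketch: extract a meta-refresh target, following it only when the delay is 0.
import re
from lxml import html as lxml_html

def get_meta_refresh_url(page_html):
    tree = lxml_html.fromstring(page_html)
    for meta in tree.xpath("//meta"):
        if meta.get("http-equiv", "").lower() != "refresh":
            continue
        content = meta.get("content", "")
        # e.g. content="0;URL='http://sfbay.craigslist.org/eby/cto/5617800926.html'"
        match = re.match(r"\s*(\d+)\s*;\s*url\s*=\s*['\"]?([^'\"]+)", content, re.I)
        if match and int(match.group(1)) == 0:
            # Note: a relative URL here would need to be resolved against the page URL.
            return match.group(2)
    return None

Inside download() / network.get_html() this could run on the first response and trigger at most one extra request when a redirect URL comes back.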

yprez commented 8 years ago

Oh, I misread this one, sorry...

The meta refresh makes more sense. So yeah, basically it should be handled in download() -> network.get_html(). It feels a bit weird sticking parsing code in there. I'm not sure how to handle this one, to be honest.

How often are you seeing pages like this? I don't think I've seen many meta redirects around, but then again, I'm providing the final article URLs to newspaper from other sources most of the time...

adamn commented 8 years ago

I agree it's weird to parse the HTML inside the download() method ... but it seems like the only solution, since the assumption that redirects only happen at the HTTP level is wrong :-/

We only see this on the Google Alerts RSS feeds ... but we have a lot of them.

yprez commented 8 years ago

@codelucas what do you think?

logandhead commented 8 years ago

Just created a pull request that supports this. https://github.com/codelucas/newspaper/pull/263

codelucas commented 8 years ago

@yprez I think the PR from @logandhead is a good intermediate solution. Feel free to merge it if you agree :)
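For reference, usage might then look something like this, assuming the merged change exposes a follow_meta_refresh switch on the config (name per the suggestion above; check the released version for the exact option):

# Sketch of opting in to meta-refresh following, assuming the PR exposes
# a follow_meta_refresh option on Config (verify against your version).
from newspaper import Article, Config

config = Config()
config.follow_meta_refresh = True  # default stays False for performance

url = ("https://www.google.com/url?rct=j&sa=t"
       "&url=http://sfbay.craigslist.org/eby/cto/5617800926.html")
article = Article(url, config=config)
article.download()  # should end up fetching the craigslist page
article.parse()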

codelucas commented 8 years ago

I'm worried that, long-term, as more websites become less and less static (via JS), our library will lose scraping accuracy. It would be cool if download() became smarter and fetched the HTML in multiple rounds of requests, depending on whether the HTML/JS indicates a redirect that needs following.

But I agree with you guys that, for now, download()'s default behavior (without any arguments) should definitely be to download the HTML with a single request and no parsing.

logandhead commented 8 years ago

Right... Rendering JavaScript-built apps that use frameworks like Angular and React would be a nice feature to support. The future of web applications is definitely heading in that direction, where most of the app is built via JS. Perhaps it could be an optional argument in the future. The only downside is that it will slow down the scraping side.

yprez commented 8 years ago

Closed via #263