Tjatse / node-readability

Scrape/crawl articles from any site automatically. Make any web page readable, whether it is in Chinese or English.

links that read-art can not crawl #1

Open Tjatse opened 9 years ago

mxr576 commented 9 years ago

Hi!

I'm using your module in my web crawler, Web page Content Extractor (wce), and I recently discovered that read-art returns "Error: 400 Bad Request" for these URLs, whereas node-readability handles them without any problem. Could you please check them?

Tjatse commented 9 years ago

Hi @mxr576, thanks a lot. There was a bug in how req-fast set the host in the request headers. I've fixed it and added your issue as a test case under the test directory. It works fine now, so just update read-art to the latest version and try it out.
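
For readers unfamiliar with this class of bug: a server may answer "400 Bad Request" when the `Host` header does not match the URL being requested. The snippet below is only a minimal sketch of deriving that header from the target URL with Node's built-in `URL` class. The helper names `hostHeaderFor` and `buildHeaders` are hypothetical, and this is not the actual req-fast code or its fix.

```javascript
// Hypothetical helpers for illustration; not part of req-fast or read-art.

// Return the value a Host header should carry for the given URL.
// URL#host keeps the port only when it is not the scheme's default.
function hostHeaderFor(targetUrl) {
  return new URL(targetUrl).host;
}

// Build request headers, replacing any caller-supplied Host
// (matched case-insensitively) with one derived from the target URL.
function buildHeaders(targetUrl, extra = {}) {
  const headers = {};
  for (const [name, value] of Object.entries(extra)) {
    if (name.toLowerCase() !== 'host') {
      headers[name] = value;
    }
  }
  headers.Host = hostHeaderFor(targetUrl);
  return headers;
}

console.log(buildHeaders('http://example.com:8080/a?b=1', { host: 'stale.example' }));
```

Deriving the header from the URL on every request, rather than reusing a stored value, is one way to keep the two from drifting apart, for instance after a redirect to another domain.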

mxr576 commented 9 years ago

Thanks for the fast reaction! I also suspected that this was a req-fast issue. I can confirm that content extraction now works fine on these links with read-art.

entertainyou commented 8 years ago

@Tjatse, for the URL http://mp.weixin.qq.com/s?__biz=MjYyMzc1Mjk4MA==&mid=400815255&idx=1&sn=d91b630394b8ba70209406bbf44b41e8&scene=0#wechat_redirect, whose article consists of pictures, the result is:

<div> <strong class="profile_nickname">搞笑集中营</strong>
<p class="profile_meta"> <span class="profile_meta_value">WeiGaoXiao</span> </p>
<p class="profile_meta"> <span class="profile_meta_value">搞笑段子、搞笑视频、搞笑幽默、搞笑糗事、内涵漫画……等等搞笑的搞笑,这里是搞笑集中营,一网打尽所有的搞笑,让你天天笑哈哈哈哈哈哈哈~</span>

FarmaanElahi commented 6 years ago

https://medium.com/google-developers/drawing-a-rounded-corner-background-on-text-5a610a95af5 The entire article is not crawled.