Open Tjatse opened 9 years ago
Hi @mxr576, thanks a lot. There was a bug in setting host on headers in req-fast. I've fixed it and added your issue as a test case under the test directory. It works fine now; just update read-art to the latest version and try it out.
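The actual req-fast fix is not shown in this thread. As a rough illustration only, a correct Host header for a request can be derived from the target URL with Node's built-in `URL` class. The helper name `hostHeaderFor` is hypothetical and is not part of req-fast or read-art.

```javascript
// Minimal sketch, NOT req-fast's actual code: derive the Host header value
// from the request URL. `hostHeaderFor` is a hypothetical helper name.
function hostHeaderFor(rawUrl) {
  // The WHATWG URL class is a global in modern Node.js.
  const url = new URL(rawUrl);
  // `url.host` is the hostname plus the port, with the port omitted
  // when it is the default for the scheme (80 for http, 443 for https).
  return url.host;
}

// Build a headers object for an outgoing request (illustrative only).
function buildHeaders(rawUrl, extra = {}) {
  return Object.assign({}, extra, { host: hostHeaderFor(rawUrl) });
}

console.log(buildHeaders('http://mp.weixin.qq.com/s?mid=400815255'));
```

A server that receives a Host header not matching the requested site can answer with 400 Bad Request, which is consistent with the error reported in this issue.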
Thanks for the fast reaction! I also suspected that this was a req-fast issue. I can confirm that content extraction now works fine on these links with read-art.
@Tjatse, for the URL http://mp.weixin.qq.com/s?__biz=MjYyMzc1Mjk4MA==&mid=400815255&idx=1&sn=d91b630394b8ba70209406bbf44b41e8&scene=0#wechat_redirect, where the article consists of pictures, the result is:
<div> <strong class="profile_nickname">搞笑集中营</strong>
<p class="profile_meta"> <span class="profile_meta_value">WeiGaoXiao</span> </p>
<p class="profile_meta"> <span class="profile_meta_value">搞笑段子、搞笑视频、搞笑幽默、搞笑糗事、内涵漫画……等等搞笑的搞笑,这里是搞笑集中营,一网打尽所有的搞笑,让你天天笑哈哈哈哈哈哈哈~</span>
https://medium.com/google-developers/drawing-a-rounded-corner-background-on-text-5a610a95af5 The entire article is not crawled.
Hi!
I'm using your module in my web crawler, Web page Content Extractor (wce), and I recently discovered that read-art returns "Error: 400 Bad Request" for these URLs, whereas node-readability handles them without any problem. Could you please check them?