Gerapy / GerapyAutoExtractor

Auto Extractor Module
https://pypi.org/project/gerapy-auto-extractor/
Apache License 2.0
321 stars, 79 forks

Bug of Gerapy Auto Extractor: error when crawling forum posts #4

Open bowu678 opened 4 years ago

bowu678 commented 4 years ago

The URL being crawled is https://www.19lou.com/forum-269-1.html. The data returned by extract_list is:

[
  {
    "title": "19楼帮帮团维权月来啦!7月维权主题汽车类",
    "url": "http://www.19lou.com/forum-79-thread-42261592790646553-1-1.html"
  },
  {
    "title": "19楼帮帮团来咯,求助维权攻略请收下!",
    "url": "http://www.19lou.com/forum-79-thread-82281589267909116-1-1.html"
  },
  {
    "title": "【19楼帮帮团】每日诈骗连载!少点套路,多点幸福",
    "url": "http://www.19lou.com/forum-79-thread-82681592968362354-1-1.html"
  },
  {
    "title": "杭州人杭州事,你要知道的都在19楼",
    "url": "http://www.19lou.com/forum-269-thread-63421567731405299-1-1.html"
  },
  {
    "title": "楼外楼:杭州事【总版规】(本版不支持一切形式广告)",
    "url": "http://www.19lou.com/forum-269-thread-31532348-1-1.html"
  }
]

The data returned by extract_detail is:

{
  "title": "",
  "datetime": "2020-07-12 00:55:56",
  "content": "浙公网安备 33010002000029号"
}

None of this is the data I wanted. What I want is each post's title, its link, and the post body.
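To make the failure easier to spot in a script, here is a minimal, standard-library-only sketch that checks the output quoted in this report. It is not part of gerapy-auto-extractor: the helper names `forum_id`, `same_forum`, and `detail_problems`, the 50-character threshold, and the guess that the number after `forum-` in a 19lou URL identifies the board are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

PAGE_URL = "https://www.19lou.com/forum-269-1.html"

# Output copied from the report above (not produced by this script).
list_result = [
    {"title": "19楼帮帮团维权月来啦!7月维权主题汽车类",
     "url": "http://www.19lou.com/forum-79-thread-42261592790646553-1-1.html"},
    {"title": "19楼帮帮团来咯,求助维权攻略请收下!",
     "url": "http://www.19lou.com/forum-79-thread-82281589267909116-1-1.html"},
    {"title": "【19楼帮帮团】每日诈骗连载!少点套路,多点幸福",
     "url": "http://www.19lou.com/forum-79-thread-82681592968362354-1-1.html"},
    {"title": "杭州人杭州事,你要知道的都在19楼",
     "url": "http://www.19lou.com/forum-269-thread-63421567731405299-1-1.html"},
    {"title": "楼外楼:杭州事【总版规】(本版不支持一切形式广告)",
     "url": "http://www.19lou.com/forum-269-thread-31532348-1-1.html"},
]
detail_result = {
    "title": "",
    "datetime": "2020-07-12 00:55:56",
    "content": "浙公网安备 33010002000029号",
}


def forum_id(url):
    """Return the digits after 'forum-' in the URL path, or None (assumed URL scheme)."""
    match = re.search(r"/forum-(\d+)-", urlparse(url).path)
    return match.group(1) if match else None


def same_forum(items, page_url):
    """Keep only list items whose URL carries the same forum ID as the crawled page."""
    target = forum_id(page_url)
    return [item for item in items if forum_id(item["url"]) == target]


def detail_problems(detail, min_content_chars=50):
    """List reasons a detail result looks unusable; the threshold is arbitrary."""
    problems = []
    if not detail.get("title"):
        problems.append("empty title")
    if len(detail.get("content") or "") < min_content_chars:
        problems.append("content too short")
    return problems


if __name__ == "__main__":
    print(len(same_forum(list_result, PAGE_URL)), "of", len(list_result),
          "list items belong to the crawled forum")
    print("detail problems:", detail_problems(detail_result))
```

On the reported data, only two of the five list items share the crawled page's forum ID, and the detail result is flagged for both an empty title and a very short body. A check like this only detects the bad extraction; it does not recover the missing posts.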

Germey commented 4 years ago

Thanks. The current optimizations all target news pages, so forum support may be poor. I will keep improving it.