Bug of Gerapy Auto Extractor: error when scraping forum posts

Gerapy / GerapyAutoExtractor

Auto Extractor Module

https://pypi.org/project/gerapy-auto-extractor/

Apache License 2.0

321 stars, 79 forks

Bug of Gerapy Auto Extractor: error when scraping forum posts #4

Open bowu678 opened 4 years ago

bowu678 commented 4 years ago

The page being scraped is https://www.19lou.com/forum-269-1.html. The data returned by extract_list is:

[
  { "title": "19楼帮帮团维权月来啦！7月维权主题汽车类", "url": "http://www.19lou.com/forum-79-thread-42261592790646553-1-1.html" },
  { "title": "19楼帮帮团来咯，求助维权攻略请收下！", "url": "http://www.19lou.com/forum-79-thread-82281589267909116-1-1.html" },
  { "title": "【19楼帮帮团】每日诈骗连载！少点套路，多点幸福", "url": "http://www.19lou.com/forum-79-thread-82681592968362354-1-1.html" },
  { "title": "杭州人杭州事，你要知道的都在19楼", "url": "http://www.19lou.com/forum-269-thread-63421567731405299-1-1.html" },
  { "title": "楼外楼：杭州事【总版规】（本版不支持一切形式广告）", "url": "http://www.19lou.com/forum-269-thread-31532348-1-1.html" }
]

The data returned by extract_detail is:

{ "title": "", "datetime": "2020-07-12 00:55:56", "content": "浙公网安备 33010002000029号" }

None of this is the data I want. What I want is each thread's title, its link, and the body text of the post.
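For anyone trying to reproduce this, the report above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it assumes extract_list and extract_detail are importable from the gerapy_auto_extractor package and accept an HTML string; fetching with urllib, the GBK default encoding, and the helper names fetch_html, detail_looks_wrong, and reproduce are illustrative choices that do not come from the issue.

```python
import json
from urllib.request import Request, urlopen

# URL taken from the report above.
FORUM_URL = "https://www.19lou.com/forum-269-1.html"


def fetch_html(url, encoding="gbk"):
    """Download a page and decode it.

    The GBK default is an assumption about the site's encoding;
    undecodable bytes are replaced rather than raising an error.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode(encoding, errors="replace")


def detail_looks_wrong(detail):
    """Hypothetical check for the symptom reported here: no title extracted."""
    return not (detail.get("title") or "").strip()


def reproduce(url=FORUM_URL):
    """Run both extractors on the same page and print what they return."""
    # Imported lazily so the helpers above work without the package installed.
    # Assumption: both functions take the page HTML as a string.
    from gerapy_auto_extractor import extract_detail, extract_list

    html = fetch_html(url)
    items = extract_list(html)
    detail = extract_detail(html)

    # default=str guards against values that are not JSON-serializable.
    print(json.dumps(items, ensure_ascii=False, indent=2, default=str))
    print(json.dumps(detail, ensure_ascii=False, indent=2, default=str))
    print("detail looks wrong:", detail_looks_wrong(detail))
```

Calling reproduce() requires network access and the package to be installed. Against the output quoted above, detail_looks_wrong would flag the extract_detail result, since its title is empty.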

Germey commented 4 years ago

Thanks. The current optimizations all target news pages, so forum support may be poor. I will keep improving it.