TeamWiseFlow / wiseflow

Wiseflow is an agile information mining tool that extracts concise messages from various sources such as websites, WeChat official accounts, social platforms, etc. It automatically categorizes and uploads them to the database.

not mp format #13

Closed colin4k closed 1 month ago

colin4k commented 1 month ago

Collection URL: http://mp.weixin.qq.com/mp/homepage?__biz=MzAwMzA1NzgwNg==&hid=3&sn=6e5c34ecf07d8cd3d277cf4459fcb015&scene=18#wechat_redirect The backend reports the error "not mp format". So what exactly counts as a valid collection URL?

core-1 | 2024-07-09 14:38:41.742 | DEBUG | insights:pipeline:28 - start processing http://mp.weixin.qq.com/mp/homepage?__biz=MzAwMzA1NzgwNg==&hid=3&sn=6e5c34ecf07d8cd3d277cf4459fcb015&scene=18#wechat_redirect
core-1 | 2024-07-09 14:38:42.114 | WARNING | scrapers.mp_crawler:mp_crawler:60 - not mp format: https://mp.weixin.qq.com/mp/homepage?__biz=MzAwMzA1NzgwNg==&hid=3&sn=6e5c34ecf07d8cd3d277cf4459fcb015&scene=18#wechat_redirect
core-1 | 'NoneType' object has no attribute 'text'
core-1 | 2024-07-09 14:38:42.114 | ERROR | insights:pipeline:38 - got article failed, pipeline abort

colin4k commented 1 month ago

The full error log is as follows:
Attaching to core-1
core-1 | bash: warning: setlocale: LC_ALL: cannot change locale (zh_CN.UTF-8)
core-1 | 2024/07/09 15:04:16 Server started at http://0.0.0.0:8090
core-1 | ├─ REST API: http://0.0.0.0:8090/api/
core-1 | └─ Admin UI: http://0.0.0.0:8090/_/
core-1 | INFO: Will watch for changes in these directories: ['/app']
core-1 | INFO: Uvicorn running on http://0.0.0.0:8077 (Press CTRL+C to quit)
core-1 | INFO: Started reloader process [1] using WatchFiles
core-1 | 2024-07-09 15:04:17.257 | DEBUG | utils.pb_api:init:12 - initializing pocketbase client: http://host.docker.internal:8090
core-1 | 2024-07-09 15:04:17.538 | INFO | utils.pb_api:init:22 - pocketbase ready authenticated as admin - xx@xx.com
core-1 | 2024-07-09 15:04:17.550 | INFO | main:schedule_pipeline:19 - task execute loop 1
core-1 | 2024-07-09 15:04:17.550 | INFO | main:process_site:11 - applying https://mp.weixin.qq.com/mp/homepage?__biz=MzAwMzA1NzgwNg==&hid=3&sn=6e5c34ecf07d8cd3d277cf4459fcb015&scene=18&uin=&key=&lang=zh_CN&ascene=7
core-1 | 2024-07-09 15:04:17.550 | DEBUG | insights:pipeline:28 - start processing https://mp.weixin.qq.com/mp/homepage?__biz=MzAwMzA1NzgwNg==&hid=3&sn=6e5c34ecf07d8cd3d277cf4459fcb015&scene=18&uin=&key=&lang=zh_CN&ascene=7
core-1 | 2024-07-09 15:04:17.588 | DEBUG | utils.pb_api:init:12 - initializing pocketbase client: http://host.docker.internal:8090
core-1 | 2024-07-09 15:04:17.806 | INFO | utils.pb_api:init:22 - pocketbase ready authenticated as admin - xx@xx.com
core-1 | 2024-07-09 15:04:17.817 | WARNING | scrapers.mp_crawler:mp_crawler:60 - not mp format: https://mp.weixin.qq.com/mp/homepage?__biz=MzAwMzA1NzgwNg==&hid=3&sn=6e5c34ecf07d8cd3d277cf4459fcb015&scene=18&uin=&key=&lang=zh_CN&ascene=7
core-1 | 'NoneType' object has no attribute 'text'
core-1 | 2024-07-09 15:04:17.817 | ERROR | insights:pipeline:38 - got article failed, pipeline abort
core-1 | 2024-07-09 15:04:17.817 | INFO | main:schedule_pipeline:23 - task execute loop finished, work after 3600 seconds

wwz223 commented 1 month ago

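@colin4k The current version of wiseflow does not yet support official-account article lists; it only supports individual official-account articles. The reason is that we cannot obtain the set of URLs that make up an account's article list. If you have a good approach, we would welcome a discussion 👏🏻👏🏻

If you need to fetch an official account's article list, you will have to pair wiseflow with wxbot. Keep an eye on an upcoming open-source project of ours, which will provide an integrated solution.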
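Editor's sketch: the distinction drawn above (single articles are supported, list or collection pages are not) could be checked before a URL is handed to the crawler. The snippet below is a minimal heuristic, not wiseflow's actual code. The function name `classify_mp_url` is hypothetical, and the path patterns are assumptions: `/s` and `/s/<id>` for single articles, `/mp/homepage` for collection pages as in the URL reported in this issue.

```python
from urllib.parse import urlparse


def classify_mp_url(url: str) -> str:
    """Return 'article', 'collection', or 'other' for a URL (heuristic only)."""
    parts = urlparse(url)
    if parts.hostname != "mp.weixin.qq.com":
        return "other"
    path = parts.path.rstrip("/")
    # Assumption: single articles use /s?__biz=... or /s/<id>.
    if path == "/s" or path.startswith("/s/"):
        return "article"
    # Assumption: collection pages use /mp/homepage, like the URL in this issue.
    if path == "/mp/homepage":
        return "collection"
    return "other"
```

A check like this would let a pipeline skip or flag a collection URL up front instead of failing later while parsing the page.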

wwz223 commented 1 month ago

@colin4k I see that your link does return a set of URLs. Where did you get it from, or is it built with some specific concatenation logic?

colin4k commented 1 month ago

It is simply that official account's own column.

colin4k commented 1 month ago

This is the URL of a column that belongs to a single official account.

wwz223 commented 1 month ago

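@colin4k As I understand it, you want to read the complete article list of a given official account. That is not supported at the moment; you need to add the article URLs one by one under the sites menu of the PocketBase admin site.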
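Editor's sketch: adding article URLs one by one could in principle be scripted against PocketBase's record-creation endpoint (`POST /api/collections/<collection>/records`). The snippet below only builds the requests and sends nothing. The helper name `build_site_requests` is hypothetical, and it assumes that the collection is named `sites` and that a record needs only a `url` field. Neither is confirmed in this thread, and the real collection may require more fields.

```python
import json
from urllib.request import Request


def build_site_requests(base_url, token, article_urls):
    """Build one record-creation request per article URL (nothing is sent)."""
    # Assumption: the collection is called "sites", as in the admin menu name.
    endpoint = f"{base_url.rstrip('/')}/api/collections/sites/records"
    requests = []
    for article_url in article_urls:
        # Assumption: the record stores the address in a field named "url".
        body = json.dumps({"url": article_url}).encode("utf-8")
        requests.append(
            Request(
                endpoint,
                data=body,
                method="POST",
                headers={
                    "Content-Type": "application/json",
                    "Authorization": token,  # auth token obtained separately
                },
            )
        )
    return requests
```

Each request could then be passed to `urllib.request.urlopen` once the field names have been checked against the actual collection schema.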

wwz223 commented 1 month ago

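If I want to get the URL of a given official account's article list, what exactly are the steps?

The answer will shape the design and implementation of the official-account monitoring feature planned for later versions.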

colin4k commented 1 month ago

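If articles still have to be entered by hand, doesn't that defeat the purpose of automatic crawling? Could wiseflow automatically monitor a collection URL, check whether new articles have been added, and then crawl and parse them?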
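Editor's sketch: the monitoring idea suggested here comes down to collecting article links from a collection page and comparing them with the links already seen. The snippet below illustrates only that comparison, using the standard library. The names `ArticleLinkParser` and `find_new_articles` are hypothetical, and it assumes the article links appear as plain `<a href>` tags in the page HTML. A real collection page may load its list through JavaScript, in which case this would find nothing.

```python
from html.parser import HTMLParser

# Assumption: single-article links start with one of these prefixes.
ARTICLE_PREFIXES = (
    "http://mp.weixin.qq.com/s/",
    "http://mp.weixin.qq.com/s?",
    "https://mp.weixin.qq.com/s/",
    "https://mp.weixin.qq.com/s?",
)


class ArticleLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like article links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(ARTICLE_PREFIXES):
            self.links.append(href)


def find_new_articles(page_html, seen):
    """Return article links in page_html that are not in seen, in page order."""
    parser = ArticleLinkParser()
    parser.feed(page_html)
    new_links = []
    for link in parser.links:
        if link not in seen and link not in new_links:
            new_links.append(link)
    return new_links
```

A scheduled task could call `find_new_articles` on each run, queue the returned links for crawling, and add them to the seen set.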

colin4k commented 1 month ago

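Every official account can define collections. Open the menu bar at the bottom of the official account and you will see them; from there you can copy the URL. So I suggest adding support for this kind of collection URL.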

colin4k commented 1 month ago

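Another option is to try Sogou WeChat search, for example this URL: https://gzh.sogou.com/weixin?type=2&s_from=input&query=%E4%BF%A1%E7%94%A8%E5%8D%A1&ie=utf8 I tried entering it into sites and it can be crawled, but the results are not very good. I think it could be optimized specifically for this case.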
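Editor's sketch: the Sogou URL above carries its search keyword percent-encoded as UTF-8 in the `query` parameter. The snippet below shows how such a URL could be built for another keyword and how the keyword can be read back. The helper names are hypothetical; the other parameters (`type`, `s_from`, `ie`) are copied as-is from the URL in this comment and their meaning is not confirmed here.

```python
from urllib.parse import parse_qs, urlencode, urlparse

SOGOU_SEARCH = "https://gzh.sogou.com/weixin"


def build_sogou_url(keyword):
    """Build a search URL shaped like the one quoted in this thread."""
    # Parameter values other than the keyword are copied from the thread's URL.
    params = {"type": 2, "s_from": "input", "query": keyword, "ie": "utf8"}
    return f"{SOGOU_SEARCH}?{urlencode(params)}"


def keyword_of(url):
    """Decode the search keyword from the 'query' parameter of a search URL."""
    return parse_qs(urlparse(url).query)["query"][0]
```

With this, `keyword_of` applied to the URL above yields the keyword 信用卡 ("credit card"), and `build_sogou_url("信用卡")` reproduces that URL.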

bigbrother666sh commented 3 weeks ago

55