TeamWiseFlow / wiseflow

Wiseflow is an agile information mining tool that extracts concise messages from various sources such as websites, WeChat official accounts, social platforms, etc. It automatically categorizes and uploads them to the database.

Wrote a dedicated Facebook parser, but according to the logs the general parser is still being used. #127

Open linopluss opened 1 week ago

linopluss commented 1 week ago

Could someone please advise why the dedicated parser is not being called?

Parser registration:

```python
from urllib.parse import urlparse
from .mp_crawler import mp_crawler
from .facebook_parser import facebook_parser

def get_scraper(url):
    domain = urlparse(url).netloc.replace('www.', '')
    return scraper_map.get(domain)

scraper_map = {'mp.weixin.qq.com': mp_crawler, 'facebook.com': facebook_parser}
```
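For reference, a minimal standalone sketch of the same lookup logic (the two crawler functions here are hypothetical stubs, not the real implementations). Run in isolation, `get_scraper` does map both Facebook URLs from the log below to `facebook_parser`, so the question is whether the pipeline actually goes through this helper for these sites:

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the real crawlers, only to exercise the lookup.
def mp_crawler(url): ...
def facebook_parser(url): ...

scraper_map = {'mp.weixin.qq.com': mp_crawler, 'facebook.com': facebook_parser}

def get_scraper(url):
    # Strip the leading 'www.' before consulting the registry.
    domain = urlparse(url).netloc.replace('www.', '')
    return scraper_map.get(domain)

for url in ('https://www.facebook.com/SamsungAustralia',
            'https://www.facebook.com/LGAustralia'):
    print(url, '->', getattr(get_scraper(url), '__name__', None))
```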

Run log:

```text
2024-11-18 12:03:48 core-1 | 2024-11-18 01:03:48.853 | DEBUG | utils.pb_api:__init__:12 - initializing pocketbase client: http://127.0.0.1:8090
2024-11-18 12:03:48 core-1 | 2024-11-18 01:03:48.925 | DEBUG | utils.pb_api:__init__:12 - initializing pocketbase client: http://127.0.0.1:8090
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.202 | INFO | utils.pb_api:__init__:22 - pocketbase ready authenticated as admin - linoplus.chen@gmail.com
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.222 | INFO | __main__:schedule_pipeline:19 - task execute loop 1
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.222 | INFO | __main__:process_site:11 - applying https://www.facebook.com/SamsungAustralia
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.222 | DEBUG | insights:pipeline:34 - start processing https://www.facebook.com/SamsungAustralia
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.243 | INFO | __main__:process_site:11 - applying https://www.facebook.com/LGAustralia
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.243 | DEBUG | insights:pipeline:34 - start processing https://www.facebook.com/LGAustralia
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.257 | INFO | utils.pb_api:__init__:22 - pocketbase ready authenticated as admin - linoplus.chen@gmail.com
2024-11-18 12:03:49 core-1 | INFO: Started server process [18]
2024-11-18 12:03:49 core-1 | INFO: Waiting for application startup.
2024-11-18 12:03:49 core-1 | INFO: Application startup complete.
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.547 | INFO | scrapers.general_crawler:general_crawler:98 - can not reach
2024-11-18 12:03:49 core-1 | Client error '400 Bad Request' for url 'https://www.facebook.com/SamsungAustralia'
2024-11-18 12:03:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
2024-11-18 12:03:49 core-1 | waiting 1min
2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.566 | INFO | scrapers.general_crawler:general_crawler:98 - can not reach
2024-11-18 12:03:49 core-1 | Client error '400 Bad Request' for url 'https://www.facebook.com/LGAustralia'
2024-11-18 12:03:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
2024-11-18 12:03:49 core-1 | waiting 1min
2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.837 | ERROR | scrapers.general_crawler:general_crawler:101 - Client error '400 Bad Request' for url 'https://www.facebook.com/LGAustralia'
2024-11-18 12:04:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.841 | ERROR | insights:pipeline:44 - got article failed, pipeline abort
2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.872 | ERROR | scrapers.general_crawler:general_crawler:101 - Client error '400 Bad Request' for url 'https://www.facebook.com/SamsungAustralia'
2024-11-18 12:04:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.873 | ERROR | insights:pipeline:44 - got article failed, pipeline abort
2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.874 | INFO | __main__:schedule_pipeline:23 - task execute loop finished, work after 3600 seconds
```

bigbrother666sh commented 1 week ago

Give this a try:

```python
scraper_map = {'mp.weixin.qq.com': mp_crawler, 'www.facebook.com': facebook_parser}
```
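A defensive variant of this suggestion (a sketch only, under the assumption that the dispatch may look up the raw netloc rather than calling the custom `get_scraper`): register both spellings of the domain so the match works whether or not `www.` has been stripped.

```python
from .mp_crawler import mp_crawler
from .facebook_parser import facebook_parser

# Register both forms of the Facebook domain so the lookup succeeds whether or
# not the caller strips 'www.' before consulting scraper_map.
scraper_map = {
    'mp.weixin.qq.com': mp_crawler,
    'facebook.com': facebook_parser,
    'www.facebook.com': facebook_parser,
}
```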