Open linopluss opened 1 week ago
Could someone explain why my site-specific parser is not being called?
Parser registration:

    from urllib.parse import urlparse
    from .mp_crawler import mp_crawler
    from .facebook_parser import facebook_parser

    def get_scraper(url):
        domain = urlparse(url).netloc.replace('www.', '')
        return scraper_map.get(domain)

    scraper_map = {'mp.weixin.qq.com': mp_crawler, 'facebook.com': facebook_parser}
Run log:

    2024-11-18 12:03:48 core-1 | 2024-11-18 01:03:48.853 | DEBUG | utils.pb_api:__init__:12 - initializing pocketbase client: http://127.0.0.1:8090
    2024-11-18 12:03:48 core-1 | 2024-11-18 01:03:48.925 | DEBUG | utils.pb_api:__init__:12 - initializing pocketbase client: http://127.0.0.1:8090
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.202 | INFO | utils.pb_api:__init__:22 - pocketbase ready authenticated as admin - linoplus.chen@gmail.com
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.222 | INFO | __main__:schedule_pipeline:19 - task execute loop 1
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.222 | INFO | __main__:process_site:11 - applying https://www.facebook.com/SamsungAustralia
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.222 | DEBUG | insights:pipeline:34 - start processing https://www.facebook.com/SamsungAustralia
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.243 | INFO | __main__:process_site:11 - applying https://www.facebook.com/LGAustralia
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.243 | DEBUG | insights:pipeline:34 - start processing https://www.facebook.com/LGAustralia
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.257 | INFO | utils.pb_api:__init__:22 - pocketbase ready authenticated as admin - linoplus.chen@gmail.com
    2024-11-18 12:03:49 core-1 | INFO: Started server process [18]
    2024-11-18 12:03:49 core-1 | INFO: Waiting for application startup.
    2024-11-18 12:03:49 core-1 | INFO: Application startup complete.
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.547 | INFO | scrapers.general_crawler:general_crawler:98 - can not reach
    2024-11-18 12:03:49 core-1 | Client error '400 Bad Request' for url 'https://www.facebook.com/SamsungAustralia'
    2024-11-18 12:03:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
    2024-11-18 12:03:49 core-1 | waiting 1min
    2024-11-18 12:03:49 core-1 | 2024-11-18 01:03:49.566 | INFO | scrapers.general_crawler:general_crawler:98 - can not reach
    2024-11-18 12:03:49 core-1 | Client error '400 Bad Request' for url 'https://www.facebook.com/LGAustralia'
    2024-11-18 12:03:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
    2024-11-18 12:03:49 core-1 | waiting 1min
    2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.837 | ERROR | scrapers.general_crawler:general_crawler:101 - Client error '400 Bad Request' for url 'https://www.facebook.com/LGAustralia'
    2024-11-18 12:04:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
    2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.841 | ERROR | insights:pipeline:44 - got article failed, pipeline abort
    2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.872 | ERROR | scrapers.general_crawler:general_crawler:101 - Client error '400 Bad Request' for url 'https://www.facebook.com/SamsungAustralia'
    2024-11-18 12:04:49 core-1 | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
    2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.873 | ERROR | insights:pipeline:44 - got article failed, pipeline abort
    2024-11-18 12:04:49 core-1 | 2024-11-18 01:04:49.874 | INFO | __main__:schedule_pipeline:23 - task execute loop finished, work after 3600 seconds
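The log shows both Facebook URLs being handled by `scrapers.general_crawler`, so one thing worth checking in isolation is whether the registration resolves those URLs at all. The sketch below is only a diagnostic under stated assumptions: the two parser functions are hypothetical stand-ins for the real modules, and `direct_lookup` is a hypothetical helper that imitates a caller indexing `scraper_map` with the raw `netloc` instead of going through `get_scraper`. Whether the pipeline actually does that is an assumption; the log does not confirm it.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the real mp_crawler / facebook_parser modules.
def mp_crawler(url):
    return "mp_crawler"

def facebook_parser(url):
    return "facebook_parser"

scraper_map = {'mp.weixin.qq.com': mp_crawler, 'facebook.com': facebook_parser}

def get_scraper(url):
    # Same logic as the registration in the issue: strip 'www.' before the lookup.
    domain = urlparse(url).netloc.replace('www.', '')
    return scraper_map.get(domain)

def direct_lookup(url):
    # Hypothetical caller that uses the raw netloc as the key (no 'www.' stripping).
    return scraper_map.get(urlparse(url).netloc)

url = 'https://www.facebook.com/SamsungAustralia'
print(get_scraper(url))    # resolves to the facebook_parser stand-in
print(direct_lookup(url))  # None, because the raw netloc is 'www.facebook.com'
```

If `get_scraper` resolves the URL but the general crawler still runs, the next place to look is how the calling code selects a scraper, since a lookup keyed on the raw `netloc` would miss a `'facebook.com'` entry.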
Give this a try:

    scraper_map = {'mp.weixin.qq.com': mp_crawler, 'www.facebook.com': facebook_parser}
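Since it is not clear from the thread whether the caller strips `www.` before the lookup, one option is a lookup that accepts either key form. This is a minimal sketch, not the project's own code: `find_scraper` is a hypothetical helper name, and the string values in the usage lines are placeholders for the real parser functions.

```python
from urllib.parse import urlparse

def find_scraper(url, scraper_map):
    """Return the scraper registered for url's host, with or without 'www.'."""
    host = (urlparse(url).hostname or '').lower()
    # Try the host exactly as given, then the variant with 'www.' toggled.
    candidates = [host]
    if host.startswith('www.'):
        candidates.append(host[len('www.'):])
    else:
        candidates.append('www.' + host)
    for key in candidates:
        if key in scraper_map:
            return scraper_map[key]
    return None

# Both registrations resolve the same URL (placeholder values instead of functions).
print(find_scraper('https://www.facebook.com/LGAustralia', {'www.facebook.com': 'facebook_parser'}))
print(find_scraper('https://www.facebook.com/LGAustralia', {'facebook.com': 'facebook_parser'}))
```

Checking the prefix with `startswith` also avoids a side effect of `str.replace('www.', '')`, which removes the substring wherever it occurs in the host, not only at the start.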