l0g2 commented 2 months ago

date_str = extract_and_convert_dates(result['publish_time'])

This line raises KeyError: 'publish_time'.

l0g2 commented 2 months ago

2024-08-21 11:29:40.628 | INFO     | scrapers.general_crawler:general_crawler:152 - gne extract not good: {'title': '', 'author': '', 'publish_time': '', 'content': '%PDF-...
2024-08-21 11:29:40.631 | INFO     | scrapers.general_crawler:general_crawler:165 - https://....pdf content too long for llm parsing
core-1  | Traceback (most recent call last):
core-1  |   File "/app/tasks.py", line 32, in <module>
core-1  |     asyncio.run(main())
core-1  |   File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
core-1  |     return loop.run_until_complete(main)
core-1  |   File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
core-1  |     return future.result()
core-1  |   File "/app/tasks.py", line 30, in main
core-1  |     await schedule_pipeline(interval_seconds)
core-1  |   File "/app/tasks.py", line 20, in schedule_pipeline
core-1  |     await asyncio.gather(*[process_site(site, counter) for site in sites])
core-1  |   File "/app/tasks.py", line 12, in process_site
core-1  |     await pipeline(site['url'].rstrip('/'))
core-1  |   File "/app/insights/__init__.py", line 31, in pipeline
core-1  |     flag, result = await general_crawler(url, logger)
core-1  |   File "/app/scrapers/general_crawler.py", line 208, in general_crawler
core-1  |     date_str = extract_and_convert_dates(result['publish_time'])
core-1  | KeyError: 'publish_time'

Judging from the log, the error seems to be triggered by reading a PDF.
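
The traceback shows that the `result` dict reaching line 208 has no `publish_time` key. A minimal sketch of a defensive lookup follows, assuming a missing date can be treated as an empty string. `safe_publish_time` is a hypothetical helper for illustration, not part of wiseflow.

```python
def safe_publish_time(result):
    """Return the raw publish_time string, or '' if it is unavailable.

    Hypothetical helper: covers a missing key, a non-dict result,
    and a non-string value, all of which would otherwise fail later.
    """
    if not isinstance(result, dict):
        return ""
    value = result.get("publish_time", "")
    return value if isinstance(value, str) else ""


if __name__ == "__main__":
    # A result shaped like the failing PDF case: no 'publish_time' key.
    pdf_like = {"title": "", "author": "", "content": "%PDF-..."}
    print(repr(safe_publish_time(pdf_like)))
    print(repr(safe_publish_time({"publish_time": "2024-08-21"})))
```

The call at line 208 could then pass `safe_publish_time(result)` instead of `result['publish_time']`, provided `extract_and_convert_dates` tolerates an empty string, which I have not verified.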

l0g2 commented 2 months ago

Also, once the error occurs, the program does not recover and resume running on its own.

bigbrother666sh commented 2 months ago

88

done

TeamWiseFlow / wiseflow

general_crawler.py line 208 raises KeyError: 'publish_time' #73