Closed l0g2 closed 2 months ago
2024-08-21 11:29:40.628 | INFO | scrapers.general_crawler:general_crawler:152 - gne extract not good: {'title': '', 'author': '', 'publish_time': '', 'content': '%PDF-...
2024-08-21 11:29:40.631 | INFO | scrapers.general_crawler:general_crawler:165 - https://....pdf content too long for llm parsing
core-1 | Traceback (most recent call last):
core-1 | File "/app/tasks.py", line 32, in <module>
core-1 | asyncio.run(main())
core-1 | File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
core-1 | return loop.run_until_complete(main)
core-1 | File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
core-1 | return future.result()
core-1 | File "/app/tasks.py", line 30, in main
core-1 | await schedule_pipeline(interval_seconds)
core-1 | File "/app/tasks.py", line 20, in schedule_pipeline
core-1 | await asyncio.gather(*[process_site(site, counter) for site in sites])
core-1 | File "/app/tasks.py", line 12, in process_site
core-1 | await pipeline(site['url'].rstrip('/'))
core-1 | File "/app/insights/__init__.py", line 31, in pipeline
core-1 | flag, result = await general_crawler(url, logger)
core-1 | File "/app/scrapers/general_crawler.py", line 208, in general_crawler
core-1 | date_str = extract_and_convert_dates(result['publish_time'])
core-1 | KeyError: 'publish_time'
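Judging from the log, the error seems to be triggered by reading a PDF.
Also, once the error occurs, the program cannot resume running on its own.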
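One possible way to keep the scheduler alive when a single site fails is to let `asyncio.gather` return exceptions instead of raising them. This is only a minimal sketch: the `process_site` body here is a hypothetical stand-in that fails on `.pdf` URLs, and `run_once` is an assumed helper name, not the project's actual code in `tasks.py`.

```python
import asyncio
import logging

logger = logging.getLogger("tasks")


async def process_site(site: dict) -> str:
    # Hypothetical stand-in for the real process_site: it simulates the
    # failure from the traceback when the URL points at a PDF.
    if site["url"].endswith(".pdf"):
        raise KeyError("publish_time")
    return site["url"]


async def run_once(sites: list[dict]) -> list:
    # return_exceptions=True makes gather hand back each exception as a
    # result, so one failing site does not abort the others or the caller.
    results = await asyncio.gather(
        *(process_site(site) for site in sites),
        return_exceptions=True,
    )
    for site, outcome in zip(sites, results):
        if isinstance(outcome, Exception):
            logger.warning("site %s failed: %r", site["url"], outcome)
    return results
```

With this shape, a surrounding scheduling loop could log the failure and continue to the next interval instead of exiting with the traceback.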
done
Line 208 of core/scrapers/general_crawler.py
raises KeyError: 'publish_time'
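A defensive lookup would avoid the `KeyError` when the extraction result has no `publish_time` entry, as appears to happen for PDF content. The helper below is a hypothetical sketch (the name `safe_publish_time` is an assumption, not part of the project); it only illustrates the `dict.get` pattern.

```python
def safe_publish_time(result: dict) -> str:
    # dict.get returns None instead of raising KeyError when the key is
    # missing; anything that is not a string is normalised to "".
    value = result.get("publish_time")
    return value if isinstance(value, str) else ""
```

The returned string could then be passed to `extract_and_convert_dates` in place of `result['publish_time']`, assuming that function tolerates an empty string.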