Python3WebSpider / ScrapeSpa2

Spider for https://spa2.scrape.center

selenium.py fails at runtime #2

Open soen0905 opened 2 years ago

soen0905 commented 2 years ago

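I get an error when I try to run the script. My guess is that something goes wrong in the generator, though it could also be a version problem (Windows 10, Python 3.8).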

def parse_index():
    elements = browser.find_elements_by_css_selector('#index .item .name')
    for element in elements:
        href = element.get_attribute('href')
        yield urljoin(INDEX_URL, href)

The error output is as follows:

2022-07-19 20:52:44,940 - INFO:scraping https://spa2.scrape.center/page/1
2022-07-19 20:52:48,127 - INFO:detail url https://spa2.scrape.center/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx
2022-07-19 20:52:48,127 - INFO:scraping https://spa2.scrape.center/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx
abcd
2022-07-19 20:52:49,963 - INFO:detail data {'url': 'https://spa2.scrape.center/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx', 'name': '霸王别姬 - Farewell My Concubine', 'categories': ['剧情', '爱情'], 'cover': 'https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c', 'score': '9.5', 'drama': '影片借一出《霸王别姬》的京戏,牵扯出三个人之间一段随时代风云变幻的爱恨情仇。段小楼(张丰毅 饰)与程蝶衣(张国荣 饰)是一对打小一起长大的师兄弟,两人一个演生,一个饰旦,一向配合天衣无缝,尤其一出《霸王别姬》,更是誉满京城,为此,两人约定合演一辈子《霸王别姬》。但两人对戏剧与人生关系的理解有本质不同,段小楼深知戏非人生,程蝶衣则是人戏不分。段小楼在认为该成家立业之时迎娶了名妓菊仙(巩俐 饰),致使程蝶衣认定菊仙是可耻的第三者,使段小楼做了叛徒,自此,三人围绕一出《霸王别姬》生出的爱恨情仇战开始随着时代风云的变迁不断升级,终酿成悲剧。'}
Traceback (most recent call last):
  File "D:/kinds_work/python_work/spider/第七章/selenium_spider/scrape_Spa2.py", line 93, in <module>
    main()
  File "D:/kinds_work/python_work/spider/第七章/selenium_spider/scrape_Spa2.py", line 81, in main
    for detail_url in detail_urls:
  File "D:/kinds_work/python_work/spider/第七章/selenium_spider/scrape_Spa2.py", line 45, in parse_index
    href = element.get_attribute('href')
  File "E:\anaconda\envs\spider\lib\site-packages\selenium\webdriver\remote\webelement.py", line 139, in get_attribute
    attributeValue = self.parent.execute_script(
  File "E:\anaconda\envs\spider\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 634, in execute_script
    return self.execute(command, {
  File "E:\anaconda\envs\spider\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "E:\anaconda\envs\spider\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=103.0.5060.114)

I then tried changing the original code to:

def parse_index():
    temp = []
    elements = browser.find_elements_by_css_selector('#index .item .name')
    for element in elements:
        href = element.get_attribute('href')
        temp.append(href)
    return temp
        # yield urljoin(INDEX_URL, href)

After this change the program runs normally, but I really can't understand why the original version fails.
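For reference, the list-based variant can keep the urljoin call from the original. This is only a sketch under assumptions: the value of INDEX_URL is guessed from the log above, the browser is passed in as a parameter instead of being a global, and the locator is written as the plain string 'css selector' (the value of By.CSS_SELECTOR in Selenium 4, where find_elements_by_css_selector is no longer available).

```python
from urllib.parse import urljoin

# Assumption: the real INDEX_URL is not shown in this thread; this value is taken from the log.
INDEX_URL = 'https://spa2.scrape.center/page/1'


def parse_index(browser):
    """Read every href right away, while the index page is still loaded."""
    # 'css selector' is the string behind By.CSS_SELECTOR in Selenium 4.
    elements = browser.find_elements('css selector', '#index .item .name')
    # Building the full list here means no element is touched after this function returns.
    return [urljoin(INDEX_URL, element.get_attribute('href')) for element in elements]
```

Because the function returns plain strings, the caller can navigate to each detail page without holding on to any element from the index page.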

I tried debugging this code. On the second iteration of the for loop, the element object passed to element.get_attribute('href') looked fine.

I hope someone more experienced can take the time to answer my question.

hefeng61 commented 1 year ago

[image] This part converts the generator into a list, but I'm not sure why this is needed. The earlier examples don't do anything like this.

soen0905 commented 1 year ago

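I haven't looked at this in ages. The answer Google gave me: maybe after list() the detail_urls are all loaded into memory, so it doesn't get stuck at this point? Or maybe it makes debugging easier? Or perhaps iterating over a generator's values with in can cause problems? [image] Whatever, just a guess. XD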
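The guess can be checked without a browser. In the traceback above, the exception is raised from `for detail_url in detail_urls` right after the first detail page was scraped, which suggests that the generator resumes only after the browser has already left the index page. Below is a minimal simulation of that timing, not Selenium itself: FakeBrowser, FakeElement, StaleElementError and crawl are hypothetical stand-ins invented for this sketch.

```python
class StaleElementError(Exception):
    """Hypothetical stand-in for Selenium's StaleElementReferenceException."""


class FakeElement:
    def __init__(self, browser, page_id, href):
        self._browser = browser
        self._page_id = page_id  # the page load this element belongs to
        self._href = href

    def get_attribute(self, name):
        # An element is only usable while the page it came from is still loaded.
        if self._browser.page_id != self._page_id:
            raise StaleElementError('element is not attached to the page document')
        return self._href if name == 'href' else None


class FakeBrowser:
    def __init__(self):
        self.page_id = 0

    def get(self, url):
        self.page_id += 1  # every navigation replaces the current document

    def find_elements(self):
        return [FakeElement(self, self.page_id, f'/detail/{i}') for i in range(3)]


def parse_index_lazy(browser):
    # Generator: each href is read only when the caller asks for the next item.
    for element in browser.find_elements():
        yield element.get_attribute('href')


def parse_index_eager(browser):
    # List: every href is read before the caller navigates anywhere.
    return [element.get_attribute('href') for element in browser.find_elements()]


def crawl(parse_index):
    browser = FakeBrowser()
    browser.get('/page/1')
    visited = []
    for href in parse_index(browser):
        browser.get(href)  # leaving the index page invalidates its elements
        visited.append(href)
    return visited
```

With these stand-ins, crawl(parse_index_eager) visits all three detail pages, while crawl(parse_index_lazy) raises StaleElementError when it asks for the second href. If the real script behaves the same way, the point of list() is not memory or debugging but reading every href before the first navigation.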