Prerequisites

[X] Write a descriptive title.
[X] Make sure you are able to repro it on the latest released version
[X] Search the existing issues, especially the pinned issues.

Exception report

PS E:\Googledownload\WeiboPostAndCommentCrawl-master>  e:; cd 'e:\Googledownload\WeiboPostAndCommentCrawl-master'; & 'D:\conda\envs\xdl\python.exe' 'c:\Users\24170\.vscode\extensions\ms-python.python-2023.6.0\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher' '60541' '--' 'e:\Googledownload\WeiboPostAndCommentCrawl-master\Spider.py
哎呀，出现问题。请在报告此 Bug 时添加以下详细信息。
在 GitHub 上报告: https://github.com/lzybkr/PSReadLine/issues/new
-----------------------------------------------------------------------
上 200 个密钥:
 ' c : \ U s e r s \ 2 4 1 7 0 \ . v s c o d e \ e x t e n s i o n s \ m s - p y t h o n . p y t h o n - 2 0 2 3 . 6 . 0 \ p y t h o n F i l e s \ l i b \ p y t h o n \ d e b u g p y \ a d a p t e r / . . / . . \ d e b u g p y \ l a u n c h e r ' Space ' 6 0 5 4 1 ' Space ' - - ' Space ' e : \ G o o g l e d o w n l o a d \ W e i b o P o s t A n d C o m m e n t C r a w l - m a s t e r \ S p i d e r . p y ' Space Enter

异常:
System.ArgumentOutOfRangeException: 该值必须大于或等于零，且必须小于控制台缓冲区在该维度的大小。
参数名: top
实际值是 -1。
   在 System.Console.SetCursorPosition(Int32 left, Int32 top)
   在 Microsoft.PowerShell.PSConsoleReadLine.ReallyRender(RenderData renderData, String defaultColor)
   在 Microsoft.PowerShell.PSConsoleReadLine.ForceRender()
   在 Microsoft.PowerShell.PSConsoleReadLine.Insert(Char c)
   在 Microsoft.PowerShell.PSConsoleReadLine.SelfInsert(Nullable`1 key, Object arg)
   在 Microsoft.PowerShell.PSConsoleReadLine.ProcessOneKey(ConsoleKeyInfo key, Dictionary`2 dispatchTable, Boolean ignoreIfNoAction, Object arg)
   在 Microsoft.PowerShell.PSConsoleReadLine.InputLoop()
   在 Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics)
-----------------------------------------------------------------------
PS E:\Googledownload\WeiboPostAndCommentCrawl-master>  e:; cd 'e:\Googledownload\WeiboPostAndCommentCrawl-master'; & 'D:\conda\envs\xdl\python.exe' 'c:\Users\24170\.vscode\extensions\ms-python.python-2023.6.0\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher' '60541' '--' 'e:\Googledownload\WeiboPostAndCommentCrawl-master\Spider.py'
哎呀，出现问题。请在报告此 Bug 时添加以下详细信息。
在 GitHub 上报告: https://github.com/lzybkr/PSReadLine/issues/new
-----------------------------------------------------------------------
上 200 个密钥:
 ' c : \ U s e r s \ 2 4 1 7 0 \ . v s c o d e \ e x t e n s i o n s \ m s - p y t h o n . p y t h o n - 2 0 2 3 . 6 . 0 \ p y t h o n F i l e s \ l i b \ p y t h o n \ d e b u g p y \ a d a p t e r / . . / . . \ d e b u g p y \ l a u n c h e r ' Space ' 6 0 5 4 1 ' Space ' - - ' Space ' e : \ G o o g l e d o w n l o a d \ W e i b o P o s t A n d C o m m e n t C r a w l - m a s t e r \ S p i d e r . p y ' Space Enter

异常:
System.ArgumentOutOfRangeException: 该值必须大于或等于零，且必须小于控制台缓冲区在该维度的大小。
参数名: top
实际值是 -1。
   在 System.Console.SetCursorPosition(Int32 left, Int32 top)
   在 Microsoft.PowerShell.PSConsoleReadLine.ReallyRender(RenderData renderData, String defaultColor)
   在 Microsoft.PowerShell.PSConsoleReadLine.ForceRender()
   在 Microsoft.PowerShell.PSConsoleReadLine.Insert(Char c)
   在 Microsoft.PowerShell.PSConsoleReadLine.SelfInsert(Nullable`1 key, Object arg)
   在 Microsoft.PowerShell.PSConsoleReadLine.ProcessOneKey(ConsoleKeyInfo key, Dictionary`2 dispatchTable, Boolean ignoreIfNoAction, Object arg)
   在 Microsoft.PowerShell.PSConsoleReadLine.InputLoop()
   在 Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics)
-----------------------------------------------------------------------

Screenshot

Environment data

win10

Steps to reproduce

from utils import write_source, write_tweet from selenium import webdriver from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By from selenium.webdriver.chrome.options import Options from selenium.webdriver import ChromeOptions import time from datetime import datetime import re import os import json

tweet_num_limit = 1 # 一次提取的帖子url的数量 comment_num_limit = 600 # 截取的一级评论限制数量 second_comment_num_limit = 600

driver_path = 'C:\Program Files\Google\Chrome\Application\chromedriver.exe'

wb_login_url = 'https://weibo.com/login.php'

content_xpath = '//div[@class="detail_wbtext_4CRf9"]' # 帖子正文 tweet_time_xpath = '//a[@class="head-info_time_6sFQg"]' # 帖子发布时间 comment_num_xpath = '//div[@class="woo-box-item-flex toolbar_item_1ky_D toolbar_cursor_34j5V"]/div/span[@class="toolbar_num_JXZul"]' # 评论数量 comment_list_bottom_xpath = '//div[@class="Bottom_text_1kFLe"]' # 评论列表页面最底部 popup_comment_list_bottom_xpath = '//div[@class="woo-box-flex woo-box-alignCenter woo-box-justifyCenter woo-modal-wrap ReplyModal_wrap_2j1bg"]//div[@class="Bottom_text_1kFLe"]' # 弹窗评论列表页面最底部 close_popup_button_xpath = '//div[@class="woo-box-flex woo-box-alignCenter woo-box-justifyCenter woo-modal-wrap ReplyModal_wrap_2j1bg"]//div[@class="wbpro-layer-tit-opt woo-box-flex woo-box-alignCenter woo-box-justifyCenter"]/i'

tweet_user_xpath = '//a[@class="ALink_default_2ibt1 head_cut_2Zcft head_name_24eEB"]' # 帖子发布用户id

def login(): option = ChromeOptions() option.add_experimental_option('excludeSwitches', ['enable-automation']) driver = webdriver.Chrome(driver_path, options=option) driver.maximize_window() driver.get(wb_login_url) print('正在打开微博登录页面......')

WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, 'loginname')))
print("成功打开微博登录页面")

input("手动点击登录按钮并验证以后输入任意字符继续：")

print("成功登录")

return driver

获取帖子文本和时间，返回dict

def get_tweet_content(driver): tweet_text = driver.find_element_by_xpath(content_xpath).get_attribute('innerText') tweet_text = tweet_text.replace('
', '\n').replace('\u200b', '').replace('\u2006', ' ').replace( '\u0026\u0071\u0075\u006f\u0074\u003b', '\"').replace('\u0026\u006e\u0062\u0073\u0070\u003b', ' ').replace( '\u0026\u0061\u006d\u0070\u003b', '&') tweet_time = driver.find_element_by_xpath(tweet_time_xpath).get_attribute('innerText')

tweet_dict = {
    'content': tweet_text,
    'time': tweet_time,
    'user id': '',
    'tweet id': ''
}

return tweet_dict

从帖子url中截取用户和帖子的id，返回两个str

def get_user_tweet_id(tweet_url): split_list = tweet_url.split('/')

user_id = split_list[-2]
tweet_id = split_list[-1]

return user_id, tweet_id

def get_tweet_comment(driver): comment_num = driver.find_element_by_xpath(comment_num_xpath).get_attribute('innerText').strip() tip_xpath = '//span[@class="woo-tip-text"]' tip = driver.find_elements_by_xpath(tip_xpath) first = find_comment_view(driver, 0) if comment_num == '评论' or comment_num == 0 or len(first) == 0: return [] if len(tip) != 0 and tip[-1].get_attribute('innerText').strip() == '暂无评论，发表你的评论或看看推荐吧': return []

useful_comment_num = scroll_to_show_enough_comment(driver)
scrolling(driver, 1000)

scrolling_location = 200
comment_index = 0
comment_list = []
for i in range(useful_comment_num):
    comment_view = find_active_comment_view(driver, i)
    if len(comment_view) == 0:
        while 1:
            scrolling(driver, scrolling_location)
            scrolling_location += 200

            find_comment = find_active_comment_view(driver, i)
            if len(find_comment) != 0:
                break

            time.sleep(0.5)

    root_comment = {'comment id': comment_index, 'parent': -1, 'children': []}
    comment_index += 1

    # if i == 10:
    #     write_source(driver)
    #     time.sleep(100)
    root_view_xpath = f'//div[@class="RepostCommentList_mar1_3VHkS"]//div[@class="vue-recycle-scroller__item-view"]/div[@data-index="{i}" and @data-active="true"]'
    root_view = driver.find_element_by_xpath(root_view_xpath)

    # 爬取评论用户id
    user_id_xpath = './/div[@class="item1"]/div[@class="item1in woo-box-flex"]//div[@class="text"]/a[@class="ALink_default_2ibt1"]'
    user_id = root_view.find_element_by_xpath(user_id_xpath)
    root_comment['user id'] = user_id.get_attribute('to').split('/')[-1]
    root_comment['user name'] = user_id.get_attribute('innerText')

    # 爬取评论内容
    content_xpath = './/div[@class="item1"]/div[@class="item1in woo-box-flex"]//div[@class="text"]/span'
    content = root_view.find_element_by_xpath(content_xpath).get_attribute('innerText')
    root_comment['content'] = content

    # 爬取评论时间
    comment_time_xpath = './/div[@class="item1"]/div[@class="item1in woo-box-flex"]//div[@class="info woo-box-flex woo-box-alignCenter woo-box-justifyBetween"]/div[1]'
    comment_time = root_view.find_element_by_xpath(comment_time_xpath).get_attribute('innerText').strip()
    root_comment['time'] = ' '.join(comment_time.split()[:2])

    comment_list.append(root_comment)

    # 评论存在的所有情况：
    # 1. 单个一级评论
    # 2. 有二级评论不需要展开
    # 3. 有二级评论需要展开-评论数量未超过限制
    # 4. 有二级评论需要展开-评论数量超过限制
    second_comment_xpath = root_view_xpath + '//div[@class="item1"]//div[@class="item2"]'
    second_comment = driver.find_elements_by_xpath(second_comment_xpath)
    if len(second_comment) != 0:  # 有二级评论，对应234

        another_unfold_exist = False  # 应付【xxx等人 共x条回复】的情况
        if len(second_comment) == 1:
            find = second_comment[0].find_elements_by_xpath('./div/div/span')
            if len(find) != 0 and re.match('共.*?条回复', find[0].get_attribute('innerText').strip()) != None:
                another_unfold_exist = True

        comment_unfold_xpath = root_view_xpath + '//div[@class="item1"]//div[@class="item2"]/div[@class="text"]/a'
        comment_unfold = driver.find_elements_by_xpath(comment_unfold_xpath)  # 搜索看看有没有折叠的评论
        if len(comment_unfold) == 0 and not another_unfold_exist:  # 对应2
            button1 = second_comment[0].find_element_by_xpath('./div/div/a')
            not_delete_only_one = re.match('共.*?条回复', button1.get_attribute(
                'innerText').strip()) == None  # 这是为了应付用户回复了1条评论又删除掉的情况
            if not_delete_only_one:
                comment_2_num = len(second_comment)
                comment_2_list = []

                for i in range(comment_2_num):
                    comment_2_list.append({'comment id': comment_index, 'parent': -2, 'children': []})
                    comment_index += 1

                for i, item in enumerate(second_comment):
                    user_id_xpath = './div/div/a[@class="ALink_default_2ibt1"]'
                    user_id = item.find_element_by_xpath(user_id_xpath)
                    comment_2_list[i]['user id'] = user_id.get_attribute('to').split('/')[-1]
                    comment_2_list[i]['user name'] = user_id.get_attribute('innerText')

                    content_xpath = './div/div/span'
                    content = item.find_element_by_xpath(content_xpath)
                    comment_2_list[i]['content'] = content.get_attribute('innerText')

                    comment_time_xpath = './div/div[@class="info woo-box-flex woo-box-alignCenter woo-box-justifyBetween"]/div[1]'
                    comment_time = item.find_element_by_xpath(comment_time_xpath).get_attribute('innerText').strip()
                    comment_2_list[i]['time'] = ' '.join(comment_time.split()[:2])

                    is_reply = content.find_elements_by_xpath('./a')
                    if len(second_comment) == 2 and len(is_reply) != 0:
                        child_index = i
                        parent_index = 1 if child_index == 0 else 0
                        comment_2_list[child_index]['parent'] = comment_2_list[parent_index]['comment id']
                        comment_2_list[parent_index]['children'].append(comment_2_list[child_index]['comment id'])
                    else:
                        comment_2_list[i]['parent'] = root_comment['comment id']
                        root_comment['children'].append(comment_2_list[i]['comment id'])

                comment_list += comment_2_list
        else:  # 对应34
            unfold_button = second_comment[0].find_element_by_xpath('./div/div/span') if another_unfold_exist else \
                comment_unfold[0]
            comment_unfold_num = int(unfold_button.get_attribute('innerText').strip()[1:-3])
            if comment_unfold_num <= second_comment_num_limit:  # 对应3
                driver.execute_script("arguments[0].click();", unfold_button)
                WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_xpath(
                    '//div[@class="list2"]/div[@class="Scroll_container_280Ky"]/div[@class="Scroll_wrap_ObsGW"]//div[@data-index="0"]'))
                time.sleep(0.3)

                useful_second_comment_num = scroll_popup_to_show_all_comment(driver)
                scrolling_popup(driver, 0)

                second_scrolling_location = 200
                comment_2_list = []
                for i in range(useful_second_comment_num):
                    second_comment_view = find_active_second_comment_view(driver, i)
                    if len(second_comment_view) == 0:
                        while 1:
                            scrolling_popup(driver, second_scrolling_location)
                            second_scrolling_location += 200

                            find_comment = find_active_second_comment_view(driver, i)
                            if len(find_comment) != 0:
                                break

                            time.sleep(0.5)

                    comment = {'comment id': comment_index, 'parent': -1, 'children': []}
                    comment_index += 1

                    comment_view_xpath = f'//div[@class="woo-box-flex woo-box-alignCenter woo-box-justifyCenter woo-modal-wrap ReplyModal_wrap_2j1bg"]//div[@data-index="{i}"]'
                    comment_view = driver.find_element_by_xpath(comment_view_xpath)

                    # 爬取评论用户id
                    user_id_xpath = './div/div/div/a[@class="ALink_default_2ibt1"]'
                    user_id = comment_view.find_element_by_xpath(user_id_xpath)
                    comment['user id'] = user_id.get_attribute('to').split('/')[-1]
                    comment['user name'] = user_id.get_attribute('innerText')

                    # 爬取评论内容
                    content_xpath = './div/div/div/span'
                    content = comment_view.find_elements_by_xpath(
                        content_xpath)  # 这里使用find_elements是因为微博认证图标也是一个span
                    comment['content'] = content[1].get_attribute('innerText') if len(content) != 1 else content[
                        0].get_attribute('innerText')

                    # 爬取评论时间
                    comment_time_xpath = './div/div/div[@class="info woo-box-flex woo-box-alignCenter woo-box-justifyBetween"]/div'
                    comment_time = comment_view.find_element_by_xpath(comment_time_xpath).get_attribute(
                        'innerText').strip()
                    comment['time'] = ' '.join(comment_time.split()[:2])

                    comment_2_list.append(comment)

                close_popup_button = driver.find_element_by_xpath(close_popup_button_xpath)
                driver.execute_script("arguments[0].click();", close_popup_button)

                comment_2_list = sorted(comment_2_list, reverse=True,
                                        key=lambda x: datetime.strptime(x['time'], "%y-%m-%d %H:%M"))

                for i, item in enumerate(comment_2_list):
                    match_res = re.match('回复@(.*?):', item['content'])
                    if match_res == None:  # 说明是直接回复在一级评论上的
                        item['parent'] = root_comment['comment id']
                        root_comment['children'].append(item['comment id'])
                    else:
                        reply_user_name = match_res.group(1)
                        finded = False
                        for j in range(i + 1, len(comment_2_list)):
                            if comment_2_list[j]['user name'] == reply_user_name:
                                item['parent'] = comment_2_list[j]['comment id']
                                comment_2_list[j]['children'].append(item['comment id'])
                                finded = True
                                break
                        if not finded:
                            item['parent'] = root_comment['comment id']
                            root_comment['children'].append(item['comment id'])

                comment_list += comment_2_list
return comment_list

下拉滚动条，加载多条评论

def scroll_to_show_enough_comment(driver):

useful_comment_num = 0

check_num = 0.1

check_time = 0.1

while 1:

scrolling(driver, 100000)

bottom_comment = find_comment_view(driver, comment_num_limit - 1)

comment_list_bottom = driver.find_elements_by_xpath(comment_list_bottom_xpath)

if len(bottom_comment) != 0:

useful_comment_num = comment_num_limit

break

if len(comment_list_bottom) != 0:

for i in range(comment_num_limit - 1, -1, -1):

comment_view = find_comment_view(driver, i)

if len(comment_view) != 0:

useful_comment_num = i + 1

break

time.sleep(1)

check_num += 1

if check_num == 10:

check_time += 1

if check_miss(driver):

useful_comment_num = comment_num_limit

break

check_num = 0

if check_time > 4:

useful_comment_num = 0

break

把滚动条再拉回顶部，否则评论列表到达最底端的提示显示不出来（一直卡在刷新状态）

scrolling(driver, 0)

time.sleep(2)

return useful_comment_num

def scroll_to_show_enough_comment(driver, max_retry=5): useful_comment_num = 0 retry_times = 0

while retry_times <= max_retry:
    scrolling(driver, 100000)

    bottom_comment = find_comment_view(driver, comment_num_limit - 1)
    comment_list_bottom = driver.find_elements_by_xpath(comment_list_bottom_xpath)

    if len(bottom_comment) != 0:
        useful_comment_num = comment_num_limit
        break

    if len(comment_list_bottom) != 0:
        for i in range(comment_num_limit - 1, -1, -1):
            comment_view = find_comment_view(driver, i)
            if len(comment_view) != 0:
                useful_comment_num = i + 1
                break
        break

    time.sleep(2)
    retry_times += 1

# 把滚动条再拉回顶部，否则评论列表到达最底端的提示显示不出来（一直卡在刷新状态）
scrolling(driver, 0)

return useful_comment_num

def check_miss(driver): comment_view_xpath = f'//div[@class="RepostCommentList_mar1_3VHkS"]//div[@class="vue-recycle-scroller__item-view"]/div[@data-active="true"]' print('check num', int(driver.find_elements_by_xpath(comment_view_xpath)[0].get_attribute("data-index"))) return int(driver.find_elements_by_xpath(comment_view_xpath)[0].get_attribute("data-index")) > comment_num_limit

下拉滚动条，加载弹窗内所有评论

def scroll_popup_to_show_all_comment(driver): useful_comment_num = 0

while 1:
    scrolling_popup(driver, 100000)

    popup_comment_list_bottom = driver.find_elements_by_xpath(popup_comment_list_bottom_xpath)

    if len(popup_comment_list_bottom) != 0:
        for i in range(second_comment_num_limit - 1, -1, -1):
            comment_view = find_second_comment_view(driver, i)
            if len(comment_view) != 0:
                useful_comment_num = i + 1
                break
        break

    # 把滚动条再拉回顶部，否则评论列表到达最底端的提示显示不出来（一直卡在刷新状态）
    scrolling_popup(driver, 0)

    time.sleep(0.5)

return useful_comment_num

下拉滚动条，加载多个帖子

def scroll_to_show_enough_tweet(driver): while 1: scrolling(driver, 100000)

    bottom_comment = find_tweet_view(driver, tweet_num_limit - 1)
    if len(bottom_comment) != 0:
        break

    # 把滚动条再拉回顶部，否则评论列表到达最底端的提示显示不出来（一直卡在刷新状态）
    scrolling(driver, 0)

    time.sleep(0.5)

def scrolling(driver, location): js = f"var q=document.documentElement.scrollTop={location}" driver.execute_script(js)

def scrolling_popup(driver, location): scroll = driver.find_element_by_xpath('//div[@class="ReplyModal_scroll3_2kADQ"]') driver.execute_script(f'arguments[0].scrollTop={location}', scroll)

def find_comment_view(driver, data_index): comment_view_xpath = f'//div[@class="RepostCommentList_mar1_3VHkS"]//div[@class="vue-recycle-scroller__item-view"]/div[@data-index="{data_index}"]' return driver.find_elements_by_xpath(comment_view_xpath)

def find_active_comment_view(driver, data_index): comment_view_xpath = f'//div[@class="RepostCommentList_mar1_3VHkS"]//div[@class="vue-recycle-scroller__item-view"]/div[@data-active="true" and @data-index="{data_index}"]' return driver.find_elements_by_xpath(comment_view_xpath)

def find_second_comment_view(driver, data_index): comment_view_xpath = f'//div[@class="woo-box-flex woo-box-alignCenter woo-box-justifyCenter woo-modal-wrap ReplyModal_wrap_2j1bg"]//div[@data-index="{data_index}"]' return driver.find_elements_by_xpath(comment_view_xpath)

def find_active_second_comment_view(driver, data_index): comment_view_xpath = f'//div[@class="woo-box-flex woo-box-alignCenter woo-box-justifyCenter woo-modal-wrap ReplyModal_wrap_2j1bg"]//div[@data-index="{data_index}" and @data-active="true"]' return driver.find_elements_by_xpath(comment_view_xpath)

def find_tweet_view(driver, data_index): tweet_view_xpath = f'//div[@class="vue-recycle-scroller__item-view"]/div[@data-index="{data_index}"]' # 跟评论的xpath一样 return driver.find_elements_by_xpath(tweet_view_xpath)

def find_active_tweet_view(driver, data_index): tweet_view_xpath = f'//div[@class="vue-recycle-scroller__item-view"]/div[@data-index="{data_index}" and @data-active="true"]' # 跟评论的xpath一样 return driver.find_elements_by_xpath(tweet_view_xpath)

def check_second_display(comment_view): return comment_view.find_element_by_xpath('./..').get_attribute( 'style') != 'transform: translateY(-9999px); z-index: -1;'

tweet_url是一个帖子的内容网页，比如https://www.weibo.com/1883881851/LkJwDqkmO

def crawl_tweet(driver, tweet_url):

driver.get(tweet_url)

WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_xpath(content_xpath))

time.sleep(0.3)

tweet_dict = get_tweet_content(driver)

user_id, tweet_id = get_user_tweet_id(tweet_url)

tweet_dict['user id'] = user_id

tweet_dict['tweet id'] = tweet_id

comment_list = get_tweet_comment(driver)

if comment_list == []:

return None

tweet = {

'source': tweet_dict,

'comment': comment_list

}

将 tweet 保存到文件

dirpath = os.path.join('.', 'Data', theme)

if not os.path.exists(dirpath):

os.makedirs(dirpath)

filepath = os.path.join(dirpath, f'{tweet["source"]["tweet id"]}.json')

with open(filepath, 'w', encoding='utf-8') as f:

json.dump(tweet, f, ensure_ascii=False)

return tweet

def crawl_tweet(driver, tweet_url): driver.get(tweet_url) WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_xpath(content_xpath)) time.sleep(0.3)

tweet_dict = get_tweet_content(driver)
user_id, tweet_id = get_user_tweet_id(tweet_url)
tweet_dict['user id'] = user_id
tweet_dict['tweet id'] = tweet_id

comment_list = get_tweet_comment(driver)

if comment_list == []:
    return None

tweet = {
    'source': tweet_dict,
    'comment': comment_list
}
return tweet

def crawl_theme(driver, crawl_num, theme, theme_url): dirpath = os.path.join('.', 'Data', theme) if not os.path.exists(dirpath): os.makedirs(dirpath)

count_num = 0

while count_num < crawl_num:
    driver.get(theme_url)
    WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_xpath(
        '//div[@class="vue-recycle-scroller__item-view"]/div[@data-index="0"]'))

    tweet_url_list = []
    scrolling_location = 200

    for i in range(tweet_num_limit):
        tweet_view = find_active_tweet_view(driver, i)
        if len(tweet_view) == 0:
            while 1:
                scrolling(driver, scrolling_location)
                scrolling_location += 200

                find_tweet = find_active_tweet_view(driver, i)
                if len(find_tweet) != 0:
                    break

                time.sleep(0.5)

        tweet_url_xpath = f'//div[@class="vue-recycle-scroller__item-view"]/div[@data-index="{i}"]//a[@class="head-info_time_6sFQg"]'
        tweet_url = driver.find_element_by_xpath(tweet_url_xpath)
        tweet_url_list.append(tweet_url.get_attribute('href'))

    for tweet_url in tweet_url_list:
        count_num += 1
        print(count_num, tweet_url)
        tweet = crawl_tweet(driver, tweet_url)
        if tweet != None:
            filepath = os.path.join(dirpath, f'{tweet["source"]["tweet id"]}.json')
            write_tweet(tweet, filepath)

if name == 'main': driver = login() crawl_num = 10000 theme = 'No Theme Long' theme_url = 'https://s.weibo.com/weibo?q=%23%E6%B5%B7%E5%A4%A9%E5%91%B3%E4%B8%9A%E5%9B%9E%E5%BA%94%E9%85%B1%E6%B2%B9%E6%B7%BB%E5%8A%A0%E5%89%82%E4%BA%89%E8%AE%AE%23' crawl_theme(driver, crawl_num, theme, theme_url)

crawl_tweet(driver, test_url)

# while 1:
#     try:
#         crawl_num = 10
#         theme = 'No Theme Long'
#         theme_url = 'https://weibo.com/2783062910/M8bhfq4T6'
#         crawl_theme(driver, crawl_num, theme, theme_url)
#     except Exception:
#         continue

Expected behavior

make the comment about weibo and write

Actual behavior

error

PowerShell / PSReadLine

Have a bug ,I don't solution #3646

Prerequisites

Exception report

Screenshot

Environment data

Steps to reproduce

tweet_user_xpath = '//a[@class="ALink_default_2ibt1 head_cut_2Zcft head_name_24eEB"]' # 帖子发布用户id

获取帖子文本和时间，返回dict

从帖子url中截取用户和帖子的id，返回两个str

下拉滚动条，加载多条评论

def scroll_to_show_enough_comment(driver):

useful_comment_num = 0

check_num = 0.1

check_time = 0.1

while 1:

scrolling(driver, 100000)

bottom_comment = find_comment_view(driver, comment_num_limit - 1)

comment_list_bottom = driver.find_elements_by_xpath(comment_list_bottom_xpath)

if len(bottom_comment) != 0:

useful_comment_num = comment_num_limit

break

if len(comment_list_bottom) != 0:

for i in range(comment_num_limit - 1, -1, -1):

comment_view = find_comment_view(driver, i)

if len(comment_view) != 0:

useful_comment_num = i + 1

break

break

time.sleep(1)

check_num += 1

if check_num == 10:

check_time += 1

if check_miss(driver):

useful_comment_num = comment_num_limit

break

check_num = 0

if check_time > 4:

useful_comment_num = 0

break

把滚动条再拉回顶部，否则评论列表到达最底端的提示显示不出来（一直卡在刷新状态）

scrolling(driver, 0)

time.sleep(2)

return useful_comment_num

下拉滚动条，加载弹窗内所有评论

下拉滚动条，加载多个帖子

tweet_url是一个帖子的内容网页，比如https://www.weibo.com/1883881851/LkJwDqkmO

def crawl_tweet(driver, tweet_url):

driver.get(tweet_url)

WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_xpath(content_xpath))

time.sleep(0.3)

tweet_dict = get_tweet_content(driver)

user_id, tweet_id = get_user_tweet_id(tweet_url)

tweet_dict['user id'] = user_id

tweet_dict['tweet id'] = tweet_id

comment_list = get_tweet_comment(driver)

if comment_list == []:

return None

tweet = {

'source': tweet_dict,

'comment': comment_list

}

将 tweet 保存到文件

dirpath = os.path.join('.', 'Data', theme)

if not os.path.exists(dirpath):

os.makedirs(dirpath)

filepath = os.path.join(dirpath, f'{tweet["source"]["tweet id"]}.json')

with open(filepath, 'w', encoding='utf-8') as f:

json.dump(tweet, f, ensure_ascii=False)

return tweet

crawl_tweet(driver, test_url)

Expected behavior

Actual behavior