LiuXingMing / SinaSpider

Sina Weibo crawler (Scrapy, Redis)
3.27k stars 1.52k forks

A suggestion for obtaining weibo.cn cookies #51

Open LichMscy opened 7 years ago

LichMscy commented 7 years ago

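Actually, the CAPTCHA step can be skipped. Inspired by the author's approach, you can first log in through weibo.com's CAPTCHA-free entry (marking your location as a frequent login location under Weibo account security settings exempts you from the CAPTCHA), then open weibo.cn directly in the same PhantomJS session. weibo.cn will already be in a logged-in state, and you can simply read the cookies at that point.

I have implemented this myself. The code is below, for reference only: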

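# Imports needed by the two snippets below.
import logging
import time

from selenium import webdriver


def init_phantomjs_driver():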
    headers = {
        'Cookie': 'YF-Ugrow-G0=b02489d329584fca03ad6347fc915997; SUB=_2AkMvgPj2dcPxrAFYnPgWyGvkZYpH-jycVZEAAn7uJhMyOhgv7nBSqSVOKynW2PbhU4768kfRGZgNPwXeRA..; SUBP=0033WrSXqPxfM72wWs9jqgMF55529P9D9WWEFXHsNpvgJdQjr1GM.e765JpVF020SKM7e0571hMc',  # weibo.com cookie from a logged-out session
    }
    for key, value in headers.items():
        webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.{}'.format(key)] = value
    useragent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36'
    webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.settings.userAgent'] = useragent

    # Local path to the PhantomJS executable (placeholder, fill in your own)
    driver = webdriver.PhantomJS(executable_path='xxxxxxx path to phantomjs xxxxxxx')
    driver.set_window_size(1366, 768)
    return driver
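

# The original snippet omits the enclosing function definition; the name
# get_weibo_cn_cookies is a placeholder. weibo_auto_handle is presumably the
# module that holds init_phantomjs_driver() above, and account is a dict
# with 'usn' (username) and 'pwd' (password) keys.
def get_weibo_cn_cookies(account):
    browser = weibo_auto_handle.init_phantomjs_driver()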
    browser.get("http://weibo.com")
    time.sleep(3)
    failure = 0
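    # The title below is the one this snippet treats as the logged-out
    # weibo.com home page; retry the login form up to 5 times while it shows.
    while "微博-随时随地发现新鲜事" == browser.title and failure < 5: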
        failure += 1
        username = browser.find_element_by_name("username")
        pwd = browser.find_element_by_name("password")
        login_submit = browser.find_element_by_class_name('W_btn_a')
        username.clear()
        username.send_keys(account['usn'])
        pwd.clear()
        pwd.send_keys(account['pwd'])
        login_submit.click()
        time.sleep(5)

        # if browser.find_element_by_class_name('verify').is_displayed():
        #     logging.error("Verify code is needed! (Account: %s)" % account)

    # "我的首页" ("My Home") in the title is taken to mean the login succeeded.
    if "我的首页 微博-随时随地发现新鲜事" in browser.title:
        browser.get('http://weibo.cn/')
        cookie = dict()
        if "我的首页" in browser.title:
            for elem in browser.get_cookies():
                cookie[elem["name"]] = elem["value"]
        # p2 = persist_iics.Persist()
        # p2.save_account_cookies(accounts[0][0], cookie, datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
        logging.error('Account cookies updated! (Account_id: %s)' % account['usn'])
        return cookie
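
The cookie-flattening loop at the end can be tried without a browser. Below is a minimal sketch, assuming input shaped like the list of records that Selenium's `get_cookies()` returns. The helper names `cookies_to_dict` and `cookies_to_header` and the sample values are made up for illustration and are not part of the code above.

```python
def cookies_to_dict(selenium_cookies):
    """Flatten a list of cookie records into a name -> value dict."""
    return {c["name"]: c["value"] for c in selenium_cookies}


def cookies_to_header(cookie_dict):
    """Serialize a name -> value dict into a Cookie request-header value."""
    return "; ".join("{}={}".format(k, v) for k, v in cookie_dict.items())


# Sample input shaped like the return value of driver.get_cookies();
# the names and values here are placeholders, not real Weibo cookies.
sample = [
    {"name": "SUB", "value": "abc", "domain": ".weibo.cn"},
    {"name": "SUBP", "value": "def", "domain": ".weibo.cn"},
]

cookie = cookies_to_dict(sample)
header = cookies_to_header(cookie)
```

The resulting dict can be passed as the `cookies=` argument of `scrapy.Request` or `requests.get`, both of which accept a plain name-to-value mapping.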
LiuXingMing commented 7 years ago

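Yes, nice idea. It works for small jobs. But for large crawl volumes many accounts have to log in, and it is not feasible to configure each one by hand. Also, Weibo limits requests per IP, so fast crawling needs proxies, and this approach does not fit that case either.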
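
To make the scaling concern concrete, here is a rough sketch of the bookkeeping involved when many accounts share a smaller set of proxies. It does not remove the manual "frequent login location" setup described above. The function name `pair_accounts_with_proxies`, the account records, and the proxy URLs are all hypothetical.

```python
import itertools


def pair_accounts_with_proxies(accounts, proxies):
    """Assign each account a proxy, cycling through proxies if there are fewer."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    proxy_cycle = itertools.cycle(proxies)
    return [(account, next(proxy_cycle)) for account in accounts]


# Placeholder accounts and proxies, for illustration only.
accounts = [
    {"usn": "user1", "pwd": "***"},
    {"usn": "user2", "pwd": "***"},
    {"usn": "user3", "pwd": "***"},
]
proxies = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

assignments = pair_accounts_with_proxies(accounts, proxies)
for account, proxy in assignments:
    print(account["usn"], "->", proxy)
```

Each account would still need its login location whitelisted from the IP it logs in through, which is the manual step that does not scale.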