GZhangjl / zhihu_spider

This is a Zhihu crawler. It exercises several areas of Scrapy, includes a few countermeasures against anti-crawler defenses, and finally loads the scraped results into a database with SQLAlchemy.

Login triggers a redirect, so fetching the homepage fails #2

Closed qiezigao closed 6 years ago

qiezigao commented 6 years ago

Hello, both when using your code for the simulated login and with another simulated-login approach, the request for the homepage after login completes gets redirected. How can this be resolved? Thank you. The log is below:

2018-08-26 10:20:31 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2018-08-26 10:20:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2018-08-26 10:20:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to  from 
2018-08-26 10:20:32 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: https://www.zhizhu.com)
2018-08-26 10:20:32 [scrapy.core.engine] INFO: Closing spider (finished)

Below is the other simulated-login approach, which also gets redirected:

 # -*- coding: utf-8 -*-
 __author__ = 'Mark'
 __date__ = '2018/4/15 10:18'

 import hmac
 import json
 import scrapy
 import time
 import base64
 from hashlib import sha1

 class ZhihuLoginSpider(scrapy.Spider):
     name = 'zhihu'
     allowed_domains = ['www.zhihu.com']
     start_urls = ['http://www.zhihu.com/']
     agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
     # agent = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
     headers = {
         'Connection': 'keep-alive',
         'Host': 'www.zhihu.com',
         'Referer': 'https://www.zhihu.com/signup?next=%2F',
         'User-Agent': agent,
         'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
     }
     grant_type = 'password'
     client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
     source = 'com.zhihu.web'
     timestamp = str(int(time.time() * 1000))

     def get_signature(self, grant_type, client_id, source, timestamp):
         """處理簽名"""
         hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', None, sha1)
         hm.update(str.encode(grant_type))
         hm.update(str.encode(client_id))
         hm.update(str.encode(source))
         hm.update(str.encode(timestamp))
         return str(hm.hexdigest())

     def parse(self, response):
         print(response.body.decode("utf-8"))

     def start_requests(self):
         yield scrapy.Request('https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
                              headers=self.headers, callback=self.is_need_capture)

     def is_need_capture(self, response):
         print(response.text)
         need_cap = json.loads(response.body.decode('utf-8'))['show_captcha']
         print(need_cap)

         if need_cap:
             print('captcha required')
             yield scrapy.Request(
                 url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
                 headers=self.headers,
                 callback=self.capture,
                 method='PUT'
             )
         else:
             print('no captcha required')
             post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
             post_data = {
                 "client_id": self.client_id,
                 "username": "",  # 輸入知乎用户名(手机号)
                 "password": "",  # 輸入知乎密碼
                 "grant_type": self.grant_type,
                 "source": self.source,
                 "timestamp": self.timestamp,
                 "signature": self.get_signature(self.grant_type, self.client_id, self.source, self.timestamp),  # 獲取簽名
                 "lang": "en",
                 "ref_source": "homepage",
                 "captcha": '',
                 "utm_source": "baidu"
             }
             yield scrapy.FormRequest(
                 url=post_url,
                 formdata=post_data,
                 headers=self.headers,
                 callback=self.check_login
             )

     def capture(self, response):
         try:
             img = json.loads(response.body.decode('utf-8'))['img_base64']
         except (ValueError, KeyError):
             print('failed to extract img_base64 from the response!')
         else:
             # the captcha image is base64-encoded; write it out for manual input
             img_data = base64.b64decode(img)
             with open('zhihu_capture.gif', 'wb') as f:
                 f.write(img_data)
         captcha = input('enter the captcha: ')
         post_data = {
             'input_text': captcha
         }
         yield scrapy.FormRequest(
             url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
             formdata=post_data,
             callback=self.captcha_login,
             headers=self.headers
         )

     def captcha_login(self, response):
         try:
             cap_result = json.loads(response.body.decode('utf-8'))['success']
             print(cap_result)
         except (ValueError, KeyError):
             print('the captcha POST request did not return a result!')
         else:
             if cap_result:
                 print('captcha verified!')
         post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
         post_data = {
             "client_id": self.client_id,
             "username": "",  # 輸入知乎用户名(手机号)
             "password": "",  # 輸入知乎密碼
             "grant_type": self.grant_type,
             "source": self.source,
             "timestamp": self.timestamp,
             "signature": self.get_signature(self.grant_type, self.client_id, self.source, self.timestamp),  # 獲取簽名
             "lang": "en",
             "ref_source": "homepage",
             "captcha": '',
             "utm_source": ""
         }
         headers = self.headers
         headers.update({
             'Origin': 'https://www.zhihu.com',
             'Pragma': 'no-cache',
             'Cache-Control': 'no-cache'
         })
         yield scrapy.FormRequest(
             url=post_url,
             formdata=post_data,
             headers=headers,
             callback=self.check_login
         )

     def check_login(self, response):
         # check whether login succeeded
         text_json = json.loads(response.text)
         print(text_json)
         yield scrapy.Request('https://www.zhihu.com/inbox', headers=self.headers)
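For reference, the `signature` field sent in `post_data` above is just an HMAC-SHA1 digest over the concatenated login parameters, keyed with the constant that appears in the script. A minimal standalone sketch (same key and field order as the spider; the function name is mine):

```python
import hmac
from hashlib import sha1

def zhihu_signature(grant_type: str, client_id: str, source: str, timestamp: str) -> str:
    """HMAC-SHA1 over grant_type + client_id + source + timestamp,
    keyed with the constant used by the web client in the spider above."""
    hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', digestmod=sha1)
    for part in (grant_type, client_id, source, timestamp):
        hm.update(part.encode('utf-8'))
    return hm.hexdigest()

sig = zhihu_signature('password', 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                      'com.zhihu.web', '1535250031000')
print(sig)  # a 40-character hex digest, deterministic for fixed inputs
```

Because the timestamp is baked into the digest, the signature must be recomputed for every login attempt with the same millisecond timestamp that is sent in `post_data`.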
GZhangjl commented 6 years ago

Hello, this is the first issue anyone has opened on my repository, so I'm a bit nervous. Thank you. Based on my later runs, this redirect happens because the same account or IP accesses Zhihu too frequently and gets flagged as abnormal; the request is then redirected to a URL like https://www.zhihu.com/account/unhuman?type=unhuman&message=...&need_login=false, where a captcha must be entered to continue. From what I observed, if you run the Selenium simulated login again while being blocked by Zhihu, you are taken straight to the captcha page after entering the username and password. I have since updated the code to extract the captcha and submit it to pass the check, so please try running the code again. The project is still being improved, so thanks again for the issue. Also, the other crawler code you attached taught me a lot. Thanks for sharing.
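The anti-bot redirect described above can also be detected mechanically before retrying: if a request ends up at Zhihu's `/account/unhuman` page, the session has been flagged and the captcha step is needed. A small stdlib sketch (URL pattern taken from the comment above; the function name is my own):

```python
from urllib.parse import urlparse, parse_qs

def is_unhuman_redirect(url: str) -> bool:
    """Return True if the URL is Zhihu's anti-bot ("unhuman") captcha page."""
    parsed = urlparse(url)
    if parsed.netloc not in ('www.zhihu.com', 'zhihu.com'):
        return False
    if parsed.path != '/account/unhuman':
        return False
    # parse_qs maps each key to a list of values
    return parse_qs(parsed.query).get('type') == ['unhuman']

print(is_unhuman_redirect(
    'https://www.zhihu.com/account/unhuman?type=unhuman&need_login=false'))  # True
print(is_unhuman_redirect('https://www.zhihu.com/inbox'))  # False
```

In a Scrapy callback one could check `response.url` with this helper and branch into the captcha flow instead of trying to parse the (blocked) page.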

qiezigao commented 6 years ago

Thanks for the update! It's also my first time getting a reply to an issue, so I'm a bit nervous too _| ̄|● I'm currently learning Scrapy, and I really appreciate your help. Best of luck, and many thanks!