FuLLL0912 / Python


Book3 #1

Open FuLLL0912 opened 4 years ago

FuLLL0912 commented 4 years ago

Finish it before 3/20/2020

FuLLL0912 commented 4 years ago

3.1 urllib modules: request / error / parse / robotparser  (1) urlopen()

Get the source code of a website:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Check the type of the response:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Get attributes of the response:

print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

FuLLL0912 commented 4 years ago

Passing parameters to a URL. This is the API of the urlopen() function:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
(1) data
To send content in byte-stream encoding (the bytes type), convert it with bytes():

import urllib.parse    # urlencode() turns the parameters into a string, similar to str()
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)   # httpbin.org/post echoes POST requests for testing

print(response.read())

(2) timeout
An error is thrown if no response is received within the set time:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

A timeout can also be used to make Python skip a slow page, with try/except:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

(3) Others: cafile specifies the CA certificate, capath its directory, and context the SSL settings.

FuLLL0912 commented 4 years ago
  1. Request

import urllib.request

request = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

API: class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
url is mandatory; the others are optional: (1) data (2) headers (3) origin_req_host (4) unverifiable (5) method

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

headers can also be set with add_header():
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

Advanced usage with Handler classes: (1) login (2) proxies (3) cookies
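A rough sketch of the Handler approach for cookies and proxies, assuming a hypothetical local proxy at 127.0.0.1:9743 (a real proxy address would differ):

import http.cookiejar
import urllib.request

# Cookies: install an HTTPCookieProcessor to capture the cookies a site sets
cookie_jar = http.cookiejar.CookieJar()
cookie_opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
cookie_opener.open('http://www.baidu.com')
for cookie in cookie_jar:
    print(cookie.name, cookie.value)

# Proxy: route requests through a proxy server (the address below is a placeholder)
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743',
})
proxy_opener = urllib.request.build_opener(proxy_handler)
# response = proxy_opener.open('https://www.baidu.com')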

3.1.2 Error
3.1.3 Parsing URLs: (1) urlparse() (2) urlunparse() (3) urlsplit() (4) urlunsplit() (5) urljoin() (6) urlencode() (7) parse_qs() (8) parse_qsl() (9) quote() (10) unquote()
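A brief sketch of a few of these helpers (the URLs are only examples):

from urllib.parse import urlparse, urljoin, urlencode

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path, result.query, result.fragment)

print(urljoin('http://www.baidu.com', 'FAQ.html'))   # -> http://www.baidu.com/FAQ.html

params = {'name': 'germey', 'age': 22}
print('http://www.baidu.com?' + urlencode(params))   # serialize a dict into a query string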

3.1.4 The Robots protocol (robots.txt)
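A minimal robotparser sketch, using python.org only as an example target:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')
rp.read()   # fetch and parse robots.txt
print(rp.can_fetch('*', 'https://www.python.org/downloads/'))   # may any agent fetch this path?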

FuLLL0912 commented 4 years ago

3.2 Requests

import requests

r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

Other request types:
r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

The params parameter (p. 124)
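A small sketch of the params argument, which lets requests build the query string automatically (httpbin.org is just a test endpoint):

import requests

data = {'name': 'germey', 'age': 22}
r = requests.get('http://httpbin.org/get', params=data)   # query string is appended automatically
print(r.url)      # http://httpbin.org/get?name=germey&age=22
print(r.json())   # parse the JSON response into a dict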

Scraping a web page:

import requests
import re

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

Scraping binary data:

import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)
print(r.content)
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

POST requests (p. 127)
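A minimal POST sketch with requests; the form data is illustrative:

import requests

data = {'name': 'germey', 'age': 22}
r = requests.post('http://httpbin.org/post', data=data)   # form data is sent in the request body
print(r.text)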

3.2.2 Advanced usage: (1) file upload (2) cookies (3) session persistence (see the sketch below) (4) SSL certificate verification (5) proxy settings (6) timeout settings (7) authentication (8) Prepared Request
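As a sketch of item (3), session persistence: a Session object keeps cookies across requests, shown here against httpbin.org:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')   # the first request sets a cookie
r = s.get('http://httpbin.org/cookies')                     # the same session carries the cookie along
print(r.text)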

3.3 Regex

  1. \w \W
  2. match()  import re; result = re.match(.....); print(result.group(1))  Modifiers: re.I / re.L / re.M / re.S / re.U / re.X
  3. search
  4. findall()
  5. sub()  html = re.sub('<a.*?>|</a>', "", html)  (arguments: pattern, replacement, target string)
  6. compile()  (a short sketch of these methods follows below)
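A compact sketch of match(), search(), findall() and sub() on a made-up string:

import re

content = 'Hello 1234567 World_This is a Regex Demo'

m = re.match(r'Hello\s(\d+)\s(\w+)', content)   # match() anchors at the start of the string
print(m.group(1))                               # 1234567

s = re.search(r'(\d+)', content)                # search() scans the whole string
print(s.group(1))

print(re.findall(r'\w+', content))              # all non-overlapping matches, as a list

print(re.sub(r'\d+', '', content))              # strip the digits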

3.4 Maoyan Movies scraping project (p. 150)

FuLLL0912 commented 4 years ago

3/14/2020 Ch4. 4.1 XPath: nodename, /, //, ., .., @
//title[@lang='eng'] selects all title nodes whose lang attribute equals eng.

Package: lxml
from lxml import etree

Build an XPath parsing object; etree can automatically repair the HTML text:
text = '..........'
html = etree.HTML(text)           # HTML() initializes the parser with the text
result = etree.tostring(html)     # tostring() fixes the markup, e.g. unclosed tags
print(result.decode('utf-8'))     # decode converts the bytes to str

html = etree.parse('./test.html', etree.HTMLParser())   # read and parse the file directly

1. All nodes
result = html.xpath('//*')
result = html.xpath('//li')
print(result[0])

2. Child nodes
result = html.xpath('//li/a')    # all direct a children of li
result = html.xpath('//ul//a')   # all descendant a nodes under ul, even if none are direct children

3. Parent nodes
result = html.xpath('//a[@href="link4.html"]/../@class')   # select the a node whose href is link4.html, go to its parent, then take its class attribute

parent:: can also be used instead of ..

4. Attribute matching
result = html.xpath('//li[@class="item-0"]')

5. Text extraction
result = html.xpath('//li[@class="item-0"]/text()')

6. Attribute extraction
result = html.xpath('//li/a/@href')

7. Multi-value attribute matching
result = html.xpath('//li[contains(@class, "li")]/a/text()')

8. Multi-attribute matching
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')

9. Selecting by order
result = html.xpath('//li[1]/a/text()')
result = html.xpath('//li[last()]/a/text()')
result = html.xpath('//li[position()<3]/a/text()')
result = html.xpath('//li[last()-2]/a/text()')

10. Node-axis selection
result = html.xpath('//li[1]/ancestor::*')                    # all ancestor nodes
result = html.xpath('//li[1]/ancestor::div')                  # only div ancestors
result = html.xpath('//li[1]/attribute::*')                   # the attribute axis: all attribute values
result = html.xpath('//li[1]/child::a[@href="link1.html"]')   # the child axis: children, filtered by href
result = html.xpath('//li[1]/descendant::span')               # the descendant axis: descendants, filtered to span nodes
result = html.xpath('//li[1]/following::*[2]')                # all nodes after the current one, take the second
result = html.xpath('//li[1]/following-sibling::*')           # all following siblings of the current node

FuLLL0912 commented 4 years ago

from bs4 import BeautifulSoup
import re
import requests

links = requests.get("ABC").text
soup = BeautifulSoup(links, "html.parser")
soup.find_all('a')   # find all <a> tags

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)   # filter tag names against the regular expression

soup.find_all(["a","b"]) #Find all tags and tags

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)

from bs4 import NavigableString

def surrounded_by_strings(tag):
    # filter: keep tags whose neighbouring elements are both strings
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)

soup.find_all("p","title") soup.find_all(id="link2")

soup.find(string=re.compile("sisters"))
soup.find_all(href=re.compile("elsie"))
soup.find_all(id=True)
soup.find_all(href=re.compile("elsie"),id="link1")

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
data_soup.find_all(data-foo="value")

SyntaxError: keyword can't be an expression

data_soup.find_all(attrs={"data-foo": "value"})

name_soup = BeautifulSoup('<input name="email"/>', "html.parser")
name_soup.find_all(name="email")

[]

name_soup.find_all(attrs={"name":"email"})

[<input name="email"/>]

Searching by CSS class

soup.findall("a",class="sister") soup.findall(class=re.compile("it1")) def has_six_characters(css_class): return css_class is not None and len(css_class) == 6 soup.findall(class=has_six_characters)

css_soup.find_all("p", class_="body strikeout")

If you want to search for tags that match two or more CSS classes, use a CSS selector:

css_soup.select("p.strikeout.body")

For attributes that don't have the class_ shortcut, we can use attrs:

soup.find_all("a", attrs={"class":"sister"})

Filtering by string instead of by tag

soup.find_all(string="Elsie") soup.find_all(string=["Tillie","Elsie","Lacie"]) soup.find_all(string=re.compile("Dormouse"))

def is_the_only_string_within_a_tag(s):
    return s == s.parent.string

soup.find_all(string=is_the_only_string_within_a_tag)
soup.find_all("a", string="Elsie")   # in older BeautifulSoup versions this argument was text="Elsie"

The recursive argument
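A small sketch of the recursive argument, which restricts find_all() to direct children (toy markup):

from bs4 import BeautifulSoup

doc = BeautifulSoup('<html><head><title>Demo</title></head></html>', 'html.parser')
print(doc.html.find_all('title'))                    # searches all descendants: finds the title
print(doc.html.find_all('title', recursive=False))   # only direct children of <html>: []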

FuLLL0912 commented 4 years ago

4.2 BeautifulSoup parsers:
BeautifulSoup(markup, 'html.parser')
BeautifulSoup(markup, 'lxml')       # requires lxml
BeautifulSoup(markup, 'xml')        # requires lxml
BeautifulSoup(markup, 'html5lib')

Node selectors

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title)          # the node plus its text content
print(type(soup.title))
print(soup.title.string)   # the node's text content
print(soup.head)
print(soup.p)              # only selects the first p

Extracting information
(1) Getting the name: print(soup.title.name)
(2) Getting attributes:
print(soup.p.attrs)           # returns a dictionary
print(soup.p.attrs['name'])   # get the name attribute from that dictionary
print(soup.p['name'])         # shortcut for the attribute value; returns a list if the attribute has multiple values
(3) Getting the content: print(soup.p.string)

Nested selection
print(soup.head.title)         # title under head, still a Tag object
print(type(soup.head.title))
print(soup.head.title.string)

Associated selection
(1) Children and descendants
print(soup.p.contents)   # direct children, as a list
print(soup.p.children)   # direct children, as a generator

print(soup.p.descendants)   # all descendant nodes
for i, child in enumerate(soup.p.descendants):
    print(i, child)

(2) Parent and ancestor nodes
print(soup.a.parent)    # the direct (first) parent of a node
print(soup.a.parents)   # all ancestors of a node

(3) Sibling nodes
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

(4) Extracting information
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0])

6. Method selectors: find_all()
(1) Find by name:
print(type(soup.find_all(name='ul')[0]))
(2) attrs:
print(soup.find_all(attrs={'id': 'list-1'}))
class and id can be used without attrs:
print(soup.find_all(class_='element'))
print(soup.find_all(id='list-1'))
(3) text:
print(soup.find_all(text=re.compile('link')))

find() matches only the first element
find_parents() returns all ancestors; find_parent() returns the direct parent
find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter only the first
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter only the first
find_all_next() and find_next()
find_all_previous() and find_previous()
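A brief sketch of find() and a couple of these relatives on a toy snippet:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="menu"><li>a</li><li>b</li></ul>', 'html.parser')
first_li = soup.find('li')                       # only the first match
print(first_li.find_parent('ul')['id'])          # the enclosing <ul>: menu
print(first_li.find_next_sibling('li').string)   # the next <li> sibling: b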

7. CSS selectors
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))

Getting attributes
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Getting text
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)

FuLLL0912 commented 4 years ago

4.3 Using pyquery, which suits CSS selectors better than BeautifulSoup

String initialization
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('li'))

URL initialization: doc = pq(url='https://cuiqingcai.com')
File initialization: doc = pq(filename='demo.html')
print(doc('li'))

Basic CSS selector
print(doc('#container .list li'))         # select the node with id=container, then li nodes inside its .list descendants
print(type(doc('#container .list li')))   # the type is PyQuery

Finding nodes: descendants
doc = pq(html)
items = doc('.list')
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

Children
lis = items.children()
print(type(lis))

lis = items.children('.active')
print(lis)

Parent node
doc = pq(html)
items = doc('.list')
container = items.parent()   # only the direct parent node
print(type(container))
print(container)

Ancestor nodes
items = doc('.list')
parents = items.parents()    # all ancestor nodes
print(type(parents))

Sibling nodes
li = doc('.list .item-0.active')
print(li.siblings())         # all sibling nodes

li = doc('.list .item-0.active')
print(li.siblings('.active'))   # only siblings with class=active

1. Traversal
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li, type(li))

2. Retrieving information
(1) Attributes
doc = pq(html)
a = doc('.item-0.active a')   # select the a inside the li with classes item-0 and active
print(a, type(a))
print(a.attr('href'))

attr() only returns the attribute of the first matched node; to get all of them, traverse with items().

(2) Text
a = doc('.item-0.active a')   # select the a node; text() returns only the text, without HTML
print(a.text())

li = doc('.item-0.active')
print(li.html())

3. Modifying nodes
li = doc('.item-0.active')
li.removeClass('active')
li.addClass('active')

li.attr('name', 'link')    # change an attribute: name is the attribute, link is its value
li.text('changed item')    # change the text content
li.html('changed item')    # change the HTML content

wrap = doc('.wrap')
wrap.find('p').remove()
print(wrap.text())

4. Pseudo-class selectors
li = doc('li:first-child')        # the first li node
li = doc('li:last-child')         # the last li node
li = doc('li:nth-child(2)')       # the second li node
li = doc('li:gt(2)')              # li nodes after the third
li = doc('li:nth-child(2n)')      # even-numbered li nodes
li = doc('li:contains(second)')   # li nodes whose text contains "second"
FuLLL0912 commented 4 years ago

Ch5 Saving
5.1 File saving
5.1.1 TXT (skipped)
5.1.2 JSON (skipped)
5.1.3 CSV
import csv

Writing lists
with open('data.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)            # ',' is the default separator; each writerow() call writes a new row
    writer.writerow(['id', 'name', 'age'])
    writer.writerow(['100001', 'Mike', 20])

with open('data.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow(['id', 'name', 'age'])
    writer.writerow(['100001', 'Mike', 20])

or
writer.writerows([['id', 'name', 'age'], ['10001', 'Mike', 25], ['10002', 'Jay', 22]])

Writing dictionaries
with open('data.csv', 'w') as csvfile:
    fieldnames = ['id', 'name', 'age']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'id': '100001', 'name': 'Mike', 'age': 20})

Appending a new row to an existing file
with open('data.csv', 'a') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['100001', 'Mike', 20])

FuLLL0912 commented 4 years ago

Handling Chinese characters
with open('data.csv', 'a', encoding='utf-8') as csvfile:
    fieldnames = ['id', 'name', 'age']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'id': '10005', 'name': '王伟', 'age': 12})

2. Reading
import csv

with open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)

Using pandas:
import pandas as pd

df = pd.read_csv('data.csv')
print(df)

5.2.1 MySQL
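A rough sketch of the usual pymysql flow, assuming a local server, a database named spiders, a students table, and placeholder credentials:

import pymysql

# host, user, password and database below are placeholders
db = pymysql.connect(host='localhost', user='root', password='password', port=3306, db='spiders')
cursor = db.cursor()

sql = 'INSERT INTO students(id, name, age) VALUES (%s, %s, %s)'
try:
    cursor.execute(sql, ('20120001', 'Bob', 20))
    db.commit()        # writes must be committed
except Exception:
    db.rollback()      # undo on failure
finally:
    db.close()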

FuLLL0912 commented 4 years ago

Ch6 Ajax
6.1 Ajax: Asynchronous JavaScript and XML
6.2 Analysis method: (1) check the requests
6.3 Getting the data: (1) Analyze

from urllib.parse import urlencode
import requests

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/28306078474',
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            yield weibo

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)

FuLLL0912 commented 4 years ago

Ch7 7.1 Selenium

(1) Getting the URL, cookies, and page source

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
    input = browser.find_element_by_id('kw')
    input.send_keys('Python')
    input.send_keys(Keys.ENTER)
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()

(2) Different browsers: browser = webdriver.Chrome() / Firefox() / Edge() / PhantomJS() / Safari()

(3) Visiting a page
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
browser.close()

(4) Locating elements
...
input_first = browser.find_element_by_id('q')
input_second = browser.find_element_by_xpath('//*[@id="q"]')
input_third = browser.find_element_by_css_selector('#q')

find_element_by_name find_element_by_link_text find_element_by_partial_link_text find_element_by_tag_name find_element_by_class_name

find_element(By.ID, id)   # general-purpose method; By.ID can be replaced by other locator types

Multiple elements: find_element only returns the first match, so use find_elements:
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()

Summary: find_element for a single match, find_elements for all matches.

(5) Interacting with elements
from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input = browser.find_element_by_id('q')
input.send_keys('iPhone')
time.sleep(1)
input.clear()
input.send_keys('iPad')
button = browser.find_element_by_class_name('btn-search')
button.click()

(6) Navigating https://selenium-python.readthedocs.io/navigating.html

(7) Executing JavaScript
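A sketch of execute_script(), e.g. to scroll a page that loads content lazily; the target URL is only an example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')   # scroll to the bottom of the page
browser.execute_script('alert("To Bottom")')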

(8) Getting element information
Attributes:
from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'https://www.sina.com.cn/'
browser.get(url)
logo = browser.find_element_by_class_name('sina-logo')
print(logo)
print(logo.get_attribute('class'))

Text:
from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.sina.com.cn'
browser.get(url)
input = browser.find_element_by_class_name('sina-logo')
print(input.text)

FuLLL0912 commented 4 years ago

Ch12 pyspider. Structure: Scheduler / Fetcher / Processor, Monitor, Result Worker
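For context, a minimal pyspider handler sketch in the usual on_start / index_page / detail_page shape; the URL and selectors are illustrative:

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed request; the Scheduler queues it and the Fetcher downloads it
        self.crawl('http://quotes.toscrape.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # the Processor parses the page and schedules detail pages
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # returned dicts are handed to the Result Worker
        return {'url': response.url, 'title': response.doc('title').text()}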

FuLLL0912 commented 4 years ago

Ch13 Scrapy

  1. Structure: Engine, Item, Scheduler, Downloader, Spiders, Item Pipeline, Downloader Middlewares, Spider Middlewares
  2. Data Stream
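To make the structure concrete, a minimal Scrapy spider sketch (the site and selectors are illustrative): the Engine drives the Scheduler and Downloader, passes responses to the Spider, and sends the yielded items through the Item Pipeline.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']   # these requests are queued by the Scheduler

    def parse(self, response):
        # the Downloader fetched this response; parse it and yield items for the Item Pipeline
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # follow pagination; new requests go back through the Engine to the Scheduler
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)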