
12-Python RegEx #12

Open Pin-Jiun opened 2 years ago

Pin-Jiun commented 2 years ago

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

import re

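During matching, the number of characters consumed can follow one of two behaviors: greedy or non-greedy. Python's quantifiers are greedy by default.

Greedy: keeps trying to match as many characters as possible.
Non-greedy: tries to match as few characters as possible, keeping the number of repetitions to a minimum.
Code example: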


import re

## Greedy mode
print(re.search(r'go*', 'goooooood').group()) ## 'gooooooo'

## Non-greedy mode
print(re.search(r'go*?', 'goooooood').group()) ## 'g'
gooooooo
g

How do you switch to non-greedy mode? Append a "?" to the quantifier, which gives the non-greedy forms *?, +?, ?? and {m,n}?.

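A common non-greedy idiom: .*?
.*? matches as few characters as possible. It is mostly used in patterns such as .*?a, meaning: match any characters up to the first letter a. For example, re.search('.*?e', 'a_b*c defg').group() returns 'a_b*c de'.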
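The difference between the greedy and non-greedy forms is easiest to see on text with repeated delimiters. The sketch below uses a made-up HTML-like string (the variable names `html`, `greedy` and `lazy` are illustrative only) and compares `.*` with `.*?`.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # sample input, assumed for illustration

# Greedy: .* runs on to the last '>' in the string.
greedy = re.search(r"<.*>", html).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r"<.*?>", html).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```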


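The match() Function

re.match tries to match the pattern starting at the very beginning of the string. If the match succeeds it returns a Match object; if the pattern does not match from the first character on, it returns None.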

re.match(pattern, string, flags)
import re

text = 'https://matters.news/@CHWang'
text1 = 'Matters.news'
print(re.match('https', text))
print(re.match('https', text).span())
print(re.match('matters', text))
print(re.match('matters', text1))
print(re.match('matters', text1, flags = re.I))
<re.Match object; span=(0, 5), match='https'>
(0, 5)
None
None
<re.Match object; span=(0, 7), match='Matters'>
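Because re.match returns None on failure, calling .span() or .group() directly on the result raises AttributeError when nothing matched. A minimal sketch of guarding against that, using a hypothetical helper named `match_prefix` that is not part of the re module:

```python
import re

def match_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so check before calling .group()
    return m.group() if m is not None else None

text = 'https://matters.news/@CHWang'
print(match_prefix('https', text))    # https
print(match_prefix('matters', text))  # None
```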

Match Object

A Match Object is an object containing information about the search and the result. If there is no match, the value None will be returned, instead of the Match Object.

The Match object has properties and methods used to retrieve information about the search, and the result:

.span() returns a tuple containing the start and end positions of the match (the end position is exclusive).
.string returns the string passed into the function.
.group() returns the part of the string where there was a match.


Using group()

import re

text = 'Jack lives in HsinChu and he is 25 years old, but ...'

match_result = re.match(r'(.*) lives in ([a-z]*) and he is (\d+).*', text, re.I)

print(match_result.group())
print(match_result.group(1))
print(match_result.group(2))
print(match_result.group(3))

print(type(match_result.groups()))
print(match_result.groups())
Jack lives in HsinChu and he is 25 years old, but ...
Jack
HsinChu
25
<class 'tuple'>
('Jack', 'HsinChu', '25')

The search() Function

re.search scans the whole string and returns the first match. If nothing matches, it returns None; on success it returns a Match object, and group() retrieves the matched text.

re.search(pattern, string, flags)
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start()) 
#The first white-space character is located in position: 3

x = re.search("ai", txt)
print(x)
#<re.Match object; span=(5, 7), match='ai'>

x = re.search(r"\bS\w+", txt)
print(x.span())
#(12, 17)

x = re.search(r"\bS\w+", txt)
print(x.string)
#The rain in Spain

x = re.search(r"\bS\w+", txt)
print(x.group())
#Spain
import re

text = 'Jen likes to eat cake and drink coke, but ...'

match_result = re.search(r'(.*) likes to eat (\w+) and drink ([a-z]*)', text, re.I|re.M)

print(match_result.group())
print(match_result.group(1))
print(match_result.group(2))
print(match_result.group(3))

print(match_result.groups())
Jen likes to eat cake and drink coke
Jen
cake
coke
('Jen', 'cake', 'coke')

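What is the difference between match and search? match only succeeds if the pattern matches from the very start of the string, while search can find a match anywhere in it.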
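A short side-by-side sketch of that difference, reusing the URL string from the match() example above. The variable names are illustrative.

```python
import re

text = 'https://matters.news/@CHWang'

at_start = re.match(r'matters', text)    # 'matters' is not at index 0
anywhere = re.search(r'matters', text)   # found later in the string
anchored = re.search(r'^matters', text)  # '^' pins search to the start, like match

print(at_start)         # None
print(anywhere.span())  # (8, 15)
print(anchored)         # None
```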


The findall() Function

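re.findall(pattern, string, flags)
pattern.findall(string, pos, endpos)   (method of a compiled pattern)

re.findall finds every non-overlapping match, collects the matches in a list, and returns it. If nothing matches, it returns an empty list.

Note: re.findall returns all matches of the pattern, whereas re.search and re.match return at most one.

  1. pattern: the matching rule, written in regular-expression syntax
  2. string: the string to search
  3. pos: optional; the index at which to start matching. Defaults to 0, the first character. Accepted only by the findall method of a compiled pattern, not by re.findall
  4. endpos: optional; the index at which to stop matching. Also accepted only by the compiled-pattern method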
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)  #['ai', 'ai']

If no matches are found, an empty list is returned.

import re

find_pattern = re.compile(r'[a-z]+', re.I)

match_result1 = find_pattern.findall('good 66 day Tom_28 Yep')
## Search only from index 6 up to (not including) index 20
match_result2 = find_pattern.findall('good98MMorning66 Jen666 Yeah', 6,20)

print(match_result1)
print(match_result2)
['good', 'day', 'Tom', 'Yep']
['MMorning', 'Jen']
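One behavior worth knowing when using findall: the shape of the returned list depends on whether the pattern contains capture groups. The following sketch, on a made-up sample string, shows the cases side by side, including the empty-list case.

```python
import re

text = 'Jack66Jen58Ken28'  # sample input, assumed for illustration

# No capture group: each item is the whole match.
whole = re.findall(r'[A-Za-z]+\d+', text)

# One capture group: each item is that group's text only.
names = re.findall(r'([A-Za-z]+)\d+', text)

# Several groups: each item is a tuple of the groups.
pairs = re.findall(r'([A-Za-z]+)(\d+)', text)

# Nothing matches: an empty list.
nothing = re.findall(r'\s', text)

print(whole)    # ['Jack66', 'Jen58', 'Ken28']
print(names)    # ['Jack', 'Jen', 'Ken']
print(pairs)    # [('Jack', '66'), ('Jen', '58'), ('Ken', '28')]
print(nothing)  # []
```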

The split() Function

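re.split splits the string at every match of the pattern and returns a list.

re.split(pattern, string, maxsplit, flags)
  1. pattern: the matching rule, written in regular-expression syntax
  2. string: the string to split
  3. maxsplit: the maximum number of splits; maxsplit=1 means split once. The default 0 means no limit
  4. flags: flags that set the matching mode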
import re

#Split the string at every white-space character:

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
#['The', 'rain', 'in', 'Spain']

You can control the number of occurrences by specifying the maxsplit parameter:

import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)
#['The', 'rain in Spain']
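import re

text = 'Jack66Jen58Ken28Cathy'

## Split on runs of digits
print(re.split(r'\d+', text))

## Split, and keep the digits in the list (capture group)
print(re.split(r'(\d+)', text))

## If a match sits at the very start or end, the list gets empty strings there
text1 = '66Jack66Jen58Ken28Cathy38'
print(re.split(r'\d+', text1))

## If nothing matches, the whole string comes back as a one-element list
print(re.split(r'\s+', text1))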
['Jack', 'Jen', 'Ken', 'Cathy']
['Jack', '66', 'Jen', '58', 'Ken', '28', 'Cathy']
['', 'Jack', 'Jen', 'Ken', 'Cathy', '']
['66Jack66Jen58Ken28Cathy38']
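The empty strings at the ends are often unwanted. One possible way to drop them, together with maxsplit passed as a keyword argument, is sketched below; the filtering step is just a list comprehension, not a feature of re.split.

```python
import re

text1 = '66Jack66Jen58Ken28Cathy38'

parts = re.split(r'\d+', text1)
names = [p for p in parts if p]  # drop the empty strings at both ends
print(names)  # ['Jack', 'Jen', 'Ken', 'Cathy']

# Limit the number of splits with the maxsplit keyword
first_two = re.split(r'\d+', 'Jack66Jen58Ken28Cathy', maxsplit=2)
print(first_two)  # ['Jack', 'Jen', 'Ken28Cathy']
```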

The sub() Function

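sub replaces each match with a replacement string. It is very handy in data cleaning: stray spaces, symbols and other unwanted characters can all be removed in one pass.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

It returns the string with the replacements applied.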

import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)
#The9rain9in9Spain

You can control the number of replacements by specifying the count parameter:

import re

#Replace the first two occurrences of a white-space character with the digit 9:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)
#The9rain9in Spain
import re

text = 'Jack/25/1993 and Jen/23/1995'

## Remove the "and" and its surrounding spaces, replacing them with &
sub_result1 = re.sub(r'\sand\s', '&', text)
print(sub_result1)

## Case 1: then remove every /
sub_result2 = re.sub('/', '', sub_result1)
print(sub_result2)

## Case 2: then remove /, but only the first two
sub_result3 = re.sub('/', '', sub_result1, count=2)
print(sub_result3)
Jack/25/1993&Jen/23/1995
Jack251993&Jen231995
Jack251993&Jen/23/1995
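The repl argument does not have to be a fixed string. As a sketch going slightly beyond the examples above, it can refer back to captured groups with \1, \2, ... or be a function that receives each Match object. The output formats chosen here are arbitrary illustrations.

```python
import re

text = 'Jack/25/1993 and Jen/23/1995'

# Backreferences: \3 is the third group, \1 the first
reordered = re.sub(r'(\w+)/(\d+)/(\d{4})', r'\3:\1', text)
print(reordered)  # 1993:Jack and 1995:Jen

# Callable repl: called once per match, returns the replacement text
upper_names = re.sub(r'[A-Z][a-z]+', lambda m: m.group().upper(), text)
print(upper_names)  # JACK/25/1993 and JEN/23/1995
```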

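The compile() Function

re.compile compiles a regular expression into a pattern object whose match, search and findall methods can then be used. In short, the rule is defined once, and the resulting pattern object is reused for every match, search or findall call.

This avoids retyping the same regular expression in every call when the rule itself has not changed.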

re.compile(pattern, flags)
import re

text = '68Jack66Jen58Ken28,Cathy38'

## Match letters, ignoring case
pattern = re.compile(r'([a-z]+)', re.I)

## match starts at the first position by default
compile_result1 = pattern.match(text)
print(compile_result1) ## None: match starts at index 0, which is a digit here

## Start matching at the 3rd character (index 2), stop before index 20
compile_result2 = pattern.match(text, 2, 20)
print(compile_result2)

print(compile_result2.group(0))
print(compile_result2.start(0))
print(compile_result2.end(0))
print(compile_result2.span())
None
<re.Match object; span=(2, 6), match='Jack'>
Jack
2
6
(2, 6)
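To illustrate the reuse described above, here is a minimal sketch in which one compiled pattern (named `word` purely for this example) serves match, search and findall on the same text.

```python
import re

word = re.compile(r'[a-z]+', re.I)  # compiled once, reused below
text = '68Jack66Jen58Ken28,Cathy38'

first = word.match(text)           # None: the string starts with a digit
found = word.search(text).group()  # first run of letters anywhere
every = word.findall(text)         # all runs of letters

print(first)  # None
print(found)  # Jack
print(every)  # ['Jack', 'Jen', 'Ken', 'Cathy']
```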

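Additional notes
group(): returns the matched text as a string. group() and group(0) both return the whole match; group(1) returns the first parenthesized group, group(2) the second, and so on. groups() returns all parenthesized groups as a tuple.
start(): the start position. It takes a group number: start(0), the default, is the whole match, start(1) is the first group, and so on.
end(): the end position (exclusive). It takes a group number in the same way as start().
span(): returns (start position, end position).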
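As a quick check of how group numbers behave in Python's re module, the sketch below uses a pattern with two parenthesized groups on a made-up string. Group 0 is always the whole match, and the parenthesized groups are numbered from 1.

```python
import re

m = re.search(r'([a-z]+)(\d+)', '68Jack66Jen58', re.I)

print(m.group(0), m.span(0))  # Jack66 (2, 8)  whole match
print(m.group(1), m.span(1))  # Jack (2, 6)    first group
print(m.group(2), m.span(2))  # 66 (6, 8)      second group
print(m.groups())             # ('Jack', '66')
```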

https://www.w3schools.com/python/python_regex.asp
https://chwang12341.medium.com/%E7%B5%A6%E8%87%AA%E5%B7%B1%E7%9A%84python%E5%B0%8F%E7%AD%86%E8%A8%98-%E5%BC%B7%E5%A4%A7%E7%9A%84%E6%95%B8%E6%93%9A%E8%99%95%E7%90%86%E5%B7%A5%E5%85%B7-%E6%AD%A3%E5%89%87%E8%A1%A8%E9%81%94%E5%BC%8F-regular-expression-regex%E8%A9%B3%E7%B4%B0%E6%95%99%E5%AD%B8-a5d20341a0b2