chucklu / PythonTest


re library test #7

Open chucklu opened 3 years ago

chucklu commented 3 years ago
import re
rgx_pat = 'a(b|c)*d'
txt = 'ad,abd,acd,abbd,abcd,accd,acbccd,aed'
mat = re.search(rgx_pat, txt)
print(mat)
if mat is not None:
    print(mat.group())
chucklu commented 3 years ago
import re
txt = 'ab,a\nb'
rgx_pat = r'a.{0,4}b'
mat = re.search(rgx_pat, txt)
print(mat)
mat_str = mat.group()
print(mat_str)

<re.Match object; span=(0, 2), match='ab'>
ab
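The output above shows only 'ab' because, by default, `.` does not match a newline. As a minimal sketch (the variable names `default_matches` and `dotall_matches` are illustrative, not from the original comment), the same pattern can be compared with and without `re.DOTALL`:

```python
import re

txt = 'ab,a\nb'
rgx_pat = r'a.{0,4}b'

# Default behaviour: '.' does not match '\n', so 'a\nb' cannot match.
default_matches = re.findall(rgx_pat, txt)

# With re.DOTALL, '.' also matches '\n', so the greedy quantifier
# can run across the line break.
dotall_matches = re.findall(rgx_pat, txt, flags=re.DOTALL)

print(default_matches)  # ['ab']
print(dotall_matches)   # ['ab,a\nb']
```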

chucklu commented 3 years ago
import re
txt = 'I#have##a###cat.'
rgx_pat = r'#+'
words = re.split(rgx_pat, txt)
print(words)

['I', 'have', 'a', 'cat.']

chucklu commented 3 years ago
import re
txt = 'I#have##a###cat.'
rgx_pat = r'#+'
txt_new = re.sub(rgx_pat, '\t', txt)
print(txt_new)

I have a cat.

chucklu commented 3 years ago
import re
txt = 'I#have##a###cat.'
rgx_pat = r'#+'
txt_new, count = re.subn(rgx_pat, '\t', txt)
print(txt_new)
print(count)
chucklu commented 3 years ago

https://docs.python.org/3/library/re.html

re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

chucklu commented 3 years ago
import re
rgx_pat = 'a(b|c)*d'
txt = 'ad,abd,acd,abbd,abcd,accd,acbccd,aed'
mat = re.findall(rgx_pat, txt)
print(mat)

for mat in re.finditer(rgx_pat, txt):
    print(mat)

['', 'b', 'c', 'b', 'c', 'c', 'c']
<re.Match object; span=(0, 2), match='ad'>
<re.Match object; span=(3, 6), match='abd'>
<re.Match object; span=(7, 10), match='acd'>
<re.Match object; span=(11, 15), match='abbd'>
<re.Match object; span=(16, 20), match='abcd'>
<re.Match object; span=(21, 25), match='accd'>
<re.Match object; span=(26, 32), match='acbccd'>
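The `findall` result above lists only what the single capturing group last matched (an empty string for 'ad', where the group did not take part), which is the behaviour the quoted documentation describes. One way to get the whole matches from `findall` is a non-capturing group; the sketch below assumes that is the goal, and the variable names are illustrative:

```python
import re

txt = 'ad,abd,acd,abbd,abcd,accd,acbccd,aed'

# Capturing group: findall returns the group's last captured value per match.
group_values = re.findall(r'a(b|c)*d', txt)

# Non-capturing group (?:...): no groups in the pattern,
# so findall returns the full matched strings.
whole_matches = re.findall(r'a(?:b|c)*d', txt)

print(group_values)   # ['', 'b', 'c', 'b', 'c', 'c', 'c']
print(whole_matches)  # ['ad', 'abd', 'acd', 'abbd', 'abcd', 'accd', 'acbccd']
```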

chucklu commented 3 years ago
import re
rgx_pat_phone_no = r'^((13\d)(\d{8}))'
txt = '13312345679'
mat = re.fullmatch(rgx_pat_phone_no, txt)
print(mat)
if mat:
    for i in range(0,4):
        print(f'group({i}) = {mat.group(i)}')

<re.Match object; span=(0, 11), match='13312345679'>
group(0) = 13312345679
group(1) = 13312345679
group(2) = 133
group(3) = 12345679

Looking at the group structure of ^((13\d)(\d{8})), the pattern divides into three groups:
group 1 contains everything, including group 2 and group 3
group 2 is 13\d
group 3 is \d{8}

chucklu commented 3 years ago
import re
rgx_pat_phone_no = r'^((13\d)(\d{8}))'
txt = '13312345679'
mat = re.fullmatch(rgx_pat_phone_no, txt)
print(mat)
if mat:
    for i in range(0,4):
        group = mat.group(i)
        print(f'group({i}) = {group}')
        print(f'group({i}) start from {mat.start(i)}, end at {mat.end(i)}')

<re.Match object; span=(0, 11), match='13312345679'>
group(0) = 13312345679
group(0) start from 0, end at 11
group(1) = 13312345679
group(1) start from 0, end at 11
group(2) = 133
group(2) start from 0, end at 3
group(3) = 12345679
group(3) start from 3, end at 11
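The per-group loop above can also be written with `Match.groups()` and `Match.span()`, which avoids hard-coding the group count. This is only a sketch; `all_groups` and `spans` are illustrative names, and the group count is read from the compiled pattern's `groups` attribute:

```python
import re

# Same pattern and text as above. With fullmatch the leading ^ is
# redundant but harmless.
rgx = re.compile(r'^((13\d)(\d{8}))')
txt = '13312345679'
mat = rgx.fullmatch(txt)

# groups() returns groups 1..N as a tuple (group 0 is not included).
all_groups = mat.groups() if mat else ()

# span(i) returns (start(i), end(i)) for group i.
spans = [mat.span(i) for i in range(1, rgx.groups + 1)] if mat else []

print(all_groups)  # ('13312345679', '133', '12345679')
print(spans)       # [(0, 11), (0, 3), (3, 11)]
```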

chucklu commented 3 years ago
import re
year_pattern = r'(?P<year>\d{4})年'
month_pattern = r'(?P<month>\d{1,2})月'
day_pattern = r'(?P<day>\d{1,2})日'
patterns_dic = {'year':year_pattern, 'month':month_pattern, 'day':day_pattern}
txt = '他出生在2020年3月。'
for key in patterns_dic:
    mat = re.search(patterns_dic[key], txt)
    print(mat)
    if mat:
        print(f'{mat.group(key)}')
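The three named patterns above are searched one at a time. As an alternative sketch (the combined pattern `date_pattern` is an assumption, not something from the original comment), the parts can be joined into one pattern with optional month and day, and read back with `groupdict()`:

```python
import re

# Hypothetical combined pattern: the year is required, month and day are optional.
date_pattern = (
    r'(?P<year>\d{4})年'
    r'(?:(?P<month>\d{1,2})月)?'
    r'(?:(?P<day>\d{1,2})日)?'
)
txt = '他出生在2020年3月。'

mat = re.search(date_pattern, txt)
# groupdict() maps each group name to its value; groups that did not
# take part in the match are None.
parts = mat.groupdict() if mat else {}

print(parts)  # {'year': '2020', 'month': '3', 'day': None}
```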
chucklu commented 3 years ago
import re
txt = 'John is the the instructor.'
rgx_pat = r'\b(?P<word>\w+)\s+(?P=word)'
mat = re.search(rgx_pat, txt)
if mat:
    print(mat.group('word'))
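The named backreference `(?P=word)` above finds a repeated word. A possible follow-up, sketched here under the assumption that the goal is to remove the duplicate, is to use the same idea with `re.sub` and `\g<word>`. The trailing `\b` is an addition to the original pattern, so that a case like 'is island' is not treated as a repeat:

```python
import re

txt = 'John is the the instructor.'

# Trailing \b added here (not in the original pattern) so the repeat
# must be a whole word.
rgx_pat = r'\b(?P<word>\w+)\s+(?P=word)\b'

# \g<word> in the replacement refers to the named group.
deduped = re.sub(rgx_pat, r'\g<word>', txt)

print(deduped)  # John is the instructor.
```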
chucklu commented 3 years ago

conditional match

import re
txt = '<hello>,hello!'
rgx_pat = r'(<)?hello(?(1)>|!)'
for mat in re.finditer(rgx_pat, txt):
    print(mat.group())

To figure out group 1, see https://regexper.com/#%28%3C%29%3Fhello: the first group is (<). Parentheses mark a group; (pattern) matches the pattern and captures the matched text.

https://regexper.com/#%28%3C%29
(\d) is a group.
https://regexper.com/#%28%5Cd%29%3F
(\d)? is a group followed by ?, meaning the group can occur 0 or 1 times.
https://regexper.com/#%5Cd%3F
\d? is a digit followed by ?, meaning the digit can occur 0 or 1 times.

The expression between ( and ) is defined as a "group", and the characters matched by that expression are saved to a temporary area (according to the quoted note, at most 9 can be saved in one regular expression), which can be referenced with \1 through \9. Python's re module accepts numbered backreferences from \1 through \99.

https://regexper.com/#%28%3C%29%3Fhello%28%5C1%3E%7C!%29
This pattern has 2 groups.
https://regexper.com/#%28%3C%29%3Fhello%28%3F%3A%5C1%3E%7C!%29
(<)?hello(?:\1>|!) has only one group.
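The group counts discussed above can be checked directly in Python through the compiled pattern's `groups` attribute. The sketch below also runs the conditional pattern from the earlier snippet; the variable names are illustrative:

```python
import re

txt = '<hello>,hello!'

# Conditional: if group 1 ('<') matched, require '>', otherwise require '!'.
conditional = re.compile(r'(<)?hello(?(1)>|!)')
conditional_matches = [m.group() for m in conditional.finditer(txt)]

# (...) adds a capturing group; (?:...) groups without capturing.
capturing = re.compile(r'(<)?hello(\1>|!)')
non_capturing = re.compile(r'(<)?hello(?:\1>|!)')

print(conditional_matches)   # ['<hello>', 'hello!']
print(capturing.groups)      # 2
print(non_capturing.groups)  # 1
```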

chucklu commented 3 years ago

https://regexper.com/#hello%28%3F%3D!%29
https://regexper.com/#hello%28%3F!!%29
We can also check the regex explanations on https://regexr.com/ and https://regex101.com/.

https://regex101.com/r/MhBWSa/1 (?<=<)hello
https://regex101.com/r/MhBWSa/2 (?<!<)hello
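The four lookaround patterns linked above can be tried in Python as well. This sketch reuses the '<hello>,hello!' text from the conditional-match comment (an assumption; the linked pages may use other test strings) and records where each pattern matches:

```python
import re

txt = '<hello>,hello!'

patterns = {
    'lookahead':           r'hello(?=!)',   # hello followed by '!'
    'negative lookahead':  r'hello(?!!)',   # hello not followed by '!'
    'lookbehind':          r'(?<=<)hello',  # hello preceded by '<'
    'negative lookbehind': r'(?<!<)hello',  # hello not preceded by '<'
}

# Lookarounds assert context without consuming it, so every match
# is just 'hello'; only the start position differs.
starts = {name: [m.start() for m in re.finditer(pat, txt)]
          for name, pat in patterns.items()}

for name, positions in starts.items():
    print(name, positions)
```

With this text, the first 'hello' (index 1) is the one preceded by '<', and the second (index 8) is the one followed by '!'.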

chucklu commented 3 years ago

https://regexper.com/#%28%5Cw%2B%20%29%5Cs*%28%5Cr%29%3F%5Cn%5Cs

import re
txt = '''
This is a broken 
sentence.
This is another sentence.
'''
rgx_pat = r'(\w+ )\s*(\r)?\n\s*'
mats = re.finditer(rgx_pat, txt)
for mat in mats:
    print(mat)

txt = re.sub(rgx_pat, r'\1', txt)
print(txt)

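The regex pattern contains two groups and also matches a \n. The first group is a word followed by a space. When the pattern finds a word followed by a line break, re.sub replaces the whole match with group 1 (\1), so the line break is dropped and the broken sentence is joined. https://www.crifan.com/python_re_sub_detailed_introduction/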
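To make the effect of the substitution easier to see, here is a small sketch with the line breaks written as explicit `\n` escapes (the text is a shortened stand-in for the triple-quoted string above, and `\g<1>` is an equivalent spelling of `\1`):

```python
import re

# 'broken ' ends with a space and is followed by a line break.
txt = 'This is a broken \nsentence.\nThis is another sentence.\n'
rgx_pat = r'(\w+ )\s*(\r)?\n\s*'

# Only "word + space + line break" matches; the match is replaced by
# group 1, which keeps the word and its space and drops the line break.
joined = re.sub(rgx_pat, r'\g<1>', txt)

print(repr(joined))
```

The line break after 'sentence.' is left alone because the pattern requires a space right after the word.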

chucklu commented 3 years ago
import re
rgx_pat = 'a(b|c)*d'
rgx = re.compile(rgx_pat, flags=re.IGNORECASE)
txt = 'ad,abd,acd,abbd,abcd,ACCD,acbccd,aed'
for mat in rgx.finditer(txt):
    print(mat.group())
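As a rough comparison for the snippet above, the sketch below runs the same pattern with and without `re.IGNORECASE`, and with the inline `(?i)` flag, which should behave the same as passing the flag to `re.compile`. The variable names are illustrative:

```python
import re

rgx_pat = r'a(b|c)*d'
txt = 'ad,abd,acd,abbd,abcd,ACCD,acbccd,aed'

rgx_sensitive = re.compile(rgx_pat)
rgx_ignorecase = re.compile(rgx_pat, flags=re.IGNORECASE)
rgx_inline = re.compile(r'(?i)' + rgx_pat)  # inline form of the same flag

sensitive = [m.group() for m in rgx_sensitive.finditer(txt)]
ignorecase = [m.group() for m in rgx_ignorecase.finditer(txt)]
inline = [m.group() for m in rgx_inline.finditer(txt)]

print(sensitive)   # 'ACCD' is missing
print(ignorecase)  # 'ACCD' is included
print(inline == ignorecase)
```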
chucklu commented 3 years ago
import re
txt = '可能/AD 会/VV 把/BA 我们/PN 给/VV 卖了/NN'
rgx_pat = r'\s+把/[A-Z]+\s+(?P<noun>[\u4e00-\u9fa5]+/[A-Z]*N[A-Z]*)\s+给/[A-Z]+\s+(?P<verb>[\u4e00-\u9fa5]+/V[A-Z]*)\s+'
rgx = re.compile(rgx_pat)
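# Materialize the iterator once: calling list() on it and then looping over
# the same iterator would print nothing, because it is already exhausted.
mats = list(rgx.finditer(txt))
count = len(mats)
print(count)
for mat in mats:
    print(mat)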
chucklu commented 3 years ago
import re
txt = '为了/p 保险/n 起/v 见/v'
rgx_pat = r'\s+(?P<verb>[起下上住到]/v[A-Z]*)\s+'
rgx = re.compile(rgx_pat)
mats = rgx.finditer(txt)
for mat in mats:
    print(mat)
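The pattern above requires whitespace on both sides of the token, so a token at the very start or end of the text would not match. One possible variation, offered only as a sketch (the lookaround version and the `find_verbs` helper are assumptions, not part of the original comment), replaces the surrounding `\s+` with lookarounds and reads the named group directly:

```python
import re

# (?<!\S) and (?!\S) assert "no non-space character next to the token",
# which also holds at the start and end of the string.
rgx = re.compile(r'(?<!\S)(?P<verb>[起下上住到]/v[A-Z]*)(?!\S)')

def find_verbs(text):
    """Return the 'verb' group of every match in text (hypothetical helper)."""
    return [m.group('verb') for m in rgx.finditer(text)]

print(find_verbs('为了/p 保险/n 起/v 见/v'))  # ['起/v']
print(find_verbs('起/v 见/v'))                # ['起/v']
```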