Playing around with regular expressions #108

Open bingoohuang opened 5 years ago

bingoohuang commented 5 years ago

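I read a blog post about someone who shot themselves in the foot with a regular expression. The code looked like this. Running it either pins the CPU or never produces a result.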

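public class BadRegexDemo {
    // WARNING: running this may hang and keep one CPU core busy.
    public static void main(String[] args) {
        // The nested quantifiers in (([A-Za-z0-9-~]+).)+ allow the same text to be
        // split in a huge number of ways, which is what triggers the backtracking.
        String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$";
        String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
        if (bugUrl.matches(badRegex)) {
            System.out.println("match!!");
        } else {
            System.out.println("no match!!");
        }
    }
}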

Regex: ^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

Input string: http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf

Paste both into regex101 (Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript) and it likewise reports a Catastrophic backtracking error.

Click Debugger on the left and an animation steps through the matching process, which is a nice way to visualize it.

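The simplest fix is to adjust the regex so that the match can succeed, but that only treats the symptom. The root cause needs attention too:

  1. Be clear about where regular expressions are appropriate. This case is not one of them; a parser along the lines of a URLParser should be used instead.
  2. Java regex matching can be constrained in some way so that catastrophic backtracking is avoided.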
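
Both points can be sketched in Java. This is only an illustration under assumptions, not the fix from the original blog post: the class name UrlCheck and both method names are made up, and the possessive pattern assumes the bare `.` in the original regex was meant to be a literal dot. The first method hands parsing to java.net.URI; the second keeps a regex but uses possessive quantifiers (`++`) so the engine cannot backtrack into the repeated groups.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

final class UrlCheck {
    // Hypothetical rewrite of the pattern above, assuming the bare '.' was meant
    // to be a literal dot. Possessive quantifiers (++) never give characters
    // back, so a failing match is abandoned instead of being retried.
    private static final Pattern POSSESSIVE_URL = Pattern.compile(
            "^[hH][tT]{2}[pP][sS]?://(?:[A-Za-z0-9~-]++\\.)++[A-Za-z0-9~/-]++$");

    /** Approach 1: let java.net.URI parse the string and only inspect its parts. */
    static boolean isHttpUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not a syntactically valid URI at all
        }
    }

    /** Approach 2: keep a regex, but one that cannot backtrack into its groups. */
    static boolean matchesPossessive(String candidate) {
        return POSSESSIVE_URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
        System.out.println(isHttpUrl(bugUrl));          // true
        System.out.println(matchesPossessive(bugUrl));  // false, and it returns at once
        System.out.println(matchesPossessive("https://www.example.com/a/b-c")); // true
    }
}
```

The possessive pattern rejects the sample URL because characters such as `?` are not in its character classes, but it does so immediately instead of hanging. Whether such URLs should be accepted is a separate question, and one that the parser-based check answers more naturally.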

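This reminds me of building lucky-number rules some years ago. The telecom BOSS system also used regular expressions to check whether a phone number fit a pattern such as AABB or AAAA, and those expressions were ugly to write: long, tedious and unpleasant to read. So when it was my turn, I implemented ABC expressions directly. You configure an expression such as AABB or AAAA, and it tells you whether a number is a lucky number of a given tier. Great fun.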
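
The original ABC-expression implementation is not shown here, so the following is only a guess at how such a matcher might work. It assumes that equal letters must map to equal digits and different letters to different digits, and that a rule applies to the tail of the number. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LuckyNumberPattern {
    /** True if digits has the shape described by pattern, e.g. "AABB" accepts "1122". */
    static boolean matches(String pattern, String digits) {
        if (pattern.length() != digits.length()) {
            return false;
        }
        // Two maps enforce a one-to-one binding between letters and digits.
        Map<Character, Character> letterToDigit = new HashMap<>();
        Map<Character, Character> digitToLetter = new HashMap<>();
        for (int i = 0; i < pattern.length(); i++) {
            char letter = pattern.charAt(i);
            char digit = digits.charAt(i);
            if (digit < '0' || digit > '9') {
                return false; // only ASCII digits are accepted
            }
            Character boundDigit = letterToDigit.putIfAbsent(letter, digit);
            Character boundLetter = digitToLetter.putIfAbsent(digit, letter);
            boolean letterConflict = boundDigit != null && boundDigit != digit;
            boolean digitConflict = boundLetter != null && boundLetter != letter;
            if (letterConflict || digitConflict) {
                return false;
            }
        }
        return true;
    }

    /** Assumption: a rule is checked against the last pattern.length() digits. */
    static boolean tailMatches(String pattern, String number) {
        return number.length() >= pattern.length()
                && matches(pattern, number.substring(number.length() - pattern.length()));
    }

    public static void main(String[] args) {
        System.out.println(tailMatches("AABB", "13800001122")); // true
        System.out.println(tailMatches("AAAA", "13800008888")); // true
        System.out.println(matches("AABB", "1111"));            // false: A and B must differ
    }
}
```

Compared with a regex using backreferences, the rule itself ("AABB") is the configuration, which is the readability gain described above.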

bingoohuang commented 5 years ago

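On the page Removing the MySQL root password I saw the regex temporary password(.*): \K(\S+) and did not understand what \K meant. The documentation explains that it resets the start of the reported match, so everything matched before it is left out of the result. A neat trick.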

\K can be used to reset the match start since PHP 5.2.4. For example, the pattern foo\Kbar matches "foobar", but reports that it has matched "bar". The use of \K does not interfere with the setting of captured substrings. For example, when the pattern (foo)\Kbar matches "foobar", the first substring is still set to "foo".
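
As far as I know, java.util.regex does not support \K, so in Java the same result is usually obtained by reading a capture group instead. The sketch below assumes that; the class name and the sample log line are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemporaryPassword {
    // Same idea as `temporary password(.*): \K(\S+)`, but the wanted text is
    // read from capture group 1 instead of relying on \K.
    private static final Pattern PASSWORD =
            Pattern.compile("temporary password.*: (\\S+)");

    static Optional<String> extract(String logLine) {
        Matcher m = PASSWORD.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical log line, invented for this example.
        String line = "[Note] A temporary password is generated for root@localhost: s3cr3t-Pw";
        System.out.println(extract(line).orElse("(not found)")); // prints s3cr3t-Pw
    }
}
```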

References

  1. Escape sequences
  2. mysql_temporary_password regex test

bingoohuang commented 4 years ago

  1. Extract multiple values with grep: ggrep -oP "appId.:.\K([\w_]+)|deviceId.:.\K([\w_]+)" 1.log | paste -d ' ' - - (regex101)
    $ grep -oP "appId.:.\K([\w_]+)|deviceId.:.\K([\w_]+)"  1.log  | paste -d ' ' - -
    APP_79D5EC7B7C55465F8FB8114C87517CF6 DEV_C77897738BF243B1AA1E8E316850F410
    APP_79D5EC7B7C55465F8FB8114C87517CF6 DEV_C77897738BF243B1AA1E8E316850F410
    APP_79D5EC7B7C55465F8FB8114C87517CF6 DEV_C77897738BF243B1AA1E8E316850F410
    DEV_37A4DD248CDC4DB7ADE0D3626CAB24CA APP_1698D243C3D04C46B1D488EBAE2F8348
  2. Extract with jq: grep -oP "\[\K({.*})(?=\])" 1.log | jq -c '. | {appId: .appId, deviceId: .deviceId}' (regex101)
    $ grep -oP "(?<=\[)({.*})(?=\])" 1.log | jq -c '. | {appId: .appId, deviceId: .deviceId}'
    {"appId":"APP_79D5EC7B7C55465F8FB8114C87517CF6","deviceId":"DEV_C77897738BF243B1AA1E8E316850F410"}
    {"appId":"APP_79D5EC7B7C55465F8FB8114C87517CF6","deviceId":"DEV_C77897738BF243B1AA1E8E316850F410"}
    {"appId":"APP_79D5EC7B7C55465F8FB8114C87517CF6","deviceId":"DEV_C77897738BF243B1AA1E8E316850F410"}
    {"appId":"APP_1698D243C3D04C46B1D488EBAE2F8348","deviceId":"DEV_37A4DD248CDC4DB7ADE0D3626CAB24CA"}
  3. demo 1.log
  4. How to use grep to extract multiple groups
  5. brew install grep How to install and use GNU Grep in OSX
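
For comparison, the same extraction can be sketched in Java. The contents of 1.log are not shown here, so the sample line below is an assumption based on the grep pattern and the jq output; the class name is hypothetical as well.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogFields {
    // Group 1 is the field name, group 2 its value. The two '.' stand in for
    // the quote characters around the colon, as in the grep pattern.
    private static final Pattern FIELD = Pattern.compile("(appId|deviceId).:.(\\w+)");

    static Map<String, String> extract(String logLine) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(logLine);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical log line; deviceId deliberately comes first.
        String line = "2020-01-01 12:00:00 [{\"deviceId\":\"DEV_1\",\"appId\":\"APP_2\"}]";
        Map<String, String> fields = extract(line);
        System.out.println(fields.get("appId") + " " + fields.get("deviceId")); // prints APP_2 DEV_1
    }
}
```

Keying the values by field name avoids the effect visible in the last line of the grep and paste output above, where the two values come out in the order they appear in the log rather than in a fixed order.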

bingoohuang commented 4 years ago

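In A Detailed Introduction to Regular Expressions (正则表达式详细介绍) I came across the online tool regexr.com. It is quite good too and can serve as a backup tool.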

bingoohuang commented 4 years ago

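A regular expression tutorial for ordinary people (普通人的正则表示式教程): a free English-language tutorial that introduces regular expressions to beginners, with plenty of examples.

Every regex in it comes with an online syntax-highlighted breakdown and a structure diagram, for example: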

UUID

/[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}/i (RegExr, Visual)
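
A minimal Java sketch of the same UUID pattern, with the /i flag expressed as Pattern.CASE_INSENSITIVE. The class and method names are made up. Note one difference: matches() requires the whole string to be a UUID, while the pattern above would also find a UUID inside a longer string.

```java
import java.util.regex.Pattern;

final class UuidCheck {
    private static final Pattern UUID_PATTERN = Pattern.compile(
            "[\\da-f]{8}-([\\da-f]{4}-){3}[\\da-f]{12}", Pattern.CASE_INSENSITIVE);

    /** True only if the whole string has the 8-4-4-4-12 hex layout. */
    static boolean isUuid(String s) {
        return UUID_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUuid("123e4567-e89b-12d3-a456-426614174000")); // true
        System.out.println(isUuid("123E4567-E89B-12D3-A456-426614174000")); // true
        System.out.println(isUuid("not-a-uuid"));                           // false
    }
}
```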

Some resources

  1. awesome-regex A curated collection of awesome Regex libraries, tools, frameworks and software
  2. regex tag on StackOverflow
  3. StackOverflow RegEx FAQ
  4. r/regex
  5. RexEgg
  6. Regular-Expressions.info
  7. Regex Crossword
  8. Regex Golf

bingoohuang commented 2 years ago

Regular Expression Tester and Visualizer