This is a Regular Expressions library for Java. Compared to java.util.regex, it supports Recursive and Conditional Regular Expressions, Capture Trees and Plugins.
This tutorial assumes that you are already familiar with Regular Expressions and also with the Regular Expression API of Java.
The API is exactly the same as in java.util.regex. The only difference is that the required import statement is import com.florianingerl.util.regex.*;
instead of import java.util.regex.*;
Throughout this tutorial, we will make use of the following two utility functions:
import com.florianingerl.util.regex.*;
private static void check(String p, String s, boolean expected)
{
Matcher matcher = Pattern.compile(p).matcher(s);
if (matcher.find() != expected)
System.exit(1);
}
static void check(String regex, String input, String[] expected)
{
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
while (m.find()) {
result.add(m.group());
}
if (!Arrays.asList(expected).equals(result))
System.exit(1);
}
In the following screenshot, all the new features are summarized.
This section assumes that you are already familiar with Recursive Regular Expressions.
Be aware that the syntax (?R)
or (?0)
as in Perl is not supported by this library, only (?n)
where n
is greater than 0
or (?'groupName')
is supported. Examples:
//Matching Java types
String pattern = "^(?<javaType>[a-zA-Z]\\w*(?:\\<(?'javaType')(,(?'javaType'))*\\>)?)$";
check(pattern, "List<Integer>", true);
check(pattern, "HashMap<Integer,String>", true);
check(pattern, "Map<Integer,List<String>>", true);
//Matching constructs with an equal number of open and closing braces
check("(?<brace>\\((?:[^()]+|(?'brace'))*+\\))", "(go away (here (everything) is fine) afterwards",
new String[] { "(here (everything) is fine)" });
//Matching anagrams
pattern = "\\b(([a-zA-Z])(?1)?\\2|[a-zA-Z])\\b";
check(pattern, "anna is an anagram, so is lagerregal and otto and radar and every single letter like z",
new String[] { "anna", "lagerregal", "otto", "radar", "z" });
Different Regular Expression libraries handle Recursion, Group Capturing and Backreferences differently, so this topic deserves some comment here.
com.florianingerl.util.regex isolates capturing groups between each level of recursion. When the regex engine enters recursion, all capturing groups appear as they have not participated in the match yet. Initially, all backreferences will fail. During the recursion, capturing groups capture as normal. Backreferences match text captured during the same recursion as normal. When the regex engine exits from the recursion, all capturing groups revert to the state they were in prior to the recursion, except for the capturing group that was recursed to and has just been freshly captured. These tests will illustrate:
String pattern = "(?(DEFINE)(?<letter>[a-zA-Z]))\\b(?<anagram>(?'letter')(?'anagram')?\\k<letter>|(?'letter'))\\b";
check(pattern, "anna is an anagram, so is lagerregal and otto and radar and every single letter like z",
new String[] { "anna", "lagerregal", "otto", "radar", "z" });
pattern = "(?(DEFINE)(?<wrapper>(?<letter>[a-zA-Z])))\\b(?<anagram>(?'wrapper')(?'anagram')?\\k<letter>|(?'letter'))\\b";
check(pattern, "otto", false);
pattern = "(?(DEFINE)(?<second>\\k<first>))(?<first>[a-z])(?'second')";
check(pattern, "bb", false);
This section assumes that you are already familiar with Conditional Regular Expressions .
The syntax supported by this library is (?(groupNumber)yes|no)
or (?(groupName)yes|no)
. An example:
String pattern = "(?:(\\()|\\[)[a-z]+(?(1)\\)|\\])";
check(pattern, "(first) [second] [not third) (not fourth match]", new String[] {"(first)", "[second]" });
In the Capture Tree, all groups matched during the previous match operation are saved in a way that reflects the hierarchical nature of the groups.
The concept is best illustrated by an example. The following Regex should parse mathematical terms such as (6*[6+7+8]+9)*78*[4*(6+5)+4]
.
(?x) # comment mode
(?(DEFINE)
(?<term>(?'number')|(?'sum')|(?'product'))
(?<sum>
(?'summand')(?:\+(?'summand'))+
) # end of sum
(?<summand> (?'number') | (?'product') | \( (?: (?'sum') | (?'product') ) \) | \[ (?: (?'sum') | (?'product') ) \]) # end of summand
(?<product>
(?'factor')(?:\*(?'factor'))+
) # end of product
(?<factor>(?'number')| \( (?: (?'sum') | (?'product') ) \) | \[ (?: (?'sum') | (?'product') ) \] ) # end of factor
(?<number>\d+)
)# end of DEFINE
(?'term')
After having parsed a term, you can inspect the so-called Capture Tree of the match, which reflects the hierarchical nature of the groups. E.g. in this case, the term (6*[6+7+8]+9)*78*[4*(6+5)+4]
is a product which consists of three factors. The first one of these factors is a sum and so on...
The following code
String regex = IOUtils.toString( new FileInputStream("term.regex"), "UTF-8");
Pattern p = Pattern.compile(regex);
String term = "(6*[6+7+8]+9)*78*[4*(6+5)+4]";
System.out.println("Capture Tree for: " + term);
Matcher m = p.matcher(term);
m.setMode(Matcher.CAPTURE_TREE);
m.matches();
System.out.println(m.captureTree());
produces the output seen in the following screenshot:
Continuing with the example from above, we now want to replace each sum like 6+7
with sum(6,7)
and to replace each product like 5*3
with product(5,3)
. The short code snippet
String replacement = m.replaceAll( new DefaultCaptureReplacer() {
@Override
public String replace(CaptureTreeNode node) {
if ("sum".equals(node.getGroupName())) {
return "sum(" + node.getChildren().stream().filter(n -> "summand".equals(n.getGroupName()))
.map(n -> replace(n)).collect(Collectors.joining(",")) + ")";
} else if ("product".equals(node.getGroupName())) {
return "product(" + node.getChildren().stream().filter(n -> "factor".equals(n.getGroupName()))
.map(n -> replace(n)).collect(Collectors.joining(",")) + ")";
} else if (Arrays.asList("summand", "factor").contains(node.getGroupName())) {
return replace(node.getChildren().get(0));
} else {
return super.replace(node);
}
}
});
System.out.println(replacement);
prints
product(sum(product(6,sum(6,7,8)),9),78,sum(product(4,sum(6,5)),4))
You can install plugins for the Regex engine. The method of the Pattern class seen in the screenshot below is used for that purpose.
An example for a plugin is given in PluginTest.java.
Unfortunately, this library needs more stacks than java.util.regex which can lead to a StackOverflowException
more quickly in rare cases.
E.g. suppose you wanted to match Java strings with the following Regex:
"(\\.|[^"])*"
Then this would only work for string lengths up to 3890, whereas with java.util.regex it would work with string lengths up to 6930. The problem is the *-repetition that has to keep track of group captures and the backtracking options of the alternation. However a simple character class can be repeated nearly an unlimited number of times, so the Regex above could be improved to
"(?:\\.|[^"\\]+)*"
Read the JavaDoc for this library.
Download the library from Maven central.
Alternatively, download this repository and build the library yourself with Ant or Maven.
Contributions are welcome. Just open up a new issue where you describe a bug or a feature request.