Open utterances-bot opened 5 years ago
This might be obvious to some, but it would be worthwhile mentioning that backslashes need to be escaped when creating regular expressions using new EReg(), and that this is not necessary with regex literals. A couple of examples would make this 'gotcha' perfectly clear.
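A minimal sketch of the difference described above, assuming a pattern that matches one or more digits. The class name EscapeDemo and its two helper functions are made up for illustration. In a regex literal the backslash is written once; in a string passed to new EReg() it has to be doubled, because the backslash is also the string literal's own escape character.

```haxe
class EscapeDemo {
    // Regex literal: the backslash is written once.
    public static function literalMatches(s:String):Bool {
        var literal:EReg = ~/\d+/;
        return literal.match(s);
    }

    // String passed to new EReg(): "\\d+" is the three characters \d+
    // once the string literal has been parsed.
    public static function constructedMatches(s:String):Bool {
        var constructed:EReg = new EReg("\\d+", "");
        return constructed.match(s);
    }

    static function main() {
        trace(literalMatches("abc 123")); // true
        trace(constructedMatches("abc 123")); // true
    }
}
```

The second argument of new EReg() is the flag string, so new EReg("\\d+", "g") would correspond to ~/\d+/g.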
Hello, I'm not sure if I'm missing something obvious here, but I can't find anything about the case where one line contains multiple matches of the same regex.
For example, I have this demo code for parsing markdown into HTML, specifically the URL parsing.
// Regex.hx
class Regex {
    public static function main() {
        var pattern: EReg = ~/!\[.*\]\(.*\)/i;
        var word: String = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        trace(word);
        trace(pattern.match(word));
        trace(pattern.matched(0));
        trace(pattern.matchedPos());
        trace(pattern.split(word));
        trace(pattern.matchedLeft());
        trace(pattern.matchedRight());
    }
}
Executing this code with $ haxe --main Regex.hx --interp yields the following result.
Regex.hx:5: This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways
Regex.hx:6: true
Regex.hx:7: ![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky)
Regex.hx:8: {pos: 15, len: 71}
Regex.hx:9: [This is a URL!!, ways]
Regex.hx:10: This is a URL!!
Regex.hx:11: ways
I expected something along the lines of matching multiple positions i.e.
pattern.matched(?) => ["![demo](demo)", "![]()", "![urls]()", "![](whacky)"] // Perhaps such output is more suited for pattern.split(word)?
pattern.matchedPos() => [{pos: 15, len: 13}, {pos: 29, len: 5}, {pos: 62, len: 9}, {pos: 75, len: 11}]
The actual output makes a bit more sense given that <EReg>.match(s: String) returns a boolean, as opposed to an array of the matched substrings of s, but I don't understand how to handle multiple matches of a regex in a single string.
Does Haxe not support this currently, and is this something that is being considered for the future?
@Mariosyian your regex consumes as much of the string as it can, because the .* pattern is greedy and maximizes the first match. You can fix it by changing .* to [^end_of_bracket_char]* in both places. There is a nice site for debugging regexes: https://regex101.com/r/R6Zkth/1
And actual code:
class Regex {
    static function main() {
        // var pattern: EReg = ~/!\[.*\]\(.*\)/i;
        var pattern: EReg = ~/!\[[^\]]*\]\([^\)]*\)/gi;
        var word: String = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        trace(word);
        trace(pattern.match(word));
        trace(pattern.matched(0));
        trace(pattern.matchedPos());
        trace(pattern.split(word)); // in-betweens
        // array of matches with some formatting
        trace(getMatches(pattern, word).join(" "));
    }
    static function getMatches(ereg: EReg, input: String, index: Int = 0): Array<String> {
        var matches = [];
        while (ereg.match(input)) {
            matches.push(ereg.matched(index));
            input = ereg.matchedRight();
        }
        return matches;
    }
}
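As a possible alternative to the loop above, here is a sketch that collects the matches with EReg.map, which calls its callback once per match when the regex has the g flag. The class name CollectWithMap and the function collect are made up for illustration, and the replaced string that map returns is simply discarded.

```haxe
class CollectWithMap {
    // Collects every substring matched by a global (g flag) regex.
    public static function collect(ereg:EReg, input:String):Array<String> {
        var found:Array<String> = [];
        ereg.map(input, function(e:EReg):String {
            var m = e.matched(0);
            found.push(m);
            return m; // return the match unchanged; only the side effect matters
        });
        return found;
    }

    static function main() {
        var pattern = ~/!\[[^\]]*\]\([^\)]*\)/g;
        var word = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        trace(collect(pattern, word));
    }
}
```

Without the g flag, map only calls the callback for the first match, so the flag matters here in a way it does not for the match/matchedRight loop.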
@RblSb Thank you very much for both the advice and the help! I went ahead and rewrote your function so that it retrieves the span of every matched substring inside the original string (as this was my original purpose). Hopefully this proves useful to someone down the line. NOTE: This function was NOT unit tested outside the context of my specific use case.
class Regex {
    static function main() {
        var pattern: EReg = ~/!\[[^\]]*\]\([^\)]*\)/gi;
        var word: String = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        trace(word);
        // array of matches with some formatting
        trace(getMatches(pattern, word));
        trace(getTrueMatchPositions(pattern, word));
    }
    /**
     * Gets all string matches of the given pattern within the input string and returns
     * them as an array of strings.
     *
     * @param ereg The regular expression to match against.
     * @param input The string to run the regular expression against.
     * @param index The index of the match group.
     * @return Array<String> An array of the matched substrings.
     */
    static function getMatches(ereg: EReg, input: String, index: Int = 0): Array<String> {
        var matches: Array<String> = [];
        while (ereg.match(input)) {
            matches.push(ereg.matched(index));
            input = ereg.matchedRight();
        }
        return matches;
    }
    /**
     * Gets all string matches of the given pattern within the input string and returns
     * them as an array of anonymous objects, each holding a matched string along
     * with its span in the original string.
     *
     * @param ereg The regular expression to match against.
     * @param input The string to run the regular expression against.
     * @param index The index of the match group.
     * @return Array<{match:String, span:{begin: Int, end: Int}}> An array of the
     * matched substrings and their spans in the original string.
     */
    static function getTrueMatchPositions(ereg: EReg, input: String, index: Int = 0): Array<{match:String, span:{begin:Int, end:Int}}> {
        var matches: Array<{match:String, span:{begin: Int, end: Int}}> = [];
        while (ereg.match(input)) {
            // matchedPos() is relative to the current (already shortened) input.
            var position: {pos: Int, len: Int} = ereg.matchedPos();
            if (matches.length > 0) {
                // Shift by the end of the previous match to get the offset in the original string.
                position = {
                    "pos": matches[matches.length - 1].span.end + position.pos,
                    "len": position.len,
                };
            }
            matches.push(
                {
                    "match": ereg.matched(index),
                    "span": {
                        "begin": position.pos,
                        "end": position.pos + position.len,
                    },
                }
            );
            input = ereg.matchedRight();
        }
        return matches;
    }
}
This is the output of the above class:
$ haxe --main Regex.hx --interp
Regex.hx:5: This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways
Regex.hx:7: [![demo](demo),![](),![urls](),![](whacky)]
Regex.hx:8: [{match: ![demo](demo), span: {begin: 15, end: 28}},{match: ![](), span: {begin: 29, end: 34}},{match: ![urls](), span: {begin: 62, end: 71}},{match: ![](whacky), span: {begin: 75, end: 86}}]
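For comparison, a sketch of the same span collection built on EReg.matchSub, which starts matching at a given offset so that matchedPos() is already expressed in terms of the original string and no manual offset bookkeeping is needed. The class name MatchSpans, the function allSpans, and the flat {match, begin, end} shape are assumptions made for illustration, not part of the code above.

```haxe
class MatchSpans {
    // Collects every match of ereg in input together with its [begin, end) span.
    public static function allSpans(ereg:EReg, input:String):Array<{match:String, begin:Int, end:Int}> {
        var spans:Array<{match:String, begin:Int, end:Int}> = [];
        var start = 0;
        while (start < input.length && ereg.matchSub(input, start)) {
            var p = ereg.matchedPos();
            spans.push({match: ereg.matched(0), begin: p.pos, end: p.pos + p.len});
            // Continue after this match; step by one on an empty match
            // so the loop always makes progress.
            start = p.len > 0 ? p.pos + p.len : p.pos + 1;
        }
        return spans;
    }

    static function main() {
        var pattern = ~/!\[[^\]]*\]\([^\)]*\)/;
        var word = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        for (s in allSpans(pattern, word))
            trace('${s.match} [${s.begin}, ${s.end})');
    }
}
```

For the sample string this should report the same four spans as the output above. Because the input is never shortened, this variant also avoids copying the remainder of the string on every iteration.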
Using regular expressions - Beginner - Haxe programming language cookbook
In Haxe a regular expression starts with ~/ and ends with a single / and is of type EReg.
https://code.haxe.org/category/beginner/regular-expressions.html