HaxeFoundation / haxe.org-comments

[code.haxe.org] Beginner - Using regular expressions #17

Open utterances-bot opened 5 years ago

utterances-bot commented 5 years ago

Using regular expressions - Beginner - Haxe programming language cookbook

In Haxe a regular expression starts with ~/ and ends with a single / and is of type EReg.

https://code.haxe.org/category/beginner/regular-expressions.html

brennanyoung commented 5 years ago

This might be obvious to some, but it would be worth mentioning that backslashes need to be escaped when creating regular expressions using new EReg() - and that this is not necessary with regex literals. A couple of examples would make this 'gotcha' perfectly clear.
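
To illustrate the point above, here is a minimal sketch. The class name EscapeDemo, the helper matchesBoth, and the sample input are made up for illustration. It assumes the two-argument constructor new EReg(pattern, options); because the pattern is then an ordinary string, each backslash has to be doubled so that a single backslash reaches the regex engine.

```haxe
class EscapeDemo {
    // Hypothetical helper: true when the literal and the constructed
    // pattern both match `input` and agree on the matched text.
    public static function matchesBoth(input:String):Bool {
        // Regex literal: backslashes are written once.
        var literal = ~/\d+\.\d+/;
        // Constructor form: the pattern is a plain string, so every
        // backslash is doubled.
        var constructed = new EReg("\\d+\\.\\d+", "");
        return literal.match(input)
            && constructed.match(input)
            && literal.matched(0) == constructed.matched(0);
    }

    static function main() {
        trace(matchesBoth("version 4.3 released")); // true
        trace(matchesBoth("no digits here"));       // false
    }
}
```

The second constructor argument holds the flags (for example "i" or "g") that would otherwise follow the closing / of a literal.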

Mariosyian commented 2 years ago

Hello, I'm not sure if I'm missing something obvious here but I can't seem to find anything regarding the case where one line contains multiple instances of the same regex.

For example, I have this demo code for parsing Markdown into HTML, specifically the URL parsing.

// Regex.hx
class Regex {
    public static function main() {
        var pattern: EReg = ~/!\[.*\]\(.*\)/i;
        var word: String = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        trace(word);
        trace(pattern.match(word));
        trace(pattern.matched(0));
        trace(pattern.matchedPos());
        trace(pattern.split(word));
        trace(pattern.matchedLeft());
        trace(pattern.matchedRight());
    }
}

Executing this code with $ haxe --main Regex.hx --interp yields the following result.

Regex.hx:5: This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways
Regex.hx:6: true
Regex.hx:7: ![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky)
Regex.hx:8: {pos: 15, len: 71}
Regex.hx:9: [This is a URL!!, ways]
Regex.hx:10: This is a URL!!
Regex.hx:11:  ways

I expected something along the lines of matching multiple positions i.e.

pattern.matched(?) => ["![demo](demo)", "![]()", "![urls]()", "![](whacky)"] // Perhaps such output is more suited for pattern.split(word)?
pattern.matchedPos() => [{pos: 15, len: 13}, {pos: 29, len: 5}, {pos: 62, len: 9}, {pos: 75, len: 11}]

The actual behavior makes a bit more sense given that <EReg>.match(s: String) returns a boolean rather than an array of the matched substrings of s, but I don't understand how to handle multiple matches of a regex in a single string.

Does Haxe not support this currently, and is it something being considered for the future?

RblSb commented 2 years ago

@Mariosyian your regex is greedy: the .* pattern consumes as much of the string as it can in order to maximize the first match. You can fix this by changing it to [^end_of_bracket_char]* in both places. There is a nice site for debugging regexes: https://regex101.com/r/R6Zkth/1 And the actual code:

class Regex {
    static function main() {
        // var pattern: EReg = ~/!\[.*\]\(.*\)/i;
        var pattern: EReg = ~/!\[[^\]]*\]\([^\)]*\)/gi;
        var word: String = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        trace(word);
        trace(pattern.match(word));
        trace(pattern.matched(0));
        trace(pattern.matchedPos());

        trace(pattern.split(word)); // in-betweens
        // array of matches with some formatting
        trace(getMatches(pattern, word).join("     "));
    }

    static function getMatches(ereg: EReg, input: String, index: Int = 0): Array<String> {
        var matches = [];
        while (ereg.match(input)) {
            matches.push(ereg.matched(index));
            input = ereg.matchedRight();
        }
        return matches;
    }
}
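
As a smaller illustration of the greedy-match issue described in this comment, the sketch below compares the two patterns on a shorter input. The class name GreedyDemo, the helper firstMatch, and the input string are hypothetical and chosen only for this example.

```haxe
class GreedyDemo {
    // Hypothetical helper: returns the first match of `pattern` in
    // `input`, or null when there is no match.
    public static function firstMatch(pattern:EReg, input:String):Null<String> {
        return pattern.match(input) ? pattern.matched(0) : null;
    }

    static function main() {
        var input = "a ![one](1) b ![two](2) c";
        // Greedy: `.*` runs to the last `]` and the last `)` it can reach.
        trace(firstMatch(~/!\[.*\]\(.*\)/, input));         // ![one](1) b ![two](2)
        // Negated classes stop at the first closing bracket or parenthesis.
        trace(firstMatch(~/!\[[^\]]*\]\([^\)]*\)/, input)); // ![one](1)
    }
}
```

Note that a loop which repeatedly calls match on matchedRight(), like getMatches, would never terminate for a pattern that can match the empty string, because the remaining input would stop shrinking.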

Mariosyian commented 2 years ago

@RblSb Thank you very much for both the advice and the help! I went ahead and rewrote your function to retrieve the span of every matched substring inside the original string (as this was my original purpose). Hopefully this proves useful to someone down the line. NOTE: This function was NOT unit tested outside the context of my specific use case.

class Regex {
    static function main() {
        var pattern: EReg = ~/!\[[^\]]*\]\([^\)]*\)/gi;
        var word: String = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        trace(word);
        // array of matched substrings, followed by their spans
        trace(getMatches(pattern, word));
        trace(getTrueMatchPositions(pattern, word));
    }

    /**
     * Gets all string matches of the given pattern within the input string and returns
     * them as an array of strings.
     *
     * @param ereg The regular expression to match against.
     * @param input The string to run the regular expression against.
     * @param index The index of the match group.
     * @return Array<String> An array of the matched substrings.
     */
    static function getMatches(ereg: EReg, input: String, index: Int = 0): Array<String> {
        var matches: Array<String> = [];
        while (ereg.match(input)) {
            matches.push(ereg.matched(index));
            input = ereg.matchedRight();
        }
        return matches;
    }

    /**
     * Gets all string matches of the given pattern within the input string and returns
     * them as an array of anonymous objects, whose properties include the matched
     * strings along with its span in the original string.
     *
     * @param ereg The regular expression to match against.
     * @param input The string to run the regular expression against.
     * @param index The index of the match group.
     * @return Array<{match:String, span:{begin: Int, end: Int}}> An array of the
     *  matched substrings and their spans in the original string.
     */
    static function getTrueMatchPositions(ereg: EReg, input: String, index: Int = 0): Array<{match:String, span:{begin:Int, end:Int}}> {
        var matches: Array<{match:String, span:{begin: Int, end: Int}}> = [];
        while (ereg.match(input)) {
            var position: {pos: Int, len: Int} = ereg.matchedPos();
            if (matches.length > 0) {
                position = {
                    "pos": matches[matches.length - 1].span.end + position.pos,
                    "len": position.len,
                };
            }
            matches.push(
                {
                    "match": ereg.matched(index),
                    "span": {
                        "begin": position.pos,
                        "end": position.pos + position.len,
                    },
                }
            );
            input = ereg.matchedRight();
        }
        return matches;
    }
}

This is the output of the above class:

$ haxe --main Regex.hx --interp
Regex.hx:5: This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways
Regex.hx:7: [![demo](demo),![](),![urls](),![](whacky)]
Regex.hx:8: [{match: ![demo](demo), span: {begin: 15, end: 28}},{match: ![](), span: {begin: 29, end: 34}},{match: ![urls](), span: {begin: 62, end: 71}},{match: ![](whacky), span: {begin: 75, end: 86}}]
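
A possible alternative, sketched here under assumptions rather than taken from the thread, is to scan with EReg.matchSub, which matches from a given offset. This sketch relies on matchedPos() reporting positions relative to the whole string passed to matchSub, so no manual offset bookkeeping is needed. The class name MatchSubDemo, the function allSpans, and the flattened record shape are hypothetical.

```haxe
class MatchSubDemo {
    // Hypothetical helper: collects every match together with its span
    // (begin inclusive, end exclusive) in the original string.
    public static function allSpans(ereg:EReg, input:String):Array<{match:String, begin:Int, end:Int}> {
        var result:Array<{match:String, begin:Int, end:Int}> = [];
        var offset = 0;
        while (offset < input.length && ereg.matchSub(input, offset)) {
            var p = ereg.matchedPos();
            result.push({match: ereg.matched(0), begin: p.pos, end: p.pos + p.len});
            // Continue after the match; advance one extra character on an
            // empty match so the loop cannot stall.
            offset = p.len == 0 ? p.pos + 1 : p.pos + p.len;
        }
        return result;
    }

    static function main() {
        var pattern = ~/!\[[^\]]*\]\([^\)]*\)/;
        var word = "This is a URL!!![demo](demo) ![]() with multiple instances of ![urls]() in ![](whacky) ways";
        for (m in allSpans(pattern, word))
            trace(m.match + " " + m.begin + "-" + m.end);
    }
}
```

For the sample string this should report the same four spans as the output above (15-28, 29-34, 62-71, 75-86), and it avoids allocating a new substring on every iteration.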