firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com

Code Generator for C# creates broken syntax for 'pattern' string. #2186

Closed: ZStoner closed this issue 11 months ago

ZStoner commented 12 months ago

Bug Description

In the C# Code Generator, the pattern string is emitted with slash-delimited regex syntax (/.../) instead of C# verbatim string syntax (@"..."). This is a C# syntax error. The closest existing generator languages that use the slash-delimited syntax appear to be PHP and Ruby. JavaScript is similar but different.

Reproduction steps

Generated Code (C#):

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = /
([[:punct:]]+)|(\w+)
/;
        string input = @"any word character !!!";

        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

C# Syntax Error:

        string pattern = /
([[:punct:]]+)|(\w+)
/;

Expected Outcome

The pattern variable should look something like this for C#...

        string pattern = @"([[:punct:]]+)|(\w+)";

Browser

Edge 119.0.2151.72

OS

Windows 11 22H2 (22621.2715)

OnlineCop commented 11 months ago

@ZStoner: POSIX-notation character classes, such as [[:punct:]], are not valid in .NET (at least according to regex101 when the .NET flavor is selected): https://regex101.com/r/DJjPfN/1
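For anyone who needs the pattern to work under the .NET flavor, one possible substitute is sketched below. It assumes that Unicode punctuation (`\p{P}`) plus symbols (`\p{S}`) is an acceptable stand-in for `[[:punct:]]`. That set is broader than the POSIX class because it also matches non-ASCII punctuation and symbols, so treat it as an approximation rather than an exact equivalent. The class and method names (`PunctDemo`, `FindAll`) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PunctDemo
{
    // Assumed substitute for [[:punct:]]: Unicode punctuation (\p{P}) plus symbols (\p{S}).
    public const string Pattern = @"([\p{P}\p{S}]+)|(\w+)";

    // Returns each match as "value@index".
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            results.Add(m.Value + "@" + m.Index);
        }
        return results;
    }

    public static void Main()
    {
        foreach (string hit in FindAll("any word character !!!"))
        {
            Console.WriteLine(hit);
        }
    }
}
```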

The Code Generator is expected to make only minimal changes:

  1. Validation of the PATTERN is done at the FUNCTION level (Match, Substitution, etc.), not the TOOLS level (like Code Generator).
  2. The delimiter specified in the currently-selected FLAVOR is used, even if it is not valid for the selected LANGUAGE.
  3. Each LANGUAGE has its own formula to escape special characters in PATTERN, taking the current delimiter into account.
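As a rough illustration of rule 3, a per-language escaping step for C# might look like the sketch below. This is not regex101's actual implementation; `PatternEmitter` and `ToVerbatimLiteral` are hypothetical names, and the sketch assumes the target is a C# verbatim string literal, where the only character that needs escaping is the double quote (written as `""`).

```csharp
using System;

public static class PatternEmitter
{
    // Hypothetical helper: wraps a raw pattern in a C# verbatim string literal.
    // Backslashes pass through unchanged; each " is doubled.
    public static string ToVerbatimLiteral(string rawPattern)
    {
        return "@\"" + rawPattern.Replace("\"", "\"\"") + "\"";
    }

    public static void Main()
    {
        string raw = @"([[:punct:]]+)|(\w+)";
        // Prints: string pattern = @"([[:punct:]]+)|(\w+)";
        Console.WriteLine("string pattern = " + ToVerbatimLiteral(raw) + ";");
    }
}
```

With the delimiter handled this way, the pattern from the report would come out in the form shown under Expected Outcome, although rule 1 still applies: the emitter does not check whether the pattern is valid for .NET.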

.NET stores regex PATTERNs in "quoted strings", so regex101 uses quotes for its delimiters.

Using verbatim string literals (@"...") reduces the complexity of escaping special characters in C# strings, so most \ characters are written without conversion.
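A small comparison may make the difference concrete. The pattern and names here (`VerbatimDemo`, `SecondGroup`) are invented for the example. In a verbatim literal the backslashes are written as-is and only `"` is doubled, while a regular literal needs every backslash and quote escaped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Verbatim literal: backslashes are taken as-is; only " must be doubled.
    public const string Verbatim = @"(\w+)\s""(\d+)""";

    // Regular literal: every backslash and every quote must be escaped.
    public const string Regular = "(\\w+)\\s\"(\\d+)\"";

    // Returns the text captured by the second group, or "" if there is no match.
    public static string SecondGroup(string input)
    {
        Match m = Regex.Match(input, Verbatim);
        return m.Success ? m.Groups[2].Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(Verbatim == Regular);        // True: both spell the same pattern
        Console.WriteLine(SecondGroup("id \"42\""));   // 42
    }
}
```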

What you're seeing is the effect of selecting the PCRE2 FLAVOR together with the C# LANGUAGE: the PCRE2 delimiter (/) is carried over and isn't changed to one of C#'s "quoted string" types.