firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com

Special characters should be escaped within bracket list, e.g., "-" #2255

Closed andersweister closed 2 months ago

andersweister commented 2 months ago

Bug Description

The "-" denotes a range inside a bracket list, so it must be escaped when matching the character itself.

These four characters require an escape sequence inside a bracket list: ^, -, ], \.

Reproduction steps

Regex: [a-zA-Z0-9-]+ (notice the unescaped "-" at the end, which is illegal in major implementations).

Test string: pqr-456

Unfortunately, regex101 gives no warning and accepts the test string.

Expected Outcome

The following is correct: [a-zA-Z0-9\-]+ with one backslash before the last hyphen (doubled just for this markdown source).

Reference: https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html

Browser

Chromium Version 123.0.6312.105

OS

Linux Ubuntu 20.04.6 LTS

working-name commented 2 months ago

Hi @andersweister,

Can you let me know what you're using that gives you errors about the - at the end/beginning of a character class?

I can't seem to make the vanilla stuff croak.

PCRE2

PCRE2 version 10.43 2024-02-16 (8-bit)
  re> /[a-z-]+/
data> testing-this
 0: testing-this

Python

Python 3.8.10 (default, Nov 22 2023, 10:22:35)
Type "help", "copyright", "credits" or "license" for more information.
>>> import re; r = re.compile("[a-z-]+"); r.match("testing-this")
<re.Match object; span=(0, 12), match='testing-this'>
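For completeness, here is a minimal sketch that runs the exact pattern and test string from the report through Python's `re` module, next to the escaped variant. It only shows what Python does; other flavors may behave differently, and the variable names are just for illustration.

```python
import re

# Test string and patterns taken from the issue report.
test_string = "pqr-456"
unescaped = re.compile(r"[a-zA-Z0-9-]+")   # trailing hyphen, not escaped
escaped = re.compile(r"[a-zA-Z0-9\-]+")    # trailing hyphen, escaped

# Both patterns compile and are tried against the whole test string.
results = {}
for pattern in (unescaped, escaped):
    match = pattern.fullmatch(test_string)
    results[pattern.pattern] = match.group(0) if match else None
    print(pattern.pattern, "->", results[pattern.pattern])
```

In Python, both forms compile and accept the whole test string, so the escape is optional in this position.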

JavaScript

"testing-this".match(/[a-z-]+/)
0: "testing-this"
groups: undefined
index: 0
input: "testing-this"

.NET

http://regexstorm.net/tester?p=%5ba-z-%5d%2b&i=testing-this

andersweister commented 2 months ago

I saw it in a browser-based JavaScript application, using Unicode (UTF-8), that was validating an input field. I suspect different tools may parse the hyphen differently, which is unfortunate and can be confusing. Something similar is discussed here:

"Escaping the hyphen using \- is the correct way." https://stackoverflow.com/questions/3697202/including-a-hyphen-in-a-regex-character-bracket

"Escape the hyphen using \- as it usually used for character range inside character set." https://stackoverflow.com/questions/34916716/regular-expression-to-match-alphanumeric-hyphen-underscore-and-space-string

working-name commented 2 months ago

Unfortunately, regex is not a universally agreed-upon standard. It's a tool that has evolved differently over time and across programming languages. People call the variants "flavors", which is quite the apt description.

So yes, I agree that it gets confusing. Most people look up the regex documentation for the particular tool they use... which hopefully saves some time and headache.

As for the flavors supported on regex101.com, they behave according to their own documentation, where - can be used at the beginning or end of a character class without the need to escape it with a \ or another -. You can still use \- if you so desire, in which case the hyphen no longer has to be the first or last character in the character class.
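To make the positional rule concrete, here is a small sketch using Python's `re` module only (other flavors should be checked against their own documentation). The helper `chars_matched` is a hypothetical name made up for this example; it reports which characters of a sample string a given character class accepts.

```python
import re

def chars_matched(pattern, candidates="abc-"):
    """Return the characters in `candidates` that the character class matches."""
    return "".join(ch for ch in candidates if re.fullmatch(pattern, ch))

leading = chars_matched(r"[-ac]")          # hyphen first: literal hyphen
trailing = chars_matched(r"[ac-]")         # hyphen last: literal hyphen
escaped_middle = chars_matched(r"[a\-c]")  # escaped: literal in any position
range_middle = chars_matched(r"[a-c]")     # unescaped in the middle: range a..c

for name, value in [("leading", leading), ("trailing", trailing),
                    ("escaped_middle", escaped_middle),
                    ("range_middle", range_middle)]:
    print(f"{name}: {value!r}")
```

Only the unescaped hyphen in the middle is read as a range, which is why `[a-c]` accepts `b` but not `-`.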

If you would like to discuss this further, please feel free to comment here, or join us on IRC or Discord.