amber-lang / amber

💎 Amber the programming language compiled to Bash
https://amber-lang.com
GNU General Public License v3.0
3.81k stars 81 forks source link

[Feature] Ability to use extended regex syntax in sed #444

Closed hdwalters closed 4 days ago

hdwalters commented 2 weeks ago

I note that standard library function replace_regex() transpiles to the following sed command:

echo "${source}" | sed -e "s/${pattern}/${replacement}/g"

sed basic regex syntax is horrible, as you have to unexpectedly (IMHO) escape regex tokens like capturing parentheses (...), and it doesn't support the "one or more instances" token +. I therefore use sed -E, and it would be nice to have that option.

I believe the aim of Amber is to be as widely applicable as possible, and I understand that some Bash distributions do not support sed -E or the equivalent sed -r, so we can't necessarily just change it in the standard library. I would therefore like to propose a breaking change:

pub fun replace_regex(source: Text, pattern: Text, replacement: Text, extended: Bool): Text {
    unsafe {
        if extended:
            return $echo "{source}" | sed -E -e "s/{pattern}/{replacement}/g"$
        else:
            return $echo "{source}" | sed -e "s/{pattern}/{replacement}/g"$
    }
}

It may even be possible to provide default parameters to Amber functions, but I do not believe this is currently implemented.

Thoughts?

Mte90 commented 2 weeks ago

We have the support for default value for function parameters in the alpha in development. I thin that if we proceed with this we need a way to support all the sed available around.

hdwalters commented 2 weeks ago

if we proceed with this we need a way to support all the sed available around.

I realise there may be sed implementations which do not support the -E option, but can you name any which use a different option for the same functionality? AIUI -E is pretty universal where extended regexes are supported.

I would respectfully suggest that if anyone does have a non standard sed, they can just pass false in the new parameter, and use the standard regex syntax.

hdwalters commented 2 weeks ago

We have the support for default value for function parameters in the alpha in development.

Great! Can you point me at an example please?

Mte90 commented 1 week ago

https://github.com/amber-lang/amber/blob/master/src/std/env.ab#L4

hdwalters commented 1 week ago

So this is not a breaking change after all:

pub fun replace_regex(source: Text, pattern: Text, replacement: Text, extended: Bool = false): Text {
    unsafe {
        if extended:
            return $echo "{source}" | sed -E -e "s/{pattern}/{replacement}/g"$
        else:
            return $echo "{source}" | sed -e "s/{pattern}/{replacement}/g"$
    }
}
b1ek commented 1 week ago

iirc, sed's syntax differs a bit with gnu and bsd versions. im pretty sure alpine has its own stripped down sed too. the bsd version is used on all macOS systems, and gnu is on most linux ones

b1ek commented 1 week ago

yup, i was right about alpine:

➜  ~ docker run -it alpine:3.20 ash 
/ # sed --help
BusyBox v1.36.1 (2024-06-10 07:11:47 UTC) multi-call binary.

Usage: sed [-i[SFX]] [-nrE] [-f FILE]... [-e CMD]... [FILE]...
or: sed [-i[SFX]] [-nrE] CMD [FILE]...

    -e CMD  Add CMD to sed commands to be executed
    -f FILE Add FILE contents to sed commands to be executed
    -i[SFX] Edit files in-place (otherwise write to stdout)
        Optionally back files up, appending SFX
    -n  Suppress automatic printing of pattern space
    -r,-E   Use extended regex syntax

If no -e or -f, the first non-option argument is the sed command string.
Remaining arguments are input files (stdin if none).
/ # 
b1ek commented 1 week ago

i think we could check sed version like this:


version=$(sed --version 2>&1)

if echo $version | grep 'Copyright (C) ' | grep ' Free Software Foundation, Inc.' 2>&1 > /dev/null; then
    echo is gnu
fi
if echo $version | grep 'This is not GNU sed version ' 2>&1 > /dev/null; then
    echo is alpine (busybox)
fi
if echo $version | grep 'sed: illegal option -- -' 2>&1 > /dev/null; then
    echo is bsd
fi
hdwalters commented 1 week ago

My proposal is to default to -E in the absence of more definitive knowledge, so I do not think we need to test explicitly for Alpine sed, as it will fall under that default. This applies to BSD sed as well, as the current version supports -E (see https://man.freebsd.org/cgi/man.cgi?sed).

So I think all we really need to test for is "GNU or not GNU"; we might as well simplify the generated Bash output, especially as Mte90 requested that we test the flavour each time replace_regex() is called.

I am happy to search for \bCopyright\b.+\bFree Software Foundation\b, but I suspect that \bGNU sed\b will be slightly faster, with little risk of false positives.

hdwalters commented 1 week ago

I do also wonder how standardised the actual extended regex syntax is across flavours, but quite honestly it's going to be a losing game trying to correct for any disparities, and I think we can reasonably leave that to Amber script authors (it might be handy to provide a way of testing the OS type, but that's a topic for another day).

hdwalters commented 1 week ago

Tested on MacOS:

xxx@xxx-MacBook-Air-2 ~ % echo 'abc123def' | sed -E -e 's/([0-9]+)/[\1]/'
abc[123]def
b1ek commented 1 week ago

This applies to BSD sed as well, as the current version supports -E (see https://man.freebsd.org/cgi/man.cgi?sed).

they do not support the same regexes, and their engines differ a little bit

hdwalters commented 1 week ago

they do not support the same regexes, and their engines differ a little bit

I suspected as much, but I really don't think there's a lot we can do about masking those differences. After all, script authors would have to tailor the regex syntax if they were calling sed from a Bash script.