ballerina-platform / ballerina-spec

Ballerina Language and Platform Specifications

[lang.regexp] Clarification on `split()` behaviour when the delimiter is at the end of the string #1236

Closed pcnfernando closed 1 year ago

pcnfernando commented 1 year ago

Description: Consider the following scenario, in which the delimiter occurs at the end of the string.

import ballerina/io;
import ballerina/regex;

public function main() returns error? {
    string data = "abc@"; // the delimiter is the last character
    // lang.regexp split keeps the trailing empty string
    string[] emails = re `@`.split(data);
    io:println(emails); // ["abc", ""]
    // ballerina/regex split drops it
    string[] newEmail = regex:split(data, "@");
    io:println(newEmail); // ["abc"]
}

The current behavior matches the ECMAScript standard: if the delimiter is found at the end of the input string, an empty string is returned as the last element of the resulting array. Ref: https://tc39.es/ecma262/#sec-string.prototype.split

Java behaves differently: by default it does not include trailing empty strings in the result. This might be useful in some scenarios to avoid ambiguity in cases where the delimiter is repeated at the end of the string.

import ballerina/io;

public function main() returns error? {
    string data = "one,two,three,";
    string[] splits = re `,`.split(data);
    io:println(splits); // ["one","two","three",""]
}

Appreciate your thoughts on this.

jclark commented 1 year ago

The ECMAScript behavior is better.

When you say

> This might be useful in some scenarios to avoid ambiguity in cases where the delimiter is repeated at the end of the string.

By "this" you mean the ECMAScript behavior, right?

jclark commented 1 year ago

The spec seems quite clear on this.

I think I may not be understanding your point.

pcnfernando commented 1 year ago

> The ECMAScript behavior is better.
>
> When you say
>
> > This might be useful in some scenarios to avoid ambiguity in cases where the delimiter is repeated at the end of the string.
>
> By "this" you mean the ECMAScript behavior, right?

string str = "one,two,three,";

In this string, the delimiter "," appears at the end. If split() included the trailing empty string by default, it would be unclear whether the final element of the resulting array should be an empty string or not.

By excluding the trailing empty string by default, Java's split() method avoids this ambiguity and ensures that the resulting array always contains the expected elements.

> The spec seems quite clear on this.

Yes, the spec is clear on this. I was just checking whether we should consider this as a possible improvement. But I understand that the current behaviour is better, since it's consistent.

pcnfernando commented 1 year ago

Thanks for the clarification.

jclark commented 1 year ago

My perspective is that the Java behaviour creates an ambiguity: if the result is ["one","two","three"], then you don't know whether the input was "one,two,three" or "one,two,three,". For most delimited formats, these are not the same. With the Java semantics, if you need to parse something where the difference matters, you are out of luck. With the ECMAScript behaviour, you can explicitly ignore the empty last element if you want to.
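
The ambiguity described above can be reproduced in Java itself; the following is a minimal sketch rather than anything taken from the Ballerina implementation. It assumes that passing a negative limit to `String.split` is an acceptable stand-in for the ECMAScript-style result on this input. The class name `SplitAmbiguity` and the helper `dropLastIfEmpty` are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;

public class SplitAmbiguity {
    // Java's default split: trailing empty strings are removed.
    static String[] javaDefault(String input) {
        return input.split(",");
    }

    // A negative limit keeps trailing empty strings, which gives the same
    // result as the ECMAScript-style behaviour for these inputs.
    static String[] keepTrailing(String input) {
        return input.split(",", -1);
    }

    // Hypothetical helper: explicitly ignore one empty last element.
    static String[] dropLastIfEmpty(String[] parts) {
        int n = parts.length;
        if (n > 0 && parts[n - 1].isEmpty()) {
            return Arrays.copyOf(parts, n - 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Both inputs collapse to the same array, so the trailing comma is lost.
        System.out.println(Arrays.toString(javaDefault("one,two,three")));   // [one, two, three]
        System.out.println(Arrays.toString(javaDefault("one,two,three,")));  // [one, two, three]

        // Keeping the trailing empty string preserves the distinction.
        System.out.println(keepTrailing("one,two,three").length);   // 3
        System.out.println(keepTrailing("one,two,three,").length);  // 4

        // The caller can still opt out of the empty last element.
        System.out.println(Arrays.toString(
                dropLastIfEmpty(keepTrailing("one,two,three,"))));  // [one, two, three]
    }
}
```

The design point is that dropping information is always possible after the fact, while recovering it is not: the caller who wants Java-like output can post-process the longer array, but nothing can be done with the shorter one.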

The Java behaviour is useful when you are splitting on whitespace in human-produced input, but you can get the same result by using string:trim before split.
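
As a rough illustration of the trim-then-split point, here is a small Java sketch. It uses Java's `String.trim` as an analogue of Ballerina's `string:trim` (the two may not treat every whitespace character identically), and again uses a negative limit to stand in for the behaviour that keeps trailing empty strings. The class name `TrimBeforeSplit` and its method names are hypothetical.

```java
import java.util.Arrays;

public class TrimBeforeSplit {
    // Java's default split on runs of whitespace: trailing empties are dropped.
    static String[] javaDefault(String input) {
        return input.split("\\s+");
    }

    // Split that keeps trailing empty strings (negative limit).
    static String[] keepTrailing(String input) {
        return input.split("\\s+", -1);
    }

    // Trim first, then split while keeping trailing empties.
    static String[] trimThenSplit(String input) {
        return keepTrailing(input.trim());
    }

    public static void main(String[] args) {
        String line = "alpha beta  "; // human-typed input with trailing spaces

        System.out.println(Arrays.toString(javaDefault(line)));    // [alpha, beta]
        System.out.println(keepTrailing(line).length);             // 3 (last element is empty)
        System.out.println(Arrays.toString(trimThenSplit(line)));  // [alpha, beta]
    }
}
```

Trimming first also removes leading whitespace, which Java's default split would otherwise turn into an empty first element.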