[Spec] Whitespace Handling

codello commented 5 months ago

Suggestion

Hello. I have been looking for an official spec of the UltraStar file format for years and I'm very happy that there is an effort now to standardize the format.

I think the spec should specify how whitespace is handled in UltraStar files. This includes:

[x] What characters are recognized as whitespace? (This issue)
[x] What characters are recognized as line breaks? #43
[ ] How are empty lines handled? #50
[ ] In what places of the file is whitespace significant? (future issue)

What characters are recognized as whitespace?

USDX recognizes U+0009 (Horizontal Tab) and U+0020 (Space) as whitespace characters (except for P 1, P 2 markers, where only a Space is recognized). I suggest that the spec codifies this and specifies that each a single tab and a single space are recognized as whitespace. Specifically no other unicode characters from the "Whitespace" category (such as U+00A0 non-breaking space or U+2002 en space) are recognized as whitespace by the specification.

Given that I have personally not seen many songs use tabs instead of spaces a simpler approach could also be to only allow the Space character.

What characters are recognized as line breaks?

This question is also discussed in #43.

I think USDX recognizes U+000A (Line Feed, \n), U+000D (Carriage Return, \r) and Windows-Style line breaks (\r\n). Given the various platforms and programming languages that are used to process songfiles I suggest the spec should specify all three types of line breaks as valid (single \n, single \r, and \r\n). I also suggest for the the spec to recommend \n as the preferred line break character that programs should use when writing songfiles.

How are empty lines handled?

The only place where USDX allows empty lines is between the header and the first note (and after the End). In fact USDX seems to allow any text between the header and the first note (or technically the first line starting with a note character). I suggest the spec clarifies where empty lines are allowed and where not. I suggest that the spec disallows empty lines for versions <2.0.0 and allows empty lines for version >=2.0.0. I think allowing empty lines in a textfile can improve readability but allowing empty lines would obviously be a breaking change.

Alternatively the spec could mandate that textfiles must not include empty lines but could allow parsers to ignore empty lines if present. This is inspired by the JSON spec doing something similar for the BOM. Only allowing empty lines between the header and the first note seems kind of arbitrary to me.

Note: Maybe the phrasing of handling the byte order mark might be interesting for this spec as well:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

In what places of the file is whitespace significant?

The spec should clarify where whitespace is ignored and where it is preserved. I can think of the following relevant places:

USDX ignores whitespace at the beginning and end of header keys and values. The following are equivalent:

# AUDIO : foobar
#AUDIO:foobar

I suggest that the spec codifies this. One implication of this decision is that audio/video/image files cannot be referenced if their name starts or ends with a space, but I think this is an acceptable limitation (and is the current behavior of USDX anyways). Similarly the spec should clarify how whitespace in comma-separated lists (#30) are to be handled. I suggest that spaces around these commas should be ignored and that the following are equivalent:

#GENRE:Rock,Pop
#GENRE:Rock, Pop

USDX supports a single space after a P marker. So the following are equivalent

P1
P 1

I suggest that for version >=2.0.0 only P1 is valid.

USDX parses a note line as follows:

: 10 8 6 Some
 ^  ^ ^ ^
 |  | | Exactly one space
 At least one space

The last "exactly one space" is obviously important for notes whose text starts with a space. I suggest that for consistency the spec only allows a single space in the other places as well. The same would hold for line breaks (- 12 and - 12 12 in relative mode).

Use case

When parsing or generating an UltraStar file it is important to have a standardized approach to whitespace handling. This is even more important since there are some places in a textfile where whitespace is significant. A clear specification helps developers write correct parsers and helps prevent different programs from generating incompatible files.

I realize that in practice most textfiles probably don't differ as much as the proposal makes it out to be. I still think it's important that the spec is precise in this regard.

Extra info/examples/attachments

I have only really checked the USDX implementation of the textfile parser and don't currently know how other implementations approach these specifics.

I realize that this is quite a long issue. Please tell me if I should split this into multiple issues to be discussed separately.

basisbit commented 5 months ago

just a short note regarding this topic: when standardizing this, please keep in mind that one of the reasons for the success of the UltraStar txt file system was that people are able to easily open it with a random text editor and adjust it. This should stay possible in the future, despite random text editors probably not knowing the usdx format specification requirements ;-)

RumovZ commented 5 months ago

I wonder if it wouldn't be easier for implementations to adhere to the spec if 'whitespace' included any Unicode whitespace. Modern programming languages typically supprt trimming and splitting on Unicode whitespace out of the box. Of course, this most likely won't matter in practice.

Everything else sounds good! 👍

codello commented 5 months ago

I'd be fine with that as well and I agree that it would probably simplify implementations (except if people use regex, where often \s only includes space and tab).

This would technically break compliance of USDX (and probably other implementations) with the 0.x versions of the spec (because USDX only recognizes space and tab as whitespace). Alternatively this could be a 2.0.0 breaking change but that would make a compliant implementation pretty awkward as it would technically have to guess valid whitespace when checking the #VERSION header and then make sure that the guess was correct.

I'd definitively go the route of retroactively breaking compliance for existing implementations because – as you said – in practice this likely won't matter at all (especially since existing songs don't use other whitespaces than space and maybe tab).

There's also an argument to be made to recognize all Unicode line terminators as valid line terminators although I have never really seen them used. This also could cause issues with certain #ENCODINGs.

RumovZ commented 5 months ago

The value of #VERSION cannot contain whitespace, so it should be parsable regardless, shouldn't it?

But I see, it's complicated. Maybe it'd be best to keep it simple and live with implementations taking some liberties here.

codello commented 5 months ago

The #VERSION value cannot contain whitespace but depending on the decision in this issue we might want to ignore whitespace around the : and after the # (the whitespace wouldn't technically be part of the value).

I think at some point we will have to accept that implementations will in practice probably just use the tools of the respective programming languages and not adhere to the spec 100% (which is fine by me).

codello commented 4 months ago

I think it makes sense to discuss the set of whitespace characters in this issue and open new issues for the question of handling empty lines and significant whitespace.

A quick search leads me to the conclusion that there are basically two approaches to trimming whitespace in programming languages:

Treat any character/byte that is less than or equal to an ascii space (u0020) as a whitespace character, including any ascii control characters ([Java](https://docs.oracle.com/javase%2F7%2Fdocs%2Fapi%2F%2F/java/lang/String.html#trim()), Pascal)
Treat any character as whitespace that is defined as White_Space by Unicode (list of characters). This is the approach taken by Python, C#, and Go for example.

A third option for the spec could be to restrict the set of allowed white space characters, e.g. to just space and tab. Some programming languages seem to do this for example.

Existing implementations behave differently:

USDX seems to mostly recognize space and tab as whitespace.
Vocaluxe seems to recognize Unicode whitespace in headers but only an ascii space in the notes.
USDB_Syncer does not recognize any whitespace in headers and recognizes only ascii spaces in notes.
UltraStar Manager seems to recognize the ascii subset of Unicode spaces.
I am unsure about Performous.

I think that the most “compatible” set of whitespace characters would be just an ascii space and nothing else. In practice this hardly matters but if we want to make the format editable by humans I think allowing all Unicode spaces would be the way to go. If a file contains a Unicode whitespace this could be very hard to identify for non technical users.

In the interest of compatibility we should maybe recommend that implementations only use the ascii space when generating song txts. This seems to be common practice already.

Maybe there are some more comments on this, otherwise we could maybe start voting in a week or so.

codello commented 3 months ago

Summary / Vote now

Are we ready to vote on this? As a quick reminder: This is about the question what constitutes a "whitespace" in the context of the file format. I think the following options have been mentioned:

🚀 A) Only an ASCII space U+0020
🎉 B) ASCII Space and tab U+0020, U+0009
❤️ C) Any character with a code point less than or equal to an ASCII space U+0020
👀 D) Any character defined as White_Space by Unicode (see Wikipedia for a list of the 25 code points), excluding line feed U+000A and carriage return U+000D (these are newlines as per #43)

Please vote via emoticon reaction.

marwin89 commented 3 months ago

thanks for summary @codello . yes let's vote now till 31th march.

codello commented 2 months ago

Sorry for the delay. The decision seems to be unanimous. I have created PR #53 to add this result to the spec.

marwin89 commented 2 months ago

I approved and merged the pr #53 .

UltraStar-Deluxe / format