Open 2colours opened 1 year ago
My impression is that not taking all whitespace into account for quoting structures was a sincere mistake, and it could simply be a Rakudo bug. However, the sole fact that the word quotes don't identify just whitespace, unlike `words`, might hint at something intentional. If it was intentional, the documentation needs to take that into account, and the wrong examples need to be updated.
> My impression is that not taking all whitespaces into account for quoting structures was a sincere mistake and it could simply be a Rakudo bug.
I'm inclined to agree. As additional evidence, if you store a string with non-breaking spaces in a variable and then interpolate it, those non-breaking spaces are treated as word separators:
```raku
# NOTE: all spaces below are non-breaking
say <a b c>.raku;       # OUTPUT: "a b c"
my $str = 'a b c';
say <\qq[$str]>.raku;   # OUTPUT: ("a", "b", "c")
```
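For comparison outside Raku, the same treatment of NBSP as a generic word separator shows up in Python, whose `str.split()` also follows the Unicode whitespace definition (the sample string here is my own, not from the thread):

```python
# Python sketch: NBSP (U+00A0) has Unicode category Zs ("space separator"),
# so generic "split on whitespace" routines treat it as a word boundary.
import unicodedata

s = "a\u00a0b\u00a0c"  # "a b c" written with non-breaking spaces

print(unicodedata.category("\u00a0"))  # Zs
print("\u00a0".isspace())              # True
print(s.split())                       # ['a', 'b', 'c']
```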
Unless someone disagrees, I'd say it's fine to close this issue and open a Rakudo one.
Well…
The question is: what IS the correct one? Non-breaking spaces are tricky. We definitely consider `1,000,000` to be a single word, but oftentimes thousands are separated with non-breaking spaces. They can also be used to designate strings of characters that are conceptually a single word but, for whatever reason, may be split visually.
OTOH, they might be used for something as simple as preventing a break between a quote and a parenthetical notation (which isn't conceptually a single word).
There is no perfect way, and the options are either to break at all whitespace, or to break only at breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).
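To make the two candidate rules concrete, here is a hedged Python sketch (assumption: "non-breaking whitespace" here means NBSP, NNBSP and FIGURE SPACE; WJ and ZWNBS are format characters rather than whitespace in Unicode terms, so neither rule would touch them anyway):

```python
# Two candidate splitting rules, expressed as Python regexes.
import re

text = "1\u202f000 km"  # narrow no-break space grouping the digits

# Rule 1: break at ALL whitespace.
rule1 = re.split(r"\s+", text)

# Rule 2: break only at breakable whitespace (whitespace minus the
# non-breaking set; the set below is an assumption for illustration).
NON_BREAKING = "\u00a0\u202f\u2007"  # NBSP, NNBSP, FIGURE SPACE
rule2 = re.split(r"[^\S" + NON_BREAKING + "]+", text)

print(rule1)  # ['1', '000', 'km']
print(rule2)  # ['1\u202f000', 'km']
```

The `[^\S…]` character class is a common trick: "not non-whitespace, and not in the excluded set" leaves exactly the breakable whitespace.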
> There is no perfect way, and the option is either break at all whitespace, or break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).
Agreed. And, given that the current implementation breaks on almost all spaces, it seems better (and less breaking) to go in that direction. That seems least-bad to me, anyway.

(Incidentally, one area that always plays havoc with word counting/word division is legal citations. How many words should ideally be in `42 U.S.C. § 405(a)`? Does your answer change if I tell you that the last space (but no others) is non-breaking?)

So, yeah, no perfect way.
> Agreed. And, given that the current implementation breaks on almost all spaces, it seems like it's better (and less breaking) to go in that direction. And that seems least-bad to me, anyway.
>
> (Incidentally, one area that always plays havoc with word counting/word division is legal citations. How many words should ideally be in `42 U.S.C. § 405(a)`? Does your answer change if I tell you that the last space (but no others) is non-breaking?)
>
> So, yeah, no perfect way.
So there are really two alternatives, and thankfully Raku allows a module to fill in the other one:
1. `.words` and `qw` are effectively equivalent to `.comb(/<:!Z>+/)`
2. `.words` and `qw` are effectively equivalent to `.comb(/<:!Z+[ ]>+/)` (where the space between the brackets stands for the non-breaking characters)

I'd probably personally go for the first one, except I recall that Larry once said he wanted the `.words` and `.lines` methods precisely so people weren't creating bad regexen to do the job and ending up missing something. The question ultimately comes down to which of the two is more likely to be the DWIM thing. After all, any whitespace that's not space, tab, or newline was probably inserted intentionally.
But I think the first one is ultimately the simplest to explain, and anyone who does need to worry about non-breaking spaces can (at least for `.words`) make a very easy modification in module space. (And probably soon `qw` could too, once slangs are more robust.)
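For readers who want to experiment, here is a rough Python analogue of the two `.comb` alternatives above (assumptions: Raku's `<:Z>` corresponds to Unicode general category `Z*`, and the bracketed set in the second regex is NBSP/NNBSP/FIGURE SPACE):

```python
# Sketch of the two proposed word-splitting behaviors.
import unicodedata

NON_BREAKING = {"\u00a0", "\u202f", "\u2007"}

def split_runs(s, is_sep):
    """Collect maximal runs of characters for which is_sep(ch) is False."""
    out, cur = [], []
    for ch in s:
        if is_sep(ch):
            if cur:
                out.append("".join(cur))
                cur = []
        else:
            cur.append(ch)
    if cur:
        out.append("".join(cur))
    return out

def words_v1(s):
    # ~ .comb(/<:!Z>+/): every category-Z character separates words
    return split_runs(s, lambda ch: unicodedata.category(ch).startswith("Z"))

def words_v2(s):
    # ~ the second alternative: category-Z separates, minus the NB set
    return split_runs(s, lambda ch: unicodedata.category(ch).startswith("Z")
                                    and ch not in NON_BREAKING)

print(words_v1("a b\u00a0c"))  # ['a', 'b', 'c']
print(words_v2("a b\u00a0c"))  # ['a', 'b\u00a0c']
```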
For what it's worth, I also think it's easier to "make peace" by sticking to the "all whitespace separates words" concept. The way I see it, a non-breaking space is usually about visual presentation rather than about the number of words. For example, you wouldn't want to break a movie title or something similar that strongly represents one concept. I don't know the definition of a "word", but in my mind such a... well, string? would still consist of multiple words, just presented in a certain way.
For numbers, this explanation may be less useful, but I wouldn't call a sequence of digits a "word" whether it's separated by whitespace or something else. Anyway, I wonder: does Raku parse nbsp-separated numbers in the first place? If not, then I don't think this is something to really take into account for the concept of words.
> does Raku parse nbsp-separated numbers in the first place
No, but it doesn't parse comma-separated numbers either (it only allows underscores to space digits). It does treat comma-separated numbers as a single word for the purposes of `.words`, though. The issue isn't theoretical, though: since CLDR has started using both full and narrow no-break spaces in many of its number formats for major world languages, there are quite a few numbers floating around in the real world that are spaced accordingly, and that number will grow. I would think most people understand that if `.words` slurps up words, it will slurp up different formats of numbers (such as `123,456.789`) and they will need to reparse those accordingly.
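As a sketch of that "reparse accordingly" step, here is one way the normalization could look in Python (the separator list is my assumption for illustration, not an exhaustive CLDR inventory):

```python
# Normalize a CLDR-style grouped number back to a machine-readable float.
def parse_grouped(num: str) -> float:
    # Strip common group separators: comma, NBSP, NNBSP (assumed set).
    for sep in (",", "\u00a0", "\u202f"):
        num = num.replace(sep, "")
    return float(num)

print(parse_grouped("123,456.789"))       # 123456.789
print(parse_grouped("123\u202f456.789"))  # 123456.789
```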
> I don't know the definition of a "word" but in my mind, such a... well, string? would still consist of multiple words, just presented in a certain way.
Just as background, the definition of a word is fairly nebulous. In English, in fact, we very frequently see a progression in terms from foo bar to foo-bar to foobar (these are called, respectively, open, hyphenated, and closed compound words), but not all words go the full path (ice cream is just as conceptually a single thing as rainbow, and personally, given that the former is a trochee and the latter a spondee, I'd actually argue the former should be one word and the latter two). Different languages give different examples of how what's written as one word is really multiple, or how multiple written words are really one (Spanish gives a wonderful example of both: se lo dije vs díjeselo, where the words/affixes are written separately or together based on position, but they are still pronounced as a single word unit either way).
But that's just background, for the purpose of words that might have internal spaces, I'd agree that they should probably be expected to be split. The question might revolve more around what's going to be more common: encountering a purely formatting space with words, or encountering a
I suppose we could split the baby by enforcing number boundaries (where a non-breaker would be considered part of the word if surrounded on both sides by a number; thus units would be split off, but not the numbers that make them up, as those often also have NBSPs), but that's one of those extra complexities where I'm not sure whether it's better to leave it to a module (because it would induce more surprise if default) or make it part of the core (because despite the complexity, it would produce the least complexity for the typical user).
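That "number boundary" compromise can be prototyped with a lookaround pattern: a non-breaking space survives only when a digit sits on both sides, while every other whitespace run splits. A Python sketch (the pattern and the NBSP/NNBSP set are illustrative assumptions, not a proposed implementation):

```python
# Split on whitespace runs, except a single NBSP/NNBSP between two digits.
import re

SPLIT = re.compile(r"(?!(?<=\d)[\u00a0\u202f](?=\d))\s+")

print(SPLIT.split("3\u00a0000\u00a0000 km"))  # ['3\u00a0000\u00a0000', 'km']
print(SPLIT.split("foo\u00a0bar"))            # ['foo', 'bar']
```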
> No, but it also doesn't parse comma separated numbers either (it only allows for underscores to space digits).
To be honest, I also don't "parse" comma-separated numbers as one number, either. :P Or well, if it has exactly one comma, I would parse it as a fractional number... Anyway, I'm perfectly fine with settling on neither being one word, by this highly arbitrary and superficial definition of words.
> (...) what's going to be more common: encountering a purely formatting space with words, or encountering a
Something seems to be missing here, doesn't it?
Anyway, I think we are getting further and further from the issue of generic word-splitting. Many of these things could be addressed by providing built-in regexes/tokens; for example, I would be happy if the patterns used by the Rakudo parser could be accessed some way, especially since it uses many standardized concepts, so it could easily be "more Raku than Rakudo" per se.
This is more food for thought than anything else, but here is what a few other programs make of the string `foo bar baz` (with nbsp):
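(One additional data point, not from the survey above: Python's default splitter honors all Unicode whitespace, NBSP included, while splitting on an explicit ASCII space keeps the NBSP-joined word together.)

```python
# Python's two split behaviors on the same NBSP-containing string.
s = "foo\u00a0bar baz"

print(s.split())     # ['foo', 'bar', 'baz'] -- any Unicode whitespace splits
print(s.split(" "))  # ['foo\u00a0bar', 'baz'] -- only the ASCII space splits
```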
@alabamenhu said:
> ... break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).
Agreed.
Maybe a `words` which respects `&NBSP`, and a `Words` which doesn't?
And/or add a named argument to `words`, something to the effect of `:nbsp-family`, defaulting to `False`?
Hello,
The TL;DR of this issue would be: non-breaking spaces are handled differently by `words` and word quoting structures, despite both only talking about whitespace. This also makes a number of doc code examples wrong about their output.
The process of discovery was the following:
- https://docs.raku.org/language/traps#___top, the "using Set subroutines (...)" part: `"a b"`, and returns `False` for both code examples
- https://docs.raku.org/language/quoting#index-entry-quote_%3C%3C_%3E%3E-quote_%C2%AB_%C2%BB-Word_quoting_with_interpolation_and_quote_protection:_%C2%AB_%C2%BB: `<<>>`, but that way also doesn't match the output in the docs: `("42 b", "c ")`. Reason: non-breaking space!
- https://docs.raku.org/language/quoting#Word_quoting:_qw says:
- `words` does seem to match this description and produce the supposed output with non-breaking spaces as well
- both can make sense, but which one is correct?