bcpierce00 / unison

Unison file synchronizer
GNU General Public License v3.0
4.08k stars 229 forks source link

[doc] Unclear wording about supported/unsupported Regex syntax #949

Closed rulatir closed 1 year ago

rulatir commented 1 year ago

In the section "Path Specification" of the official documentation, please consider documenting expressly which of these constructs are supported and which are not:

  1. [0123456789]
  2. [0-9]
  3. \d
  4. (0|1|2|3|4|5|6|7|8|9)
  5. (?:0|1|2|3|4|5|6|7|8|9)
  6. [:digit:]

In the wild, the term "character classes" is so often misused, used imprecisely, used juuuuuust-precisely-enough-to-trick-you-into-believing-that-something-might-be-possible, etc., that an average reader of the documentation cannot trust that any specific use of it is precise. For example, if "character classes" specifically means named character classes (№ 6 in the list) then that might be good news, because constructs like [0123456789], which are sometimes referred to as "character sets", might still be supported. OTOH if the term is really used with the exact precise meaning of the POSIX specification then that's probably bad news, because when such a big part of regex as (general) character classes is unsupported, then who will brave a bet on parenthesized subexpressions and alternatives?

(edited to point to specific section of the documentation and to provide further rationale)

gdt commented 1 year ago

This report lacks a refernence to where there is any mention of regex, and which version of what source repo artifact you reviewed. I'm not claiming that the essence of what you are reporting is untrue -- but this report is not as helpful as it could be.

(And, almost certainly, the wording shoudl be clarified to refer to existing standards, rather than open-coding them.)

rulatir commented 1 year ago

As an end user (the typical bug reporter demographic), I am not referring to source repo artifacts. I am referring to compiled, rendered documentation of the current stable release, as available at https://raw.githubusercontent.com/bcpierce00/unison/documentation/unison-manual.txt. I am specifically referring to the section "Path Specification", and the following parenthesis:

(The collating sequences and character classes of full Posix regexps are not currently supported).

I also believe that:

Note that not including such information requires additional trust from the documentation reader: namely they must trust that the documentation author themselves are using the standard's terminology correctly and unambiguously.

This kind of trust is not possible in this case. I have seen accomplished programmers and computer scientists use the term "character class" to mean the following sets of constructs from the numbered list in my report:

All of them referred to a regex standard in the same sentence. Only some of them were in fact using the term in accordance with its definition in the standard they were referring to. Looking at a use of the term "character class" in documentation, I cannot possibly know if the author uses the term correctly (with respect to the standard mentioned), because it is my experience that incorrect, ambiguous and utterly confused use of that term is a frequent occurrence in the wild.

gdt commented 1 year ago

Looking at the code briefly, It seems likely that items 4, 5, and 6 of section 9.3.5 are not supported, but 1,2,3 are.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

The use of 'character class" is clear, but the documentation says "collating sequences" when it should probably say "collating symbol" and "equivalence class expression". The source code (src/base/rx.ml) seems to be a regexp implementation from a simpler time.

To make progress, you could test with various expressions and add a comment with what worked and what didn't. (I don't even use regexps, so I don't have this problem.) It would then be reasonable to clarify the language.

If you 'd like to talk about changing the source to more fully support POSIX EREs, please bring that up on unison-hackers.

rulatir commented 10 months ago

There is a typo in the commit ("expresssion").

gdt commented 10 months ago

fixed, thanks.