Open pbiggar opened 12 months ago
Wow, this one is quite the sight 🙈.
I don't have any immediate idea what could be going wrong here.
We normally just copy the string straight from the source text. So I didn't expect unicode strings to matter in this case.
String.toList "Zzzzzzzzzzzz" = [ c "Z̤͔ͧ̑̓"; c "ä͖̭̈̇"; c "lͮ̒ͫ"; c "ǧ̗͚̚"; c "o̙̔ͮ̇͐̇" ]
didn't reproduce it for me, so I'm not sure what to make of this one.
What I imagine is happening here is that we're using the Length
of a string to determine indentation, as opposed to the number of Extended Grapheme Clusters (which correspond to visual on-screen characters).
I checked to see how we do this in Darklang and found this:
System.Globalization.StringInfo(s).LengthInTextElements
That is an interesting pointer, thanks! I'll try and follow-up on this.
Hmm, the range of "ä͖̭̈̇"
in c "ä͖̭̈̇"; a
seems to be wrong. (online tool)
SynExpr.Const(constant = SynConst.String(text = "ä͖̭̈̇", synStringKind = SynStringKind.Regular, range = R("(1,2--1,10)")), range = R("(1,2--1,10)"))
That is definitely part of the problem and would need a fix on the compiler side.
Most likely lhs parseState
in doesn't do unicode well.
https://github.com/dotnet/fsharp/blob/681069fbebcdff312645e61e4970a6dd403ff0ee/src/Compiler/pars.fsy#L3289-L3291
EDIT: Maybe not. Not sure what to make of that.
I guess the range is using the number of bytes, and I'm guessing this is an 8 byte unicode character. Fantomas clearly needs to use unicode length, but I don't know about the compiler. Is the compiler's range field supposed to be length in bytes or length on screen? If it's trying to do error reporting, it might be length on screen (in which case it's incorrect here).
Could fantomas use the text to get the EGC length instead of using range? I'm guessing that would fix it (though it might miss other cases like eg Match Patterns).
Is the compiler's range field supposed to be length in bytes or length on screen?
I believe that will be the length on the screen.
Could fantomas use the text to get the EGC length instead of using range?
Possibly, so Fantomas doesn't use the string value that is stored in the AST because it is an optimized representation of the value. There are multiple scenarios where this is beneficial.
When we grab the string from the source text we probably need to do something more clever when there is unicode involved in https://github.com/fsprojects/fantomas/blob/e671f3d7c68a258d80f6440ea82aaada2c48a34d/src/Fantomas.Core/ISourceTextExtensions.fs
We can detect the difference between EGC and range:
in
I'm just not really sure, how to extract the right thing from the ISourceText
.
Issue created from fantomas-online
Code
Result
Problem description
This is formatted weiiiiiird! Though the code is valid still.
FYI, we have lots of F# unicode edgecases in the darklang codebase, esp in string.dark
Extra information
Options
Fantomas main branch at 1/1/1990