fsprojects / fantomas

FSharp source code formatter
https://fsprojects.github.io/fantomas
Other
764 stars 190 forks source link

Unicode leads to incorrect indentation #2945

Open pbiggar opened 12 months ago

pbiggar commented 12 months ago

Issue created from fantomas-online

Code

String.toList "Zalͮ̒ͫgoZalͮ̒ͫgo" = [ c "Z"; c "a"; c "lͮ̒ͫ"; c "g"; c "o" ]

Result

String.toList "Zalͮ̒ͫgoZalͮ̒ͫgo" = [ c "Z"
                                                                             c "a"
                                                                             c "lͮ̒ͫ"
                                                                             c "g"
                                                                             c "o" ]

Problem description

This is formatted weiiiiiird! Though the code is valid still.

FYI, we have lots of F# unicode edgecases in the darklang codebase, esp in string.dark

Extra information

Options

Fantomas main branch at 1/1/1990

    { config with
                IndentSize = 2 }
nojaf commented 11 months ago

Wow, this one is quite the sight 🙈. I don't have any immediate idea what could be going wrong here. We normally just copy the string straight from the source text. So I didn't expect unicode strings to matter in this case. String.toList "Zzzzzzzzzzzz" = [ c "Z̤͔ͧ̑̓"; c "ä͖̭̈̇"; c "lͮ̒ͫ"; c "ǧ̗͚̚"; c "o̙̔ͮ̇͐̇" ] didn't reproduce it for me, so I'm not sure what to make of this one.

pbiggar commented 11 months ago

What I imagine is happening here is that we're using the Length of a string to determine indentation, as opposed to the number of Extended Grapheme Clusters (which correspond to visual on-screen characters).

I checked to see how we do this in Darklang and found this:

System.Globalization.StringInfo(s).LengthInTextElements

nojaf commented 11 months ago

That is an interesting pointer, thanks! I'll try and follow-up on this.

nojaf commented 11 months ago

Hmm, the range of "ä͖̭̈̇" in c "ä͖̭̈̇"; a seems to be wrong. (online tool)

SynExpr.Const(constant = SynConst.String(text = "ä͖̭̈̇", synStringKind = SynStringKind.Regular, range = R("(1,2--1,10)")), range = R("(1,2--1,10)"))

That is definitely part of the problem and would need a fix on the compiler side.

Most likely lhs parseState in doesn't do unicode well. https://github.com/dotnet/fsharp/blob/681069fbebcdff312645e61e4970a6dd403ff0ee/src/Compiler/pars.fsy#L3289-L3291

EDIT: Maybe not. Not sure what to make of that.

pbiggar commented 11 months ago

I guess the range is using the number of bytes, and I'm guessing this is an 8 byte unicode character. Fantomas clearly needs to use unicode length, but I don't know about the compiler. Is the compiler's range field supposed to be length in bytes or length on screen? If it's trying to do error reporting, it might be length on screen (in which case it's incorrect here).

Could fantomas use the text to get the EGC length instead of using range? I'm guessing that would fix it (though it might miss other cases like eg Match Patterns).

nojaf commented 11 months ago

Is the compiler's range field supposed to be length in bytes or length on screen?

I believe that will be the length on the screen.

Could fantomas use the text to get the EGC length instead of using range?

Possibly, so Fantomas doesn't use the string value that is stored in the AST because it is an optimized representation of the value. There are multiple scenarios where this is beneficial.

When we grab the string from the source text we probably need to do something more clever when there is unicode involved in https://github.com/fsprojects/fantomas/blob/e671f3d7c68a258d80f6440ea82aaada2c48a34d/src/Fantomas.Core/ISourceTextExtensions.fs

We can detect the difference between EGC and range: image

in

https://github.com/fsprojects/fantomas/blob/0ce91b778f216c2d7fa8286d1aa4aa5dbf835bcc/src/Fantomas.Core/ASTTransformer.fs#L87-L94

I'm just not really sure, how to extract the right thing from the ISourceText.