bblfsh / sdk

Babelfish driver SDK
GNU General Public License v3.0
23 stars 27 forks source link

Splits suffixes in comments correctly #434

Closed ncordon closed 4 years ago

ncordon commented 4 years ago

The suffixes of the comments in semantic tree were not split correctly, due to slices of strings working with bytes in Go, and not with unicode characters. For example, without this fix, in:

// “comment”

the final was translated into b'\xe2\x80\x9d', but the suffix was tell to cut the slice of bytes in the position i, where i is the \xe2, not the final of the unicode character (which would be x9d). And that's because a unicode character can span more than one byte.

Closes #433

Signed-off-by: ncordon nacho.cordon.castillo@gmail.com


This change is Reviewable

ncordon commented 4 years ago

Sorry @dennwc, I was gathering strength to change the whole method to runes. Even if it is not so urgent it can be a good exercise for me to become more fluent in Go. So I am going to try doing it :smiley_cat:

ncordon commented 4 years ago

uast/transformer/semantic.go, line 199 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
@ncordon Sorry, it was my mistake. I see now why this specific change works. The only suggestion would be to change `DecodeRune` to `DecodeRuneInString` to avoid conversion to `[]byte`. Also, since you feel uncomfortable with changing the code to `[]rune`, please add a `TODO` here and open an issue in SDK. I think we still need to address the same issue for the prefix (if it contains other Unicode space-like characters). I left a few notes here to point places relevant to an issue with prefixes. Those are just for the context, no action is needed for this PR.

@kuba-- I will test it once I have it all changed to runes.

kuba-- commented 4 years ago

@ncordon - check CI

ncordon commented 4 years ago

@ncordon - check CI

The CI is failing because of the API limit @kuba--

ncordon commented 4 years ago

coverage.txt, line 2 at r4 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Please delete this file. It's probably from a code coverage tool.

Done!

ncordon commented 4 years ago

go.mod, line 24 at r4 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Unrelated change. Is it because you used pprof for debugging, or it's from `go mod tidy` with no other changes?

I have not used this. Probably it has been go mod tidy

ncordon commented 4 years ago

uast/transformer/semantic_test.go, line 44 at r4 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
`"/*\t\t\u0020 ½ */"`? Also, isn't `\u0020` a space?

Changed!. Yes, it is, I took it from the unicode.IsSpacedocumentation

ncordon commented 4 years ago

uast/transformer/semantic.go, line 149 at r2 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Related to the prefix issue: here we accept any Unicode character of class "space" to be a part of the prefix, so we can match some multi-byte ones as well.

Addressed

ncordon commented 4 years ago

uast/transformer/semantic.go, line 181 at r3 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Yes, if you check `strings.Index`, it will return `-1` if the index was not found. Also please remove `;`.

Changed, but I am not comfortable with this change. It obscures code IMO

ncordon commented 4 years ago

uast/transformer/semantic.go, line 181 at r3 (raw file):

Previously, ncordon (Nacho Cordón) wrote…
Changed, but I am not comfortable with this change. It obscures code IMO

The -1 thing I mean, not the ;

ncordon commented 4 years ago

uast/transformer/semantic.go, line 186 at r3 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
@ncordon you are right in the sense that we will end up converting back and forth between `string` and `[]rune`. Unfortunately there is no helper in stdlib that works on an array of runes. But, I also see that the result of the `firstIndexFunc` and `lastIndexFunc` function is used to split the string, not an array of runes. So @kuba-- has a point here: instead of returning a rune offset, you can calculate the byte offset in the string, and then split the string using it.

Please note that firstIndexFunc is also used to compute the tab. And that is made in [] rune not bytes

ncordon commented 4 years ago

uast/transformer/semantic.go, line 188 at r4 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Please remove parentheses.

Done!

ncordon commented 4 years ago

uast/transformer/semantic.go, line 220 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Please remove parentheses.

Done!

ncordon commented 4 years ago

uast/transformer/semantic.go, line 197 at r3 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Or index in bytes, as @kuba-- mentioned in the previous comment. @ncordon should address your concern, right?

Done!

ncordon commented 4 years ago

uast/transformer/semantic.go, line 186 at r3 (raw file):

Previously, kuba-- (Kuba Podgórski) wrote…
My intention was - `firstIndexFunc` as an independent thingy should follow some standard and return -1. BUT in our case it's used in one place and having `-1` as an output is not handy. So what I suggested in comment(s) bellow is to return a `[]rune` (instead of an index). In other words, having ``` func firstSliceFunc(runes []rune, f func(r rune) bool) []rune { var i int for r := range runes { if (f(r)) { return runes[:i+1] } i++ } return runes[:i] ``` So, in a use case instead of doing: ``` i := firstIndex s = s[:i] ``` you can just: ``` s = firstSliceFunc(s, f) ``` and similar approach for `last...` Do you know what I mean?

Changed now! Thanks, I was not understading correctly what you meant

ncordon commented 4 years ago

uast/transformer/semantic.go, line 242 at r2 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Related to the prefix issue: here we check the `tab` which may contain multi-byte characters, but we compare bytes instead of runes.

Addressed with the changes to []rune

ncordon commented 4 years ago

uast/transformer/semantic_test.go, line 44 at r4 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
I mean, the `0x20` is an ordinary space. You probably wanted something else, like any other kind of Unicode space, correct?

Changed now

ncordon commented 4 years ago

uast/transformer/semantic.go, line 177 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Please remove parentheses.

Done!

ncordon commented 4 years ago

go.mod, line 24 at r4 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Can you please also check that you don't have any temporary files? `go mod tidy` processes all files in the directory, not only ones committed to Git.

Done! Sorry, I do not know what happened there. I was using make test-coverage which generated the coverage.txt, but once I deleted it and did go mod tidy again these things disappeared

ncordon commented 4 years ago

driver/manifest/discovery/discovery_test.go, line 41 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Can you please rebase? We made a proper fix for this.

Done!

ncordon commented 4 years ago

uast/transformer/semantic.go, line 147 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
In Go, comments must begin with a function name: `isTabToken check if a token is a tab. A tab is defined ...`. Same for other comments.

Done!

ncordon commented 4 years ago

uast/transformer/semantic.go, line 189 at r8 (raw file):

Previously, kuba-- (Kuba Podgórski) wrote…
For sure.. - I suggest to `for i := len(runes) - 1; i > 0; i--`

Done!

ncordon commented 4 years ago

uast/transformer/semantic.go, line 247 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
```golang var tab []rune ```

Done!

ncordon commented 4 years ago

uast/transformer/semantic.go, line 251 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
We shouldn't trim tab from the first line - it's considered a `Prefix`.

True, fixed

ncordon commented 4 years ago

uast/transformer/semantic.go, line 198 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Can we reuse it? Both `findPrefix` and `findSuffix` convert the whole string this way. Either consider using `utf8.DecodeRuneInString`/`utf8.DecodeLastRuneInString` or somehow share the slice that is converted to runes.

I have changed it so that findPrefix and findSuffix reuse it and also the main body of the function, but certainly I would have preferred to keep the former version because of readability of the main algorithm of Split. We can do what you consider best, but my opinion is that what you save from those 3 conversions is not worth obscuring the main body of the Split method, and the algorithm is still linear in the length of the string

ncordon commented 4 years ago

driver/manifest/discovery/discovery_test.go, line 41 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Hmm, I still see this line in the PR, and I don't see the fix from the master here.

I rebased now, sorry about that, I forgot to do it

ncordon commented 4 years ago

uast/transformer/semantic.go, line 235 at r10 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Please remove parentheses

Done, I keep forgetting this because of C++ habits

ncordon commented 4 years ago

uast/transformer/semantic.go, line 266 at r10 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
Please remove parentheses

Done

ncordon commented 4 years ago

uast/transformer/semantic.go, line 289 at r10 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
This is very slow and creates a lot of garbage! Please revert to `strings.Join`.

I have used a strings.Builder instead

ncordon commented 4 years ago

uast/transformer/semantic.go, line 198 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
In that case you should use `DecodeRune*`, as I suggested ;)

I thought about that Denys, but first it adds a lot of extra work and second there is no way to get this more efficiently (converting and only one time to runes, without conversions back and forth between runes and strings). Using DecodeRune* was an option but I have to convert to []rune anyway to strip the tabs as was requested, so I prefer to do all the code with []rune and not a part with strings and another with []rune

ncordon commented 4 years ago

uast/transformer/semantic.go, line 198 at r8 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
But you can still trim tabs with `DecodeRune`. Am I missing something? This function in general takes a string and splits it into N parts (prefix, tab, lines, suffix). So it's definitely possible to use an original `string` value and slice it. This means that it's possible to implement it without any allocations by using `DecodeRune`. This won't require you to pass the `[]rune` slice around. So do you want to try the `DecodeRune` approach, or should I review the current one?

Review the current one :pray: . This has been opened for a month and it is not such a big deal of fix IMHO

ncordon commented 4 years ago

uast/transformer/semantic.go, line 277 at r11 (raw file):

Previously, dennwc (Denys Smirnov) wrote…
The last bit that may affect performance: please move this line to the `if len(sub) == 0 {` below, so we won't create a large string that will become garbage immediately.

I have changed it there

ncordon commented 4 years ago

uast/transformer/semantic.go, line 278 at r12 (raw file):

  var tab []rune
  // fast path, no tabs
  if len(sub) == 0 {

@dennwc I have also add it here since if we return here it means there is no common tab and we need to set the text and the tab from the comment