boyter / scc

Sloc, Cloc and Code: scc is a very fast accurate code counter with complexity calculations and COCOMO estimates written in pure Go
MIT License
6.27k stars 250 forks source link

BUG: Multicharacter support #466

Open boyter opened 1 month ago

boyter commented 1 month ago

Describe the bug

The addition of wenyan language via https://github.com/boyter/scc/pull/465 highlights an issue in how scc matches. If you run it against any wenyan file it will not count the complexity. This is down to how the scc state machine works where it is matching on characters not runes.

To Reproduce

go run . ./examples/language/wenyan.wy

from inside the scc directory on a recent checkout.

Expected behavior

It is expected that there will be some complexity counted here.

dbaggerman commented 1 month ago

It's been a while since I touched the code, but from what I remember the matching was done on bytes rather than characters. It should be able to match unicode byte sequences just as well as ASCII ones.

However, that relies on both languages.json and the source code using the same encoding. If they're both utf8 encoded with the same byte sequence then in theory it should be able to match them. On the other hand, if one is utf8 and the other is utf16 the underlying binary representation won't match so a byte sequence comparison will fail.

boyter commented 1 month ago

Yep you are correct. That's the exact reason for it.

Where I am debating is if it should be fixed in the state machine OR there be the ability to add new state machines for individual languages, which is something I wanted anyway in order to resolve some annoying issues, and as a potential performance improvement, for common languages such as Java as they could get their own specific code path.

dbaggerman commented 1 month ago

I remember seeing comments about the idea of language specific state machines to support the case of having one language nested within another.

The current state machine is built around looking up the bytes against a fixed size 256 element array, which wouldn't adapt well to converting to unicode runes. The trees are probably not dense enough to justify a binary search within a node, so doing a linear search within each node might be the most obvious solution.

Another solution would be to use the current Trie implementation, but re-encode the language tokens into a Trie per encoding. It would probably require some refactoring to track a Trie or LanguageFeature per encoding, but might perform better than a linear search.

boyter commented 1 month ago

The idea about the language specific state machine was yes to support that idea, as well as skip a heap of logic that does not apply where possible. For example java does not supported nested multiline comments, so skip the bookkeeping for that and hopefully gain a performance improvement. I would only be looking to do this for the most common languages though, as the generic version is much better for dealing with ad-hoc addition. It also would allow for dealing with any really oddball languages that pop up.

My initial plan is to just see what affect some rune conversion has as a baseline and go from there.