Open jfioasd opened 1 year ago
I think there's a major oversight in your parsing. &# is not really meaningful V code (it's valid, but not exactly 'useful'), so there's absolutely no way that's the most common 2-byte sequence in V code. Look at this answer, for example: https://codegolf.stackexchange.com/a/124772/31716
The markdown for that answer is
<pre><code>Í.&#147;op
</code></pre>
which renders on SE as
Í.“op
Also, <C-x> means "ctrl-x", but this parser treats it as 5 distinct bytes instead of 1, which is why <C, C-, and esc all score so high. It seems like this parser isn't sophisticated enough to handle the way V answers tend to be formatted.
Thanks, I thought V was an SBCS. I'll try to parse <...> and &#...; in my analyzer.
Apparently, SE uses CP-1252, so I'll use the 05AB1E codepage to display it (replacing these sequences with the respective characters).
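For anyone following along, here is a rough sketch of what collapsing `&#...;` sequences before counting 2-byte sequences could look like. It is not the analyzer's actual code: the names `refs_to_chars` and `byte_pairs` are made up for illustration, the input string is a toy example, and it assumes every remaining character fits in Latin-1 (anything else raises `UnicodeEncodeError`).

```python
import re

# Matches decimal numeric character references such as "&#147;".
NUMERIC_REF = re.compile(r"&#(\d+);")


def refs_to_chars(source: str) -> str:
    """Replace each &#N; with the single code point N (only when N < 256)."""
    def repl(match: re.Match) -> str:
        n = int(match.group(1))
        # Leave references outside the single-byte range untouched.
        return chr(n) if n < 256 else match.group(0)

    return NUMERIC_REF.sub(repl, source)


def byte_pairs(source: str) -> list[bytes]:
    """Return every overlapping 2-byte sequence after collapsing references.

    Assumption: one Latin-1 byte per character is the right byte model here.
    """
    data = refs_to_chars(source).encode("latin-1")
    return [data[i:i + 2] for i in range(len(data) - 1)]


# Toy input: the six-character reference counts as one byte, not six.
print(byte_pairs("a&#147;b"))
```

With the reference collapsed first, `&#` can no longer show up as a 2-byte sequence in the counts.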
It is an SBCS; it's just that the answers are frequently formatted in "readable mode" with things like <C-a>, <esc>, <M-D>, etc. That's one additional thing that would need to be parsed: <M-x> means "alt-x", which would mean 'x' with the high bit set in latin9, or ø.
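A minimal sketch of how the "readable mode" notation could be collapsed to one character per key, in case it helps. The alt rule (set the high bit) is the one described above; the ctrl rule (keep the low five bits of the uppercase letter) and mapping esc to 0x1B are the usual ASCII conventions and are assumptions about how V treats them. `key_to_byte` and `collapse_keys` are hypothetical names, not part of the analyzer.

```python
import re

# Recognises <C-x>, <M-x> and <esc> (case-insensitive); other <...> forms are left alone.
KEY = re.compile(r"<(C-.|M-.|esc)>", re.IGNORECASE)


def key_to_byte(name: str) -> int:
    """Map the text inside the angle brackets to a single byte value."""
    if name.lower() == "esc":
        return 0x1B  # assumed: standard ASCII escape
    prefix, char = name[0].upper(), name[2]
    if prefix == "C":
        # Assumed ASCII control convention: ctrl-x -> 0x18, ctrl-a -> 0x01.
        return ord(char.upper()) & 0x1F
    # Alt/meta: the same character with the high bit set, e.g. 'x' -> 0xF8.
    return ord(char) | 0x80


def collapse_keys(source: str) -> str:
    """Replace each recognised key sequence with its one-character equivalent."""
    return KEY.sub(lambda m: chr(key_to_byte(m.group(1))), source)


print(collapse_keys("<M-x>"))                 # alt-x as a single character
print(len(collapse_keys("<C-a><esc><M-D>")))  # three keys, three characters
```

Running this before counting means `<C`, `C-`, and `esc` stop appearing as byte sequences of their own.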
Done
Similar to this issue, I decided to run Lynn's method on V answers.
Query used. Code:
Results (displayed in the 05AB1E codepage):