V Corpus - GitHub Issues

jfioasd commented 1 year ago

Similar to this issue, I decided to run Lynn's method on V answers.

import csv
import collections

digraphs = collections.Counter()
trigraphs = collections.Counter()
quadgraphs = collections.Counter()

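# Despite the name, this 256-character table is the 05AB1E codepage.
# It is indexed by byte value and used to display each byte as one character.
cp1252 = "ǝʒαβγδεζηθ\nвимнтΓΔΘιΣΩ≠∊∍∞₁₂₃₄₅₆ !\"#$%&'()*+,-./0123456789" + \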
                                 ":;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrst" + \
                                 "uvwxyz{|}~Ƶ€Λ‚ƒ„…†‡ˆ‰Š‹ŒĆŽƶĀ‘’“”•–—˜™š›œćžŸā¡¢£¤¥¦§¨©ª«¬λ®¯°" + \
                                 "±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëì" + \
                                 "íîïðñòóôõö÷øùúûüýþÿ"
with open("QueryResults(2).csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if row[0] == "Post Link":
            continue
        code = row[1]
        if "<pre><code>" not in code:
            continue

        # Extract the first code block of the answer.
        # (The variable is named `vyxal`, but it holds V code here.)
        vyxal = (
            code.partition("<pre><code>")[2]
            .partition("</code></pre>")[0]
            .strip()
        )
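        # Undo the HTML escaping of the named entities.
        vyxal = vyxal.replace("&quot;", '"')
        vyxal = vyxal.replace("&gt;", ">").replace("&lt;", "<")

        # Each numeric entity (e.g. &#147;) stands for one byte; map it through the table.
        for i in range(256):
            vyxal = vyxal.replace("&#"+str(i)+";", cp1252[i])
        # Decode &amp; last so that an escaped "&amp;#147;" is not decoded twice.
        vyxal = vyxal.replace("&amp;", "&")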
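        # "Readable mode" key notation: each <...> token is a single byte.
        vyxal = vyxal.replace("<esc>", cp1252[0x1b])
        # <C-a> .. <C-z> are the control bytes 0x01 .. 0x1a.
        alpha = "abcdefghijklmnopqrstuvwxyz"
        for idx, i in enumerate(alpha):
            vyxal = vyxal.replace("<C-"+i+">", cp1252[idx+1])

        # <M-x> is alt-x: 'x' with the high bit set (0xF8). Other <M-...> keys are not handled here.
        vyxal = vyxal.replace("<M-x>", "ø")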

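        # Skip answers in which any single character occurs 10 or more times.
        if any(vyxal.count(c) >= 10 for c in vyxal):
            continue
        # Skip answers longer than 100 characters.
        if len(vyxal) > 100:
            continue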

        # Count n-grams line by line so that none spans a line break.
        for line in vyxal.split("\n"):
            for (a, b) in zip(line, line[1:]):
                digraphs[a, b] += 1
            for (a, b, c) in zip(line, line[1:], line[2:]):
                trigraphs[a, b, c] += 1
            for (a, b, c, d) in zip(line, line[1:], line[2:], line[3:]):
                quadgraphs[a, b, c, d] += 1

with open("most-common.txt", "w", encoding="utf-8") as f:
    f.write("2-graphs:\n")
    for d, n in digraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))

    f.write("\n3-graphs:\n")
    for d, n in trigraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))

    f.write("\n4-graphs:\n")
    for d, n in quadgraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))
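
The three zip loops in the script follow one pattern. A minimal sketch that generalises them to any n is below; the helper name count_ngrams is hypothetical and not part of the script, and it keys the counter by substring rather than by tuple:

```python
import collections

def count_ngrams(lines, n):
    """Count every run of n adjacent characters, never crossing a line break."""
    counts = collections.Counter()
    for line in lines:
        # range() is empty when the line is shorter than n, so short lines add nothing.
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return counts

# Made-up sample input, only for illustration.
sample = "ddxx\nxxp".split("\n")
digraphs = count_ngrams(sample, 2)
print(digraphs.most_common(1))
```

With string keys, the "".join(d) step in the output loop would no longer be needed.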

Results (displayed in the 05AB1E codepage):

2-graphs:
  24 Àñ
  24 ./
  21 Àé
  21 @"
  19 $x
  17 xx
  17  /
  17   
  15 /&
  15 dd
  15 «©
  14 Íî
  14 òÍ
  14 12
  13 2i
  13 lD
  13 Gp
  13 ll
  13 é 
  12 Θ"
  12 /d
  12 / 
  12 Yp
  11 r 
  11 lx
  11 e 
  11  ₂
  11 Ó.
  11 òd
  10 kl

3-graphs:
  11 ./&
  10 Ó./
   8 [ae
   8 aei
   8 eio
   8 iou
   8 "qp
   7 YGp
   7 /&ò
   7 lxx
   7 $xh
   7 D@"
   6 Í./
   6 xx>
   6 Ä$x
   6 qpx
   6 Àé 
   5 Àé*
   5 òÍ¨
   5 Àñ
   5 ou]
   5 ¨ä«
   4 ©î±
   4 ¨[a
   4 ]«©
   4 «©¨
   4 «©/
   4 òÍî
   4 /  
   4 /12

4-graphs:
   8 [aei
   8 aeio
   8 eiou
   8 Ó./&
   6 ./&ò
   6 "qpx
   5 iou]
   4 ¨[ae
   4 òÄ$x
   4 Ä$xh
   4 ~"qp
   4 :se 
   4 2i2i
   4 ¨ä«©
   3 Í./&
   3 ¨.«©
   3 lxx>
   3 iouy
   3 ouy]
   3 uy]«
   3 À|lD
   3 Ñ~"q
   3 ./& 
   3 òhYp
   3 hYpX
   3 :sor
   3 éiD@
   3 iD@"
   3 ₂"qp
   3 gÓul

DJMcMayhem commented 1 year ago

I think there's a major oversight in your parsing. &# is not really meaningful V code (it's valid, but not exactly 'useful'), so there's absolutely no way that's the most common 2-byte sequence in V code. Look at this answer, for example: https://codegolf.stackexchange.com/a/124772/31716

The markdown for that answer is

<pre><code>Í.&#147;op
</code></pre>

which renders on SE as

Í.“op

Also, <C-x> means "ctrl-x", but this parser treats it as 5 distinct bytes instead of 1, which is why <C, C-, and esc all score so high. It seems like this parser isn't sophisticated enough to handle the way V answers tend to be formatted.
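
To make the byte counting concrete, here is a rough sketch of collapsing the readable notation into one byte per keystroke. The function name to_bytes is hypothetical, and the sketch assumes every numeric entity is at most 255 and every remaining character fits in Latin-1; real answers may break both assumptions.

```python
import re

# One alternative per token kind; the last one matches any single character.
TOKEN = re.compile(r"&#(\d+);|<C-([a-z])>|<(esc)>|(.)", re.DOTALL)

def to_bytes(readable):
    """Collapse readable-mode notation into raw bytes (illustrative sketch)."""
    out = bytearray()
    for m in TOKEN.finditer(readable):
        num, ctrl, esc, ch = m.groups()
        if num is not None:
            out.append(int(num))                  # assumption: value <= 255
        elif ctrl is not None:
            out.append(ord(ctrl) - ord("a") + 1)  # <C-a> is 0x01, <C-z> is 0x1a
        elif esc is not None:
            out.append(0x1B)
        else:
            out.extend(ch.encode("latin-1"))      # assumption: char <= U+00FF
    return bytes(out)

print(len(to_bytes("Í.&#147;op")))   # the answer linked above
print(to_bytes("<C-a>x<esc>"))
```

Under those assumptions the linked answer comes out as 5 bytes, and <C-a> or <esc> each count as one.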

jfioasd commented 1 year ago

Thanks, I thought V was an SBCS. I'll try to parse <...> and &#...; in my analyzer.

Apparently, SE uses CP-1252, so I'll use the 05AB1E codepage to display it (replacing these sequences with the respective characters).
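
A quick check of the CP-1252 reading, using only the Python standard library; the entity is the one from the answer quoted earlier in the thread:

```python
import html

entity = "&#147;"
byte_value = int(entity[2:-1])                     # 147, i.e. 0x93
as_cp1252 = bytes([byte_value]).decode("cp1252")   # the character CP-1252 assigns to 0x93

# html.unescape follows the HTML rules for numeric references in the 0x80-0x9F range.
print(hex(byte_value), as_cp1252, html.unescape(entity) == as_cp1252)
```

This only shows which character the entity renders as; for the corpus, the byte value (0x93) is the part that matters.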

DJMcMayhem commented 1 year ago

It is an SBCS; it's just that the answers are frequently formatted in "readable mode" with things like <C-a>, <esc>, <M-D>, etc. That's one additional thing that would need to be parsed: <M-x> means "alt-x", which is 'x' with the high bit set in latin9, or ø.
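
The high-bit rule can be sketched in a couple of lines. The helper name meta is hypothetical, and it assumes the key is a single ASCII character:

```python
def meta(key):
    """Return the character for <M-key>: the key's byte with the high bit set."""
    # iso8859_15 is Python's codec name for latin9.
    return bytes([ord(key) | 0x80]).decode("iso8859_15")

print(meta("x"), meta("D"))
```

Here 'x' is 0x78, so setting the high bit gives 0xF8, which is ø.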

jfioasd commented 1 year ago

Done

DJMcMayhem / V

V Corpus #27