ezrosent / frawk

an efficient awk-like language
Apache License 2.0
1.26k stars 36 forks source link

Can the executable size be made smaller #113

Open 9ao9ai9ar opened 1 month ago

9ao9ai9ar commented 1 month ago
$ ls -lHSh /usr/bin/{awk,grep,mawk,rg} ~/.local/bin/frawk
-rwxr-xr-x. 1 user user  43M Sep 24 09:19 /home/user/.local/bin/frawk
-rwxr-xr-x. 1 root root 4.3M May 23 08:00 /usr/bin/rg
-rwxr-xr-x. 1 root root 747K Jan 24  2024 /usr/bin/awk
-rwxr-xr-x. 1 root root 175K Jan 21  2024 /usr/bin/mawk
-rwxr-xr-x. 1 root root 167K Jan 24  2024 /usr/bin/grep

frawk, when fully stripped, is still an order of magnitude larger than ripgrep, a similar Rust CLI program that is already an order of magnitude larger than traditional UNIX CLI tools. The large size gives the impression that the program is bloated, even more so when my benchmark shows that it is slower than mawk by some margin.

ghuls commented 1 month ago

Compiling without llvm probably makes it a lot smaller. My frawk binary is 9.4M stripped and 17M unstripped.

I rarely see mawk being much faster than frawk (except for https://github.com/ezrosent/frawk/issues/98). Although I guess it might be possible if you test on small files (as the overhead of compiling the awk script in case of frawk might be most of the runtime). Do you have some example scripts?

9ao9ai9ar commented 1 month ago

How to build frawk without LLVM?

I've used frawk with all optimization levels and backends in my scripts in this repo, but none of the combinations appear to be noticeably faster with frawk compared to mawk, on both modern SSD running Fedora 40 and obsolescent HDD running Ubuntu 22.04 in WSL. (In fact, they're generally more noticeably slower with frawk than faster.)

I wonder if you could replicate my observations @ghuls? To test the scripts utilizing frawk, you could either define a function mawk in benchmark.sh that calls frawk to make an in-place override (nothing is run in a loop, so the function lookup cost is negligible), or make a copy of each script in solutions and modify it to use frawk instead of mawk/awk (except with the Toby's enhancement versions you need to change the syntax — if I recall correctly, replacing && with something else — since it's incompatible with frawk).

ghuls commented 1 month ago
# Without LLVM, but with other recommended defaults
$ cargo +nightly install --path . --no-default-features --features use_jemalloc,allow_avx2,unstable
ghuls commented 1 week ago

The awk scripts that don't work, need to be reformatted (frawk parser does not accept all valid awk script yet):

-    mawk '{ ++count[$0] }
-          END {
-              for (i in count) split(i, r, "[-\"]") && a[r[2]] += count[i];
-              for (i in a) total += a[i];
-              print +total, +a["1"], +a["0"], +a["1/2"];
-          }'
-
+    frawk '{
+        ++count[$0]
+    }
+    END {
+        for (i in count) {
+            split(i, r, "[-\"]") && (a[r[2]] += count[i])
+        }
+        for (i in a) {
+            total += a[i]
+        }
+        print +total, +a["1"], +a["0"], +a["1/2"]
+    }'

The easiest way to reformat an awk script to one that frawk accepts, is to use gawk --pretty-print:

gawk --pretty-print=/dev/stdout '{ ++count[$0] }
          END {
               for (i in count) split(i, r, "[-\"]") && a[r[2]] += count[i];
               for (i in a) total += a[i];
               print +total, +a["1"], +a["0"], +a["1/2"];
           }'
{
    ++count[$0]
}

END {
    for (i in count) {
        split(i, r, "[-\"]") && (a[r[2]] += count[i])
    }
    for (i in a) {
        total += a[i]
    }
    print +total, +a["1"], +a["0"], +a["1/2"]
}

Indeed for your repo, frawk seems slower than mawk.

I assume it is because of the irrigularities between different lines, that might trow of some of frawks optimizations.

$ head -n 59 chess_data_master.head5000000.txt 
[Event "Hastings 1972/73"]
[Site "Hastings ENG"]
[Date "1973.01.07"]
[EventDate "1972.12.27"]
[Round "11"]
[Result "0-1"]
[White "Vasily Smyslov"]
[Black "Walter Shawn Browne"]
[ECO "A45"]
[WhiteElo "?"]
[BlackElo "?"]
[PlyCount "66"]

1.d4 Nf6 2.g3 c5 3.dxc5 Na6 4.Bg2 Nxc5 5.c4 g6 6.Nc3 Bg7 7.Nh3
O-O 8.O-O d6 9.Nf4 a6 10.Be3 Rb8 11.Rc1 Ng4 12.Bd2 b5 13.cxb5
axb5 14.b4 Ne6 15.Nfd5 Ne5 16.Nxb5 Ba6 17.Nbc7 Nxc7 18.Rxc7 e6
19.Ra7 Bb5 20.Nc7 Bc6 21.Qc2 Rb6 22.b5 Qb8 23.Ra3 Bxg2 24.Kxg2
Rc8 25.Rc1 Rxb5 26.Nxb5 Rxc2 27.Rxc2 Qxb5 28.Rc8+ Bf8 29.Raa8
Kg7 30.Bc3 Be7 31.f4 Qxe2+ 32.Kg1 Kh6 33.fxe5 Bg5 0-1[Date "1901.11.23"]
[EventDate "1901.11.17"]
[Round "4"]
[Result "0-1"]
[White "Juan Corzo"]
[Black "Jose Raul Capablanca"]
[ECO "C49"]
[WhiteElo "?"]
[BlackElo "?"]
[PlyCount "136"]

1.e4 e5 2.Nf3 Nc6 3.Nc3 Nf6 4.Bb5 Bb4 5.O-O O-O 6.d3 d6 7.Bg5
Ne7 8.Ne2 Ng6 9.c3 Ba5 10.Ng3 h6 11.Bxf6 Qxf6 12.Nh5 Qe7 13.h3
c6 14.Bc4 Be6 15.Nd2 Qg5 16.Bxe6 fxe6 17.Qg4 Qxg4 18.hxg4 Bc7
19.Nf3 Rae8 20.g5 hxg5 21.Nxg5 Nf4 22.Nxf4 exf4 23.f3 e5
24.Rad1 c5 25.Kf2 Bd8 26.Nh3 b5 27.Ke2 Bb6 28.Rh1 Rf6 29.Rh2
Rh6 30.Rdh1 Ree6 31.Nf2 Rxh2 32.Rxh2 Rh6 33.Rxh6 gxh6 34.Nh3
Kg7 35.c4 bxc4 36.dxc4 Kg6 37.Kf2 Ba5 38.Ke2 Kh5 39.Nf2 Kg5
40.a3 h5 41.Nd3 Kh4 42.b4 Bb6 43.Nb2 Kg3 44.Kf1 h4 45.Na4 cxb4
46.axb4 h3 47.gxh3 Kxf3 48.c5 dxc5 49.Nxc5 Kg3 50.Nd3 Bd4
51.b5 Kxh3 52.Ke2 Kg3 53.Ne1 Kg4 54.Nf3 Bc3 55.Kf2 Bd4+ 56.Ke2
Kg3 57.Ne1 Ba1 58.Nf3 Bc3 59.Ng5 f3+ 60.Nxf3 Kf4 61.Kf2 Kxe4
62.Ng5+ Kd3 63.Kf3 Kc4 64.Ne4 Bd4 65.Nd6+ Kc5 66.Nc8 Kxb5
67.Ke4 a5 68.Nd6+ Kb4 0-1
[Event "wcc"]
[Site "Elista"]
[Date "1996.06.06"]
[Round "01"]
[White "Karpov A"]
[Black "Kamsky G"]
[Result "1-0"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. Nf3 Bg7 5. Qb3 dxc4 6. Qxc4 O-O 7. e4 Nc6
8. Be2 Bg4 9. Be3 Bxf3 10. Bxf3 e5 11. d5 Nd4 12. Bd1 b5 13. Nxb5 Nxe4 14.
O-O a6 15. Nc3 Nd6 16. Qd3 Qh4 17. g3 Qh3 18. Bxd4 exd4 19. Ne2 Qf5 20. Nf4
Rfb8 21. Qxf5 Nxf5 22. Nd3 Bh6 23. Re1 a5 24. Bg4 Nd6 25. Re2 a4 26. a3 Ra5
27. Rc2 Rxd5 28. Rxc7 Ra5 29. Bf3 Bg5 30. Rd1 Rc8 31. Rxc8+ Nxc8 32. h4 Bf6
33. Rc1 Nd6 34. Kf1 Be7 35. Ke2 Kf8 36. Rc7 Bf6 37. Kd2 h5 38. Ke2 Nf5 39.
Rc4 Nd6 40. Rb4 Ra6 41. Nc5 Ra7 42. Kd3 Rc7 43. Nxa4 Rc1 44. Nb6 Bg7 45. a4
Ra1 46. Nd7+ Ke8 47. Nc5 Ke7 48. Kc2 Rf1 49. Nd3 Ra1 50. Kb3 f5 51. Rb6 Bh6
52. Bd5 g5 53. Ra6 gxh4 54. gxh4 Rd1 55. Bc4 Rh1 56. a5 Rxh4 1-0

frawk seems to split eagerly on fields even if they are not used.

Setting field separator and just counting number of lines.

# mawk
❯ timeit mawk -F '"|-' '{ i++ } END { print i }' chess_data_master.head5000000.txt
50000000

Time output:
------------

  * Command: mawk -F "|- { i++ } END { print i } chess_data_master.head5000000.txt
  * Elapsed wall time: 0:01.13 = 1.13 seconds
  * Elapsed CPU time:
     - User: 1.03
     - Sys: 0.09
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 1
     - Involuntarily (time slice expired): 291
  * Maximum resident set size (RSS: memory) (kiB): 2744
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

# frawk
❯ timeit frawk -F '"|-' '{ i++ } END { print i }' chess_data_master.head5000000.txt
50000000

Time output:
------------

  * Command: frawk -F "|- { i++ } END { print i } chess_data_master.head5000000.txt
  * Elapsed wall time: 0:03.50 = 3.50 seconds
  * Elapsed CPU time:
     - User: 3.33
     - Sys: 0.16
  * CPU usage: 100%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 3
     - Involuntarily (time slice expired): 40
  * Maximum resident set size (RSS: memory) (kiB): 9768
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

# Setting field separator in frawk to a non common symbol seemingly avoids this overhead: (newline)
❯ timeit frawk -F '\n' '{ i++ } END { print i }' chess_data_master.head5000000.txt
50000000

Time output:
------------

  * Command: frawk -F \n { i++ } END { print i } chess_data_master.head5000000.txt
  * Elapsed wall time: 0:01.80 = 1.80 seconds
  * Elapsed CPU time:
     - User: 1.63
     - Sys: 0.16
  * CPU usage: 100%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 9
  * Maximum resident set size (RSS: memory) (kiB): 9808
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

# Setting field separator in frawk to a non common symbol seemingly avoids this overhead: (tab)
$ timeit frawk -F '\t' '{ i++ } END { print i }' chess_data_master.head5000000.txt
50000000

Time output:
------------

  * Command: frawk -F \t { i++ } END { print i } chess_data_master.head5000000.txt
  * Elapsed wall time: 0:01.97 = 1.97 seconds
  * Elapsed CPU time:
     - User: 1.77
     - Sys: 0.19
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 4
     - Involuntarily (time slice expired): 70
  * Maximum resident set size (RSS: memory) (kiB): 9716
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

Setting field separator and just counting number of lines that contain "Result".

# mawk
❯ timeit mawk -F '"|-' '/Result/{ i++; } END { print i++ }' chess_data_master.head5000000.txt 
2100248

Time output:
------------

  * Command: mawk -F "|- /Result/{ i++; } END { print i++ } chess_data_master.head5000000.txt
  * Elapsed wall time: 0:02.47 = 2.47 seconds
  * Elapsed CPU time:
     - User: 2.28
     - Sys: 0.17
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 1
     - Involuntarily (time slice expired): 137
  * Maximum resident set size (RSS: memory) (kiB): 2864
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

# frawk
❯ timeit frawk -F '"|-' '/Result/{ i++; } END { print i++ }' chess_data_master.head5000000.txt 
2100248

Time output:
------------

  * Command: frawk -F  |" /Result/{ i++; } END { print i++ } chess_data_master.head5000000.txt
  * Elapsed wall time: 0:05.53 = 5.53 seconds
  * Elapsed CPU time:
     - User: 5.38
     - Sys: 0.14
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 6
     - Involuntarily (time slice expired): 51
  * Maximum resident set size (RSS: memory) (kiB): 9900
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

# Set newline as field separator in frawk. Some speed is regained, but matching Result is slower than mawk
❯ timeit frawk -F '\n' '/Result/{ i++; } END { print i++ }' chess_data_master.head5000000.txt 
2100248

Time output:
------------

  * Command: frawk -F \n /Result/{ i++; } END { print i++ } chess_data_master.head5000000.txt
  * Elapsed wall time: 0:04.01 = 4.01 seconds
  * Elapsed CPU time:
     - User: 3.82
     - Sys: 0.18
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 6
     - Involuntarily (time slice expired): 10
  * Maximum resident set size (RSS: memory) (kiB): 9864
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0
9ao9ai9ar commented 1 week ago

Didn't know about --pretty-print, that's gonna be so useful!