freznicek / awk-crashcourse

AWK language course

byte counter is word counter . awk #3

Open mogando668 opened 3 years ago

mogando668 commented 3 years ago

hi,

just a very minor comment -

summing up length($0) only works if the input is guaranteed to be ASCII-only.

i accidentally discovered that, even in gawk's unicode mode, an exact byte count for UTF-8 inputs, or even for purely binary files like a .gz or a .mp4, can be obtained with a simple

match($0, /$/) - 1

does the trick. the minus 1 is needed since /$/ matches at the first available position, which is immediately after the end of the input itself.
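A minimal sketch for trying the comparison yourself. The helper name `len_vs_match` is hypothetical, and plain `awk` stands in for gawk here. For each input line it prints `length($0)` next to `match($0, /$/) - 1`. On ASCII-only input the two columns agree; the divergence described above is what gawk is reported to show on multibyte input in a UTF-8 locale, so no expected output is claimed for that case.

```shell
# len_vs_match: hypothetical helper, not part of the course repository.
# Prints, per input line, length($0) followed by match($0, /$/) - 1.
len_vs_match() {
    awk '{ print length($0), match($0, /$/) - 1 }' "$@"
}

# ASCII-only input: both columns agree.
printf 'hello\n' | len_vs_match
```

Feeding the same helper a UTF-8 file under gawk is a quick way to check whether the two columns differ on a given system.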

Conversely, if one knows for certain that RT always has a fixed width of 1 byte (e.g. only \n ), then a byte count is even simpler -

at each row, add up

byte_cnt += match($0, /$/)

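then in the END { } section, byte_cnt will be accurate, because the extra 1 returned by each match( ) accounts for that row's 1-byte terminator. In byte/POSIX/C mode, length( ) already counts bytes and match( ) doesn't offer any speed-up (in gawk -b it is far slower, as the timings below show), so for those, use length( ) instead.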
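A rough sketch of the byte-mode variant. The function name `count_bytes` is hypothetical, and forcing `LC_ALL=C` is an assumed stand-in for `gawk -b` so that the sketch also runs with other awks. Each record contributes `length($0)` plus 1 byte for its `\n` terminator. It assumes every record, including the last one, ends in a single `\n`; input without a trailing newline would be overcounted by 1.

```shell
# count_bytes: hypothetical helper, not part of the course repository.
# LC_ALL=C makes length() count bytes rather than characters.
count_bytes() {
    LC_ALL=C awk '{ n += length($0) + 1 } END { print n + 0 }' "$@"
}

# 5 bytes in total: a, \n, b, c, \n
printf 'a\nbc\n' | count_bytes
```

The `n + 0` in the END block only ensures that empty input prints 0 instead of an empty line.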


% time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS=RS="^$" } END { print match($0,/$/) - 1 }' | ecp); echo

      in0:  408MiB 0:00:00 [1011MiB/s] [1011MiB/s] [===================================>] 100%            
428814321

( pvE 0.1 in0 < "${m3r}" | gawk -e  | mawk ; )  13.25s user 0.71s system 100% cpu 13.865 total

% time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print  match($0,/$/) - 1 }' | ecp); echo

      in0:  408MiB 0:00:00 [1.13GiB/s] [1.13GiB/s] [===================================>] 100%            
428814321

( pvE 0.1 in0 < "${m3r}" | gawk -b -e  | mawk ; )  13.47s user 0.66s system 100% cpu 14.042 total

time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print length }' | ecp); echo

      in0:  408MiB 0:00:00 [1.15GiB/s] [1.15GiB/s] [===================================>] 100%            
428814321

( pvE 0.1 in0 < "${m3r}" | gawk -b -e  | mawk ; )  0.28s user 0.67s system 115% cpu 0.825 total

one can obtain a tiny speed-up by summing row-by-row instead of all at once, while mawk2's match( ) is implemented in such a manner that the match-only approach causes hardly any slowdown on small inputs:

 time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print  byte_cnt }' | ecp); echo

      in0:  408MiB 0:00:13 [30.3MiB/s] [30.3MiB/s] [===================================>] 100%            
428814321

( pvE 0.1 in0 < "${m3r}" | gawk -e  | mawk ; )  13.49s user 0.28s system 101% cpu 13.553 total
 time ( pvE0 < "${m3r}" | mawk2  'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print  byte_cnt }' | ecp); echo

      in0:  408MiB 0:00:00 [1.47GiB/s] [1.47GiB/s] [===================================>] 100%            
428814321

( pvE 0.1 in0 < "${m3r}" | mawk2  | mawk ; )  0.11s user 0.28s system 124% cpu 0.310 total

 time ( pvE0 < "${m3r}" | mawk2  'BEGIN { FS="^$" } { byte_cnt += length($0) } END { print  byte_cnt+NR }' | ecp); echo

      in0:  408MiB 0:00:00 [1.50GiB/s] [1.50GiB/s] [===================================>] 100%            
428814321

( pvE 0.1 in0 < "${m3r}" | mawk2  | mawk ; )  0.10s user 0.27s system 124% cpu 0.300 total

here, i've thrown in a 224MB .7z binary file, and gawk handles it just fine without any error messages (i've also added the GNU wc output for reference; with -lcm it prints lines, characters, then bytes):

 f='./MV82_ConsolidatedDesktop/new_m3t_need_append.txt.7z'; gwc -lcm "${f}" | lgp3; time ( pvE0 < "${f}" | gawk -e 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print  byte_cnt - (RT=="") }' | ecp); echo

   920308 125659415 235672582 ./MV82_ConsolidatedDesktop/new_m3t_need_append.txt.7z

      in0:  224MiB 0:00:07 [28.6MiB/s] [28.6MiB/s] [===================================>] 100%            
235672582

( pvE 0.1 in0 < "${f}" | gawk -e  | mawk ; )  7.83s user 0.22s system 101% cpu 7.892 total
freznicek commented 3 years ago

Thank you for the detailed analysis; I'm aware of this mismatch. I'm going to find a simple fix or document the current behavior.