Summing up length($0) only gives a byte count if the input is guaranteed to be ASCII-only, because in gawk's Unicode mode length( ) counts characters rather than bytes.
I accidentally discovered that, even in gawk's Unicode mode, you can get an exact byte count for UTF-8 inputs, or even purely binary files like a .gz or a .mp4, and a simple
match($0, /$/) - 1
does the trick. The minus 1 is needed because /$/ matches at the first available position, which is immediately after the end of the input itself.
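As a quick sanity check of the match( ) trick above, here is a minimal sketch that compares it against wc -c on a tiny UTF-8 sample. It assumes gawk is installed; the helper name gawk_bytes and the sample text are made up for illustration and are not from the original timings.

```shell
#!/bin/sh
# gawk_bytes FILE: hypothetical helper. Slurps the whole file as one record
# (RS="^$" never matches non-empty input) and prints match($0, /$/) - 1.
gawk_bytes() {
    gawk 'BEGIN { FS = RS = "^$" } END { print match($0, /$/) - 1 }' "$1"
}

if command -v gawk >/dev/null 2>&1; then
    sample=$(mktemp)
    # "h", a 2-byte UTF-8 e-acute (octal 303 251), "llo", then a newline
    printf 'h\303\251llo\n' > "$sample"
    echo "gawk match trick: $(gawk_bytes "$sample")"
    echo "wc -c           : $(wc -c < "$sample" | tr -d ' ')"
    rm -f "$sample"
else
    echo "gawk not found; skipping demo"
fi
```

If the two printed numbers agree, the trick is counting bytes rather than characters on your gawk build.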
Alternatively, if one definitely knows that RT is always exactly 1 byte wide (e.g. only \n ),
then a byte count is even simpler:
at each row, add up
byte_cnt += match($0, /$/)
then, in the END { } section, byte_cnt will be accurate, since the extra 1 returned by each match( ) accounts for that row's 1-byte terminator. In byte/POSIX/C mode, match( ) doesn't offer any speed-up, so for those, use length( ) instead.
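The byte-mode variant of the row-by-row sum can be sketched like this. It is only an illustration under stated assumptions: every row ends in a single-byte \n, LC_ALL=C is set so that length( ) counts bytes, and the helper name row_bytes is hypothetical. If the last line has no trailing newline, the total over-counts by 1.

```shell
#!/bin/sh
# row_bytes FILE: hypothetical helper. Adds each row's byte length plus 1
# for its newline terminator; LC_ALL=C forces byte semantics for length().
row_bytes() {
    LC_ALL=C awk '{ byte_cnt += length($0) + 1 } END { print byte_cnt + 0 }' "$1"
}

sample=$(mktemp)
printf 'abc\nde\n' > "$sample"
echo "row-by-row sum: $(row_bytes "$sample")"
echo "wc -c         : $(wc -c < "$sample" | tr -d ' ')"
rm -f "$sample"
```

Adding 1 per row inside the loop is equivalent to adding NR once in END.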
% time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS=RS="^$" } END { print match($0,/$/) - 1 }' | ecp); echo
in0: 408MiB 0:00:00 [1011MiB/s] [1011MiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -e | mawk ; ) 13.25s user 0.71s system 100% cpu 13.865 total
% time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print match($0,/$/) - 1 }' | ecp); echo
in0: 408MiB 0:00:00 [1.13GiB/s] [1.13GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -b -e | mawk ; ) 13.47s user 0.66s system 100% cpu 14.042 total
time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print length }' | ecp); echo
in0: 408MiB 0:00:00 [1.15GiB/s] [1.15GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -b -e | mawk ; ) 0.28s user 0.67s system 115% cpu 0.825 total
With gawk, one can obtain a tiny speed-up by summing row by row instead of all at once, while mawk2's match( ) is implemented in such a way that the match-only approach is hardly any slower than length( ) on small inputs:
time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print byte_cnt }' | ecp); echo
in0: 408MiB 0:00:13 [30.3MiB/s] [30.3MiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | gawk -e | mawk ; ) 13.49s user 0.28s system 101% cpu 13.553 total
time ( pvE0 < "${m3r}" | mawk2 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print byte_cnt }' | ecp); echo
in0: 408MiB 0:00:00 [1.47GiB/s] [1.47GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | mawk2 | mawk ; ) 0.11s user 0.28s system 124% cpu 0.310 total
time ( pvE0 < "${m3r}" | mawk2 'BEGIN { FS="^$" } { byte_cnt += length($0) } END { print byte_cnt+NR }' | ecp); echo
in0: 408MiB 0:00:00 [1.50GiB/s] [1.50GiB/s] [===================================>] 100%
428814321
( pvE 0.1 in0 < "${m3r}" | mawk2 | mawk ; ) 0.10s user 0.27s system 124% cpu 0.300 total
Here, I've thrown in a 224MB .7z binary file, and gawk handles it just fine without any error messages (I've also added the GNU wc output for reference):