NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License
819 stars 230 forks source link

Align print format string for column names and units #126

Open dmitrygx opened 1 year ago

dmitrygx commented 1 year ago

Format for column names and units started to be mismatched after changing "error" to "#wrong" in NCCL 2.13 (merged to master as a part of https://github.com/NVIDIA/nccl-tests/commit/51af5572bf8ebf197bac7de8cd6bc7d847339575).

AddyLaddy commented 1 year ago

I can't see any misalignment in my tests:

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1073741824     268435456     float     sum      -1   8160.7  131.58  230.26      0   8158.3  131.61  230.32      0
  2147483648     536870912     float     sum      -1    16191  132.64  232.11      0    16192  132.63  232.10      0
  4294967296    1073741824     float     sum      -1    32115  133.74  234.04      0    32106  133.77  234.10      0
  8589934592    2147483648     float     sum      -1    64086  134.04  234.57      0    64070  134.07  234.62      0
 17179869184    4294967296     float     sum      -1   128070  134.14  234.75      0   128042  134.17  234.80      0
sjeaugey commented 1 year ago

That's because you don't have any data corruption. In the case of corruption, we could have up to 6 chars I guess hence the change.

If that is truly the idea, I agree it would be better. It's annoying to widen the line for something that should never happen, but if it happened, it would break any automatic parsing, including parsing supposed to report errors.

dmitrygx commented 1 year ago

this change aligns format string:

to make them identical: "# %10s %12s %8s %6s %6s %7s %6s %6s %5s %7s %6s %6s %5s\n" and have the same width for #wrong column in units and measurements printed: https://github.com/NVIDIA/nccl-tests/blob/2cbb968101e2bfc7d3a7f0f1826c0189355de6fe/src/common.cu#L553 and https://github.com/NVIDIA/nccl-tests/blob/2cbb968101e2bfc7d3a7f0f1826c0189355de6fe/src/common.cu#L555

dmitrygx commented 1 year ago

@AddyLaddy @sjeaugey could you review pls?

sjeaugey commented 1 year ago

The reason we replaced %5s by %6s (removing one space and adding one char to the column name) was because we needed that extra char given #wrong is 6 chars. If you unify everything on %5s then #wrong will be too large and cause a shift of the columns by one char.

So I'm still not sure what problem we're trying to solve here.