NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License
866 stars 234 forks source link

getHostHash() maps to same hash value for different hostnames #60

Closed jithinjosepkl closed 3 years ago

jithinjosepkl commented 3 years ago

getHostHash() in src/common.h maps different hostnames to same hashvalue:

Simple repro:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static uint64_t getHostHash(const char* string) {
  // Based on DJB2, result = result * 33 + char
  uint64_t result = 5381;
  for (int c = 0; string[c] != '\0'; c++){
    result = ((result << 5) + result) + string[c];
  }
  return result;
}

int main()
{
  char *host1 = "compute00000R";
  char *host2 = "compute000011";

  fprintf(stderr, "hash = 0x%" PRIx64 ", %s\n", getHostHash(host1), host1);
  fprintf(stderr, "hash = 0x%" PRIx64 ", %s\n", getHostHash(host2), host2);
  return 0;
}

jijos@jijos-pc2:~/workspace$ ./hash
hash = 0xc9f19cd3023f9904, compute00000R
hash = 0xc9f19cd3023f9904, compute000011

This will result in incorrect localRank calculation, and eventually result in 'invalid device ordinal' error in cudaGetDeviceProperties.

compute00000R: Test CUDA failure common.cu:732 'invalid device ordinal'
sjeaugey commented 3 years ago

Right, the hashing algorithm is extremely basic so forcing a collision is not hard. Is that something that actually happened or is it a thought experiment ?

In any case, we can certainly look for a more robust hash.

jithinjosepkl commented 3 years ago

Thanks @sjeaugey.

Is that something that actually happened or is it a thought experiment ?

Yes, this actually happened in a larger scale NCCL job.

AddyLaddy commented 3 years ago

Can you see if Bernstein's modified DJB2a works for you? i.e. replace the + with an xor;

result = ((result << 5) + result) ^ string[c];

jithinjosepkl commented 3 years ago

@AddyLaddy - yep, that avoids the collision, and the scale experiment is happy.

jithinjosepkl commented 3 years ago

Are you planning to invest on a more robust hash? Otherwise, I can go ahead and submit a PR to switch to modified DJB2a hash function.

sjeaugey commented 3 years ago

I was looking into that indeed, as to how much unlikely we were to have a collision with DJB2a relative to DJB2, and whether we were really unlucky in your case or it could happen on a regular basis. I couldn't find much to confirm that however.

We could use stronger hashes, like MD5 or SHA, but that would add a dependency to openssl.

Another solution would be to use the full hostname everywhere instead of a hash, but that means transmitting much more data (which might not be a big deal) and making comparisons and hostname manipulation in general much more complex.

jithinjosepkl commented 3 years ago

This collision is very easy to hit with Azure VMSS deployments. The modified DJB2a hash looks good so far, though we cannot really guarantee that this will not happen as we scale higher.

It would be great if we can switch to DJIB2a hash (at least as an interim solution) otherwise users may easily hit this error on Azure VMSS.

sjeaugey commented 3 years ago

Thanks for the PR @jithinjosepkl. I would think we need a fix in NCCL as well, right ? We use the same function to determine whether we are communicating intra-node or inter-node between ranks. Or is there something I'm missing ?

jithinjosepkl commented 3 years ago

I think it will be good to, but I haven't hit this issue with NCCL [to be precise, not yet ;)]. It could be because the boot_id is appended to hostname. (https://github.com/NVIDIA/nccl/blob/c6dbdb00849027b4e2c277653cbef53729f7213d/src/misc/utils.cc#L103)

sjeaugey commented 3 years ago

Ah. You're right. We no longer rely on the host hash (or at least not only) as we found the boot_id being much more reliable.

Thanks for reminding me of how my code works :smile:. I'll merge the PR.

abuccts commented 3 years ago

Hi @jithinjosepkl and @sjeaugey,

The issue still exists for Azure VMSS, even in a small scale when prefix varies.

Simple repo after using DJIB2a hash:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static uint64_t getHostHash(const char* string) {
  // Based on DJB2a, result = result * 33 ^ char
  uint64_t result = 5381;
  for (int c = 0; string[c] != '\0'; c++){
    result = ((result << 5) + result) ^ string[c];
  }
  return result;
}

static char* getVmssSuffix(int id) {
  // Azure VMSS naming follows ${user-defined-prefix}${6-digit-base36-id} format
  static char base36[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  static char suffix[] = "000000";
  unsigned int pt = 6;
  do {
    suffix[--pt] = base36[id % 36];
  } while (id /= 36);
  return suffix;
}

int main(int argc, char *argv[]) {
  for (int i = 0; i < 100; i ++) {
    char hostname[50] = {0};
    sprintf(hostname, "%s%s", argv[1], getVmssSuffix(i));
    printf("hash = 0x%" PRIx64 ", hostname = %s\n", getHostHash(hostname), hostname);
  }
  return 0;
}

Then run with:

gcc hash.c -o hash
./hash x100- | sort
Here's the output. ``` hash = 0xbf76030e9e63ab00, hostname = x100-00000Q hash = 0xbf76030e9e63ab00, hostname = x100-000010 hash = 0xbf76030e9e63ab01, hostname = x100-00000P hash = 0xbf76030e9e63ab01, hostname = x100-000011 hash = 0xbf76030e9e63ab02, hostname = x100-00000S hash = 0xbf76030e9e63ab02, hostname = x100-000012 hash = 0xbf76030e9e63ab03, hostname = x100-00000R hash = 0xbf76030e9e63ab03, hostname = x100-000013 hash = 0xbf76030e9e63ab04, hostname = x100-00000U hash = 0xbf76030e9e63ab04, hostname = x100-000014 hash = 0xbf76030e9e63ab05, hostname = x100-00000T hash = 0xbf76030e9e63ab05, hostname = x100-000015 hash = 0xbf76030e9e63ab06, hostname = x100-00000W hash = 0xbf76030e9e63ab06, hostname = x100-000016 hash = 0xbf76030e9e63ab07, hostname = x100-00000V hash = 0xbf76030e9e63ab07, hostname = x100-000017 hash = 0xbf76030e9e63ab08, hostname = x100-00000Y hash = 0xbf76030e9e63ab08, hostname = x100-000018 hash = 0xbf76030e9e63ab09, hostname = x100-00000X hash = 0xbf76030e9e63ab09, hostname = x100-000019 hash = 0xbf76030e9e63ab0b, hostname = x100-00000Z hash = 0xbf76030e9e63ab10, hostname = x100-00000A hash = 0xbf76030e9e63ab12, hostname = x100-00000C hash = 0xbf76030e9e63ab13, hostname = x100-00000B hash = 0xbf76030e9e63ab14, hostname = x100-00000E hash = 0xbf76030e9e63ab15, hostname = x100-00000D hash = 0xbf76030e9e63ab16, hostname = x100-00000G hash = 0xbf76030e9e63ab17, hostname = x100-00000F hash = 0xbf76030e9e63ab18, hostname = x100-00000I hash = 0xbf76030e9e63ab19, hostname = x100-00000H hash = 0xbf76030e9e63ab1a, hostname = x100-00000K hash = 0xbf76030e9e63ab1b, hostname = x100-00000J hash = 0xbf76030e9e63ab1c, hostname = x100-00000M hash = 0xbf76030e9e63ab1d, hostname = x100-00000L hash = 0xbf76030e9e63ab1e, hostname = x100-00000O hash = 0xbf76030e9e63ab1f, hostname = x100-00000N hash = 0xbf76030e9e63ab60, hostname = x100-000001 hash = 0xbf76030e9e63ab60, hostname = x100-00001P hash = 0xbf76030e9e63ab61, hostname = x100-000000 hash = 0xbf76030e9e63ab61, hostname = x100-00001Q hash = 0xbf76030e9e63ab62, hostname = x100-000003 hash = 0xbf76030e9e63ab62, hostname = x100-00001R hash = 0xbf76030e9e63ab63, hostname = x100-000002 hash = 0xbf76030e9e63ab63, hostname = x100-00001S hash = 0xbf76030e9e63ab64, hostname = x100-000005 hash = 0xbf76030e9e63ab64, hostname = x100-00001T hash = 0xbf76030e9e63ab65, hostname = x100-000004 hash = 0xbf76030e9e63ab65, hostname = x100-00001U hash = 0xbf76030e9e63ab66, hostname = x100-000007 hash = 0xbf76030e9e63ab66, hostname = x100-00001V hash = 0xbf76030e9e63ab67, hostname = x100-000006 hash = 0xbf76030e9e63ab67, hostname = x100-00001W hash = 0xbf76030e9e63ab68, hostname = x100-000009 hash = 0xbf76030e9e63ab68, hostname = x100-00001X hash = 0xbf76030e9e63ab69, hostname = x100-000008 hash = 0xbf76030e9e63ab69, hostname = x100-00001Y hash = 0xbf76030e9e63ab6a, hostname = x100-00001Z hash = 0xbf76030e9e63ab71, hostname = x100-00001A hash = 0xbf76030e9e63ab72, hostname = x100-00001B hash = 0xbf76030e9e63ab73, hostname = x100-00001C hash = 0xbf76030e9e63ab74, hostname = x100-00001D hash = 0xbf76030e9e63ab75, hostname = x100-00001E hash = 0xbf76030e9e63ab76, hostname = x100-00001F hash = 0xbf76030e9e63ab77, hostname = x100-00001G hash = 0xbf76030e9e63ab78, hostname = x100-00001H hash = 0xbf76030e9e63ab79, hostname = x100-00001I hash = 0xbf76030e9e63ab7a, hostname = x100-00001J hash = 0xbf76030e9e63ab7b, hostname = x100-00001K hash = 0xbf76030e9e63ab7c, hostname = x100-00001L hash = 0xbf76030e9e63ab7d, hostname = x100-00001M hash = 0xbf76030e9e63ab7e, hostname = x100-00001N hash = 0xbf76030e9e63ab7f, hostname = x100-00001O hash = 0xbf76030e9e63aba0, hostname = x100-000023 hash = 0xbf76030e9e63aba1, hostname = x100-000022 hash = 0xbf76030e9e63aba2, hostname = x100-000021 hash = 0xbf76030e9e63aba3, hostname = x100-000020 hash = 0xbf76030e9e63aba4, hostname = x100-000027 hash = 0xbf76030e9e63aba5, hostname = x100-000026 hash = 0xbf76030e9e63aba6, hostname = x100-000025 hash = 0xbf76030e9e63aba7, hostname = x100-000024 hash = 0xbf76030e9e63abaa, hostname = x100-000029 hash = 0xbf76030e9e63abab, hostname = x100-000028 hash = 0xbf76030e9e63abc1, hostname = x100-00002R hash = 0xbf76030e9e63abc2, hostname = x100-00002Q hash = 0xbf76030e9e63abc3, hostname = x100-00002P hash = 0xbf76030e9e63abd0, hostname = x100-00002C hash = 0xbf76030e9e63abd1, hostname = x100-00002B hash = 0xbf76030e9e63abd2, hostname = x100-00002A hash = 0xbf76030e9e63abd4, hostname = x100-00002G hash = 0xbf76030e9e63abd5, hostname = x100-00002F hash = 0xbf76030e9e63abd6, hostname = x100-00002E hash = 0xbf76030e9e63abd7, hostname = x100-00002D hash = 0xbf76030e9e63abd8, hostname = x100-00002K hash = 0xbf76030e9e63abd9, hostname = x100-00002J hash = 0xbf76030e9e63abda, hostname = x100-00002I hash = 0xbf76030e9e63abdb, hostname = x100-00002H hash = 0xbf76030e9e63abdc, hostname = x100-00002O hash = 0xbf76030e9e63abdd, hostname = x100-00002N hash = 0xbf76030e9e63abde, hostname = x100-00002M hash = 0xbf76030e9e63abdf, hostname = x100-00002L ```

You can see there're lots of hash collisions when using prefix x100-. Maybe it's better to append /proc/sys/kernel/random/boot_id to hostname.