Closed jithinjosepkl closed 3 years ago
Right, the hashing algorithm is extremely basic so forcing a collision is not hard. Is that something that actually happened or is it a thought experiment ?
In any case, we can certainly look for a more robust hash.
Thanks @sjeaugey.
Is that something that actually happened or is it a thought experiment ?
Yes, this actually happened in a larger scale NCCL job.
Can you see if Bernstein's modified DJB2a works for you? i.e. replace the + with an xor;
result = ((result << 5) + result) ^ string[c];
@AddyLaddy - yep, that avoids the collision, and the scale experiment is happy.
Are you planning to invest on a more robust hash? Otherwise, I can go ahead and submit a PR to switch to modified DJB2a hash function.
I was looking into that indeed, as to how much unlikely we were to have a collision with DJB2a relative to DJB2, and whether we were really unlucky in your case or it could happen on a regular basis. I couldn't find much to confirm that however.
We could use stronger hashes, like MD5 or SHA, but that would add a dependency to openssl.
Another solution would be to use the full hostname everywhere instead of a hash, but that means transmitting much more data (which might not be a big deal) and making comparisons and hostname manipulation in general much more complex.
This collision is very easy to hit with Azure VMSS deployments. The modified DJB2a hash looks good so far, though we cannot really guarantee that this will not happen as we scale higher.
It would be great if we can switch to DJIB2a hash (at least as an interim solution) otherwise users may easily hit this error on Azure VMSS.
Thanks for the PR @jithinjosepkl. I would think we need a fix in NCCL as well, right ? We use the same function to determine whether we are communicating intra-node or inter-node between ranks. Or is there something I'm missing ?
I think it will be good to, but I haven't hit this issue with NCCL [to be precise, not yet ;)].
It could be because the boot_id
is appended to hostname. (https://github.com/NVIDIA/nccl/blob/c6dbdb00849027b4e2c277653cbef53729f7213d/src/misc/utils.cc#L103)
Ah. You're right. We no longer rely on the host hash (or at least not only) as we found the boot_id being much more reliable.
Thanks for reminding me of how my code works :smile:. I'll merge the PR.
Hi @jithinjosepkl and @sjeaugey,
The issue still exists for Azure VMSS, even in a small scale when prefix varies.
Simple repo after using DJIB2a
hash:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t getHostHash(const char* string) {
// Based on DJB2a, result = result * 33 ^ char
uint64_t result = 5381;
for (int c = 0; string[c] != '\0'; c++){
result = ((result << 5) + result) ^ string[c];
}
return result;
}
static char* getVmssSuffix(int id) {
// Azure VMSS naming follows ${user-defined-prefix}${6-digit-base36-id} format
static char base36[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static char suffix[] = "000000";
unsigned int pt = 6;
do {
suffix[--pt] = base36[id % 36];
} while (id /= 36);
return suffix;
}
int main(int argc, char *argv[]) {
for (int i = 0; i < 100; i ++) {
char hostname[50] = {0};
sprintf(hostname, "%s%s", argv[1], getVmssSuffix(i));
printf("hash = 0x%" PRIx64 ", hostname = %s\n", getHostHash(hostname), hostname);
}
return 0;
}
Then run with:
gcc hash.c -o hash
./hash x100- | sort
You can see there're lots of hash collisions when using prefix x100-
.
Maybe it's better to append /proc/sys/kernel/random/boot_id
to hostname
.
getHostHash() in src/common.h maps different hostnames to same hashvalue:
Simple repro:
This will result in incorrect localRank calculation, and eventually result in 'invalid device ordinal' error in cudaGetDeviceProperties.