Closed rsharris closed 7 years ago
(I able to compile now, having removed the OMP stuff)
I've confirmed the issue. On my machine, a Mac with clang-800.0.42.1, rol(0x3c8bfbb395c60474,0) returns 0xFFFFFFFFFFFFFFFF rol(0x3c8bfbb395c60474,64) returns 0xFFFFFFFFFFFFFFFF rol(0x3c8bfbb395c60474,65) returns 0x00007FFF5163F6A8
If I mask the shift counts to 6 bits,as indicated in the earlier post, rol produces the correct values.
Not as easy to prove the hash function is wrong, but it's clear that when k>=64, getFhval() is going to calling rol() with values of s that will produce the wrong answer.
Regarding the mention of ntCard in my earlier post. IN tables 2, 3, and 4 of the ntCard manuscript, the error of ntCard is much higher when k>64.
Considering the effect of the error in rol() and ror(), I believe this means that the reported hash value of a kmer w may vary depending on whether it was computed by sliding window or not, and what its left-flanking character is.
@rsharris:
Thanks very much for taking the time to report this!
Hamid is on vacation right now, so he may be a bit slow to respond to your comments. But I'm sure he will appreciate your feedback, and will be keen to address the issue.
@rsharris Thank you very much for your feedback on this. I'll get back to all your comments soon when I get full access on internet.
Just a quick note, ntHash uses precomputed msTab as default instead rotate operations, where it uses % op to handle k>=64, so there is no limit with the k value. You can see this in NTP... functions, where P stands for Precomputted seeds. I will get back to you and the details soon.
Thanks again for your valuable feedbacks.
Ahhh, I see. I hadn't gone that far down the file. So while some of the NTP functions do use rol(), they only use it with s=1, which shouldn't fail. And I see (now) that the NTP functions are the ones demonstrated in the README's code samples.
Still probably a good idea to fix the NTxxx functions, but the problem doesn't have the impact that I inferred.
@rsharris Hi Bob, I just tested your examples on my MacBook Air with clang-700.0.72 the following spec, but it's working correctly:
Hamids-MacBook-Air:nthash hmohamadi$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/c++/4.2.1
Apple LLVM version 7.0.0 (clang-700.0.72)
Target: x86_64-apple-darwin16.3.0
Thread model: posix
Hamids-MacBook-Air:nthash hmohamadi$ ./ntvalid
rol(0x3c8bfbb395c60474,0) = 0x3c8bfbb395c60474
rol(0x3c8bfbb395c60474,64) = 0x3c8bfbb395c60474
rol(0x3c8bfbb395c60474,65) = 0x7917f7672b8c08e8
Hmmm ... I did a fresh download of the repo to my machine at home ... and I still get the wrong answers:
rol(0x3c8bfbb395c60474,0) = ffffffffffffffff rol(0x3c8bfbb395c60474,64) = ffffffffffffffff rol(0x3c8bfbb395c60474,65) = 12068 rol(0x3c8bfbb395c60474,68) = 12068
I have a different version of clang than you
g++ --version Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/c++/4.2.1 Apple LLVM version 8.0.0 (clang-800.0.42.1) Target: x86_64-apple-darwin15.6.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
My main program is (with headers copied from nttest.cpp):
int main(int argc, char** argv) {
std::cerr << "rol(0x3c8bfbb395c60474,0) = " << std::hex << rol(0x3c8bfbb395c60474,0) << "\n";
std::cerr << "rol(0x3c8bfbb395c60474,64) = " << std::hex << rol(0x3c8bfbb395c60474,64) << "\n";
std::cerr << "rol(0x3c8bfbb395c60474,65) = " << std::hex << rol(0x3c8bfbb395c60474,65) << "\n";
std::cerr << "rol(0x3c8bfbb395c60474,68) = " << std::hex << rol(0x3c8bfbb395c60474,68) << "\n";
return 0;
}
Also note that I had to skip the autogen/configure stuff and just use a Makefile I copied from http://www.bcgsc.ca/platform/bioinfo/software/nthash (and modified to avoid OpenML). Compilation command is
g++ -O3 -I. -Ilib -o nttest nttest.cpp lib/city.cc lib/xxhash.c
Are there other compiler flags that I am missing?
Seems -O3
makes all these happen. Can you make another test without it?
Without -O3 I do get the correct results for rol().
This is strange behavior. Without -O3 I get this
uint64_t x = 0x3c8bfbb395c60474;
int s = 68;
x << s is 0xc8bfbb395c604740
x >> (64-s) is 0x0000000000000003
While that's what's needed to make rol() work, the <<68 seems to be wrong according to the C standard ( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf , section 5.8):
The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. "2^E2" is 2 raised to to E2 (the standard has a superscript E2).
This should give us (0x3c8bfbb395c60474 x 2^68) mod 2^64, which should be zero. (right?)
@rsharris When shifting E1
by E2>=64
, it will usually get shifted to E2%64
or E2&63
, but we shouldn't rely on this since it is not defined by the C standard and can be different on different architectures. Also this is the case for E2<0
:
(C11, 6.5.7p3) "If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined"
(C++11, 5.8p1) "The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand."
I just added the fix for rol and ror on the ntHash library for the case users want to explicitly use these opts:
// rotate "v" to the left by "s" positions
inline uint64_t rol(const uint64_t v, int s) {
if ((s &= 63) == 0)
return v;
return (v << s) | (v >> (64 - s));
}
// rotate "v" to the right by "s" positions
inline uint64_t ror(const uint64_t v, int s) {
if ((s &= 63) == 0)
return v;
return (v >> s) | (v << (64 - s));
}
Thanks again for catching this.
but we shouldn't rely on this since it is not defined by the C standard and can be different on different architectures.
Exactly.
I just added the fix for rol and ror ...
Those look correct, to my eye.
Looking at the implementation of rol in nthash.hpp, line 168, and getFhval at line 178.
getFhval will give rol values for s (shift value) in the range 0..k-1. When s>64, rol will try to compute v << t, where t is a negative value. It's not clear what value that will produce, and it might be undefined in the C standard. Moreover, the v>>s half of the result is (probably) zero.
Also, when s=0, v<<(64-2) becomes v<<64 in a 64 bit word. Surprisingly, this result is described as undefined in the C standard. And in some compilers (e.g. clang), I've observed that the result can be non-zero (zero is clearly the expectation in rol).
So it appears that the hash function is broken if k>64. I'm unable to test this, though, since I wasn't able to build the code today (my compiler doesn't support one of the linker options in the Makefile). If I'm right about that being broken, I wonder if this is responsible for the poorer results of ntCard when k>64.
One solution would be for rol to use v >> (s&0x3F) and v << ((64-s)&0x3F). A similar correction is needed in ror().