bcgsc / ntHash

Fast hash function for DNA/RNA sequences
http://bcgsc.github.io/ntHash/
MIT License
96 stars 13 forks source link

Problem in rol. Undefined for s=64, probably does work when s>64. #1

Closed rsharris closed 7 years ago

rsharris commented 7 years ago

Looking at the implementation of rol in nthash.hpp, line 168, and getFhval at line 178.

getFhval will give rol values for s (shift value) in the range 0..k-1. When s>64, rol will try to compute v << t, where t is a negative value. It's not clear what value that will produce, and it might be undefined in the C standard. Moreover, the v>>s half of the result is (probably) zero.

Also, when s=0, v<<(64-2) becomes v<<64 in a 64 bit word. Surprisingly, this result is described as undefined in the C standard. And in some compilers (e.g. clang), I've observed that the result can be non-zero (zero is clearly the expectation in rol).

So it appears that the hash function is broken if k>64. I'm unable to test this, though, since I wasn't able to build the code today (my compiler doesn't support one of the linker options in the Makefile). If I'm right about that being broken, I wonder if this is responsible for the poorer results of ntCard when k>64.

One solution would be for rol to use v >> (s&0x3F) and v << ((64-s)&0x3F). A similar correction is needed in ror().

rsharris commented 7 years ago

(I able to compile now, having removed the OMP stuff)

I've confirmed the issue. On my machine, a Mac with clang-800.0.42.1, rol(0x3c8bfbb395c60474,0) returns 0xFFFFFFFFFFFFFFFF rol(0x3c8bfbb395c60474,64) returns 0xFFFFFFFFFFFFFFFF rol(0x3c8bfbb395c60474,65) returns 0x00007FFF5163F6A8

If I mask the shift counts to 6 bits,as indicated in the earlier post, rol produces the correct values.

Not as easy to prove the hash function is wrong, but it's clear that when k>=64, getFhval() is going to calling rol() with values of s that will produce the wrong answer.

rsharris commented 7 years ago

Regarding the mention of ntCard in my earlier post. IN tables 2, 3, and 4 of the ntCard manuscript, the error of ntCard is much higher when k>64.

Considering the effect of the error in rol() and ror(), I believe this means that the reported hash value of a kmer w may vary depending on whether it was computed by sliding window or not, and what its left-flanking character is.

benvvalk commented 7 years ago

@rsharris:

Thanks very much for taking the time to report this!

Hamid is on vacation right now, so he may be a bit slow to respond to your comments. But I'm sure he will appreciate your feedback, and will be keen to address the issue.

mohamadi commented 7 years ago

@rsharris Thank you very much for your feedback on this. I'll get back to all your comments soon when I get full access on internet.

Just a quick note, ntHash uses precomputed msTab as default instead rotate operations, where it uses % op to handle k>=64, so there is no limit with the k value. You can see this in NTP... functions, where P stands for Precomputted seeds. I will get back to you and the details soon.

Thanks again for your valuable feedbacks.

rsharris commented 7 years ago

Ahhh, I see. I hadn't gone that far down the file. So while some of the NTP functions do use rol(), they only use it with s=1, which shouldn't fail. And I see (now) that the NTP functions are the ones demonstrated in the README's code samples.

Still probably a good idea to fix the NTxxx functions, but the problem doesn't have the impact that I inferred.

mohamadi commented 7 years ago

@rsharris Hi Bob, I just tested your examples on my MacBook Air with clang-700.0.72 the following spec, but it's working correctly:

Hamids-MacBook-Air:nthash hmohamadi$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/c++/4.2.1
Apple LLVM version 7.0.0 (clang-700.0.72)
Target: x86_64-apple-darwin16.3.0
Thread model: posix
Hamids-MacBook-Air:nthash hmohamadi$ ./ntvalid 
rol(0x3c8bfbb395c60474,0) = 0x3c8bfbb395c60474
rol(0x3c8bfbb395c60474,64) = 0x3c8bfbb395c60474
rol(0x3c8bfbb395c60474,65) = 0x7917f7672b8c08e8
rsharris commented 7 years ago

Hmmm ... I did a fresh download of the repo to my machine at home ... and I still get the wrong answers:

rol(0x3c8bfbb395c60474,0) = ffffffffffffffff rol(0x3c8bfbb395c60474,64) = ffffffffffffffff rol(0x3c8bfbb395c60474,65) = 12068 rol(0x3c8bfbb395c60474,68) = 12068

I have a different version of clang than you

g++ --version Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/c++/4.2.1 Apple LLVM version 8.0.0 (clang-800.0.42.1) Target: x86_64-apple-darwin15.6.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

My main program is (with headers copied from nttest.cpp):

int main(int argc, char** argv) {
    std::cerr << "rol(0x3c8bfbb395c60474,0)  = " << std::hex << rol(0x3c8bfbb395c60474,0)  << "\n";
    std::cerr << "rol(0x3c8bfbb395c60474,64) = " << std::hex << rol(0x3c8bfbb395c60474,64) << "\n";
    std::cerr << "rol(0x3c8bfbb395c60474,65) = " << std::hex << rol(0x3c8bfbb395c60474,65) << "\n";
    std::cerr << "rol(0x3c8bfbb395c60474,68) = " << std::hex << rol(0x3c8bfbb395c60474,68) << "\n";
    return 0;
}

Also note that I had to skip the autogen/configure stuff and just use a Makefile I copied from http://www.bcgsc.ca/platform/bioinfo/software/nthash (and modified to avoid OpenML). Compilation command is g++ -O3 -I. -Ilib -o nttest nttest.cpp lib/city.cc lib/xxhash.c Are there other compiler flags that I am missing?

mohamadi commented 7 years ago

Seems -O3 makes all these happen. Can you make another test without it?

rsharris commented 7 years ago

Without -O3 I do get the correct results for rol().

This is strange behavior. Without -O3 I get this

uint64_t x = 0x3c8bfbb395c60474;
int s = 68;
x << s      is 0xc8bfbb395c604740
x >> (64-s) is 0x0000000000000003

While that's what's needed to make rol() work, the <<68 seems to be wrong according to the C standard ( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf , section 5.8):

The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. "2^E2" is 2 raised to to E2 (the standard has a superscript E2).

This should give us (0x3c8bfbb395c60474 x 2^68) mod 2^64, which should be zero. (right?)

mohamadi commented 7 years ago

@rsharris When shifting E1 by E2>=64, it will usually get shifted to E2%64 or E2&63, but we shouldn't rely on this since it is not defined by the C standard and can be different on different architectures. Also this is the case for E2<0:

(C11, 6.5.7p3) "If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined"

(C++11, 5.8p1) "The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand."

I just added the fix for rol and ror on the ntHash library for the case users want to explicitly use these opts:

// rotate "v" to the left by "s" positions
inline uint64_t rol(const uint64_t v, int s) {
        if ((s &= 63) == 0)
                return v;
        return (v << s) | (v >> (64 - s));
}

// rotate "v" to the right by "s" positions
inline uint64_t ror(const uint64_t v, int s) {
        if ((s &= 63) == 0)
                return v;
        return (v >> s) | (v << (64 - s));
}

Thanks again for catching this.

rsharris commented 7 years ago

but we shouldn't rely on this since it is not defined by the C standard and can be different on different architectures.

Exactly.

I just added the fix for rol and ror ...

Those look correct, to my eye.