Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
8.99k stars 770 forks source link

Use integers with the specific size #130

Closed koraa closed 6 years ago

koraa commented 6 years ago

The state is defined like this:

struct XXH32_state_s {
   unsigned total_len_32;
   unsigned large_len;
   unsigned v1;
   unsigned v2;
   unsigned v3;
   unsigned v4;
   unsigned mem32[4];   /* buffer defined as U32 for alignment */
   unsigned memsize;
   unsigned reserved;   /* never read nor write, will be removed in a future version */
};   /* typedef'd to XXH32_state_t */

The xxhash64 state uses unsigned long long; these are technically not guranteed to be 32bits (although they really usually are); the defines of specific size could be used to make sure the variables really are 32bit long (uint32_t).

Compiling on some arm targets I also get this:

xxhash.c:475:37: error: invalid conversion from 'unsigned int*' to 'const U32* {aka const long unsigned int*}'

This error is encountered because – even though unsigned int and long unsigned int actually are the same size (32bit) – the compiler considers them to be different types. I think this problem could also be alleviated by using the specific size types?

Cyan4973 commented 6 years ago

I suspect the issue is Windows related ? In which case, I should probably consider adding some Windows test to CI suite, in order to reproduce such issue.

koraa commented 6 years ago

Hiho, thanks for responding!!

In this specific case I am compiling for an arm cortex m7 (embedded, bare metal, no OS), so no windows at all is involved :) And with that one error fixed, xxhash works great in that env!

Edit: arm-none-eabi-g++ -Os -flto -mcpu=cortex-m7 -mthumb -mfpu=fpv5-sp-d16 -mfloat-abi=hard -lc -lm -lnosys -nostartfiles -gdwarf-2 -ggdb3 -Wall -fdata-sections -ffunction-sections -Wl,--gc-sections -std=c++17 -Wextra -Wpedantic -Wshadow -Wno-missing-braces -fno-exceptions -fno-rtti -DXXH_INLINE_ALL -DXXH_FORCE_MEMORY_ACCESS=1 gcc version 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204] (GNU Tools for Arm Embedded Processors 7-2017-q4-major)

Cyan4973 commented 6 years ago

Got it, I could reproduce it, using precisely arm-none-eabi-g++ as in your example. It wouldn't reproduce with any other arm/gcc variant (including arm-none-eabi-gcc or arm-linux-gnueabi-g++).

Cyan4973 commented 6 years ago

OK, so the main issue is, using <stdint.h>, while tempting, is causing portability problems. Not everything out there is able to provide this header, starting with Microsoft Visual <= 2012 (or 2013?), but also a lot of other less popular systems.

So, that's why there are 2 sets of definitions, one using "standard" types (unsigned, unsigned long long, etc.), and the other one using <stdint.h> (uint32_t, uint64_6, etc.). When one system is used in the *.h, and the other in the *.c, it may result in type incompatibility (depending on compiler permissiveness and settings).

So, I don't have a great solution here. Initially, that's one of the reasons I wanted to keep these definitions "private", and only export a "shell space", featuring correct space and alignment, but opaque regarding its members and types. It turns out this is cause for concerns with regards to the strict-aliasing rule.

So the work around was to actually publish the state, with a set of warnings to tell "do not use members !" (of course, some people do nonetheless use members ... :(). But now we also have this other problem of type incompatibility.

The solution selected by lz4 is to publish both versions. It's a bit verbose, but works, as long as the macro deciding which type system to use is the same on both *.h and *.c.

Short of a better solution, I guess that's what I will do here.

Cyan4973 commented 6 years ago

132 seems to work on my test configuration

koraa commented 6 years ago

Thanks for looking into this so thoroughly! I don do quite understand why you are not using typedefs instead of publishing two #ifdef versions? The patch definately works for my test case :)

Cyan4973 commented 6 years ago

I don do quite understand why you are not using typedefs instead of publishing two #ifdef versions?

This is to reduce namespace pollution. BYTE, U32 and such are too generic, and may overlap with user's own symbol names.

koraa commented 6 years ago

OK, thanks for the info and the effort!