Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
9.14k stars 777 forks source link

Requesting clarification regarding the use of __GNUC__ in xxhash code #709

Open rohanaggarwal7997 opened 2 years ago

rohanaggarwal7997 commented 2 years ago

I had a small question regarding the xxhash code. There is usage of __GNUC__ in some places in the code(in footnote) to get the performance of the compiled code by GCC similar to clang . My question is that even ICC defines __GNUC__. Was the code benchmarked for the performance considerations with ICC ? or should we be using an extra !defined(__INTEL_COMPILER) .

I am talking about the following two comments in your code

XXH_FORCE_INLINE xxh_u64 XXH3_mix16B(const xxh_u8* XXH_RESTRICT input,
                                     const xxh_u8* XXH_RESTRICT secret, xxh_u64 seed64)
{
#if defined(__GNUC__) && !defined(__clang__) /* GCC, not Clang */ \
  && defined(__i386__) && defined(__SSE2__)  /* x86 + SSE2 */ \
  && !defined(XXH_ENABLE_AUTOVECTORIZE)      /* Define to disable like XXH32 hack */
    /*
     * UGLY HACK:
     * GCC for x86 tends to autovectorize the 128-bit multiply, resulting in
     * slower code.
     *
     * By forcing seed64 into a register, we disrupt the cost model and
     * cause it to scalarize. See `XXH32_round()`
     *
     * FIXME: Clang's output is still _much_ faster -- On an AMD Ryzen 3600,
     * XXH3_64bits @ len=240 runs at 4.6 GB/s with Clang 9, but 3.3 GB/s on
     * GCC 9.2, despite both emitting scalar code.
     *
     * GCC generates much better scalar code than Clang for the rest of XXH3,
     * which is why finding a more optimal codepath is an interest.
     */
    XXH_COMPILER_GUARD(seed64);
/*
 * UGLY HACK:
 * GCC usually generates the best code with -O3 for xxHash.
 *
 * However, when targeting AVX2, it is overzealous in its unrolling resulting
 * in code roughly 3/4 the speed of Clang.
 *
 * There are other issues, such as GCC splitting _mm256_loadu_si256 into
 * _mm_loadu_si128 + _mm256_inserti128_si256. This is an optimization which
 * only applies to Sandy and Ivy Bridge... which don't even support AVX2.
 *
 * That is why when compiling the AVX2 version, it is recommended to use either
 *   -O2 -mavx2 -march=haswell
 * or
 *   -O2 -mavx2 -mno-avx256-split-unaligned-load
 * for decent performance, or to use Clang instead.
 *
 * Fortunately, we can control the first one with a pragma that forces GCC into
 * -O2, but the other one we can't control without "failed to inline always
 * inline function due to target mismatch" warnings.
 */
Cyan4973 commented 2 years ago

I would be surprised if this code performance hack was checked on ICC, though with godbolt, that might actually be possible. cc @easyaspi314 , the author of this code.

easyaspi314 commented 2 years ago

Yeah those should be for GCC only.

It is really annoying that in order to check for actual GCC you have to check for the imposters.

How about just adding something like this

/* Attempts to detect *actual* GCC. */
#define XXH_IS_GCC (defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER))
/* Detects GCC and Clang (in clang-cl mode, clang does not define __GNUC__) */
#define XXH_IS_GNU (defined(__GNUC__) || defined(__clang__))

Just like how XXH_GCC_VERSION is used.

Cyan4973 commented 2 years ago

I like it

GitMensch commented 2 years ago

So PR missing?

Cyan4973 commented 1 year ago

Coming back to this topic, it seems there are still a few places in the code base where the following preprocessor test is present : #if defined(__GNUC__) && !defined(__clang__)

I presume it's the kind of test where excluding icc would (presumably) be useful too.