conda-forge / tensorflow-feedstock

A conda-smithy repository for tensorflow.
BSD 3-Clause "New" or "Revised" License

Unbundling oneDNN #183

Open h-vetinari opened 2 years ago

h-vetinari commented 2 years ago

Breaking out the discussion from #175:

@njzjz: Currently tensorflow here is not built against MKL (OneDNN), but do we plan to?

@xhochy: Tensorflow is building oneDNN from source in the build currently. That is probably the first 1-2h of CI. Getting that from the feedstock instead of building from source would help a lot.

I looked at how upstream builds this, and - annoyingly - it's a bit of a mess. There seem to be three different versions of onednn being pulled in, with one of them (0.21.3) being quite old and not available in conda-forge.

I'm happy to start a PR that tries to rip things out and see how far we get, but I just wanted to get an opinion (@xhochy?) on whether replacing the upstream version is even realistic given the different versions. Or maybe we just unbundle the most important one?

h-vetinari commented 2 years ago

Here are the build files for the different versions (the versions set manually in the bazel scripts don't match the artefacts they're pulling in).

ngam commented 2 years ago

Maybe I am a bit late or uninformed about this, but the issue with the onednn stuff is that 1) it's kinda not worth it in my testing, and 2) it is actually really easy to link them after the fact: you just need to declare an env variable, TF_ENABLE_ONEDNN_OPTS, but I am not sure if that's actually as easy as it seems to be. I have tended to add

```yaml
  - onednn # optimization for intel onednn library
  - pip
  - pip:
    - tensorflow # pip install for 2.7.0 (cpu)
    - tensorflow-gan

variables:
  TF_ENABLE_ONEDNN_OPTS: 1 # enables onednn optimization
```

at the end of my env.yml file (for cpus; I usually get tensorflow-gpu from conda for GPUs). Does the build need to be modified for this env variable to work? I remember it working fine with the conda-forge tensorflow, though I usually get the CPU version from pip (as above).

See the intel info for reference here.
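As a minimal sketch of the same idea done in-process rather than via env.yml (assuming TF_ENABLE_ONEDNN_OPTS just needs to be visible before TensorFlow is imported; the variable name is the only thing taken from the Intel docs, the rest is illustrative):

```python
import os

# Assumption: TF_ENABLE_ONEDNN_OPTS is picked up when TensorFlow initializes,
# so it has to be set before the import below; no rebuild of the conda-forge
# package is implied here.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf

print("TF version:", tf.__version__)
```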

ngam commented 2 years ago

Maybe a simple test can also be added to ensure this? I usually test by running this python file... which is from their "sanity check", but if you look closely, it seems like a phony test. Anyway, the Intel optimization for tensorflow is very underwhelming. I really don't think Intel is serious about this stuff, and their efforts seem haphazard and super low-energy. Both their devcloud and the whole oneapi stuff are a total mess. (You can browse their github repos to see what they are up to...)

```python
import os

import tensorflow as tf


def get_mkl_enabled_flag():
    mkl_enabled = False
    onednn_enabled = 0  # default so the TF < 2.5 branch below doesn't raise NameError
    major_version = int(tf.__version__.split(".")[0])
    minor_version = int(tf.__version__.split(".")[1])
    if major_version >= 2:
        if minor_version < 5:
            from tensorflow.python import _pywrap_util_port
        else:
            from tensorflow.python.util import _pywrap_util_port
            onednn_enabled = int(os.environ.get('TF_ENABLE_ONEDNN_OPTS', '0'))
        mkl_enabled = _pywrap_util_port.IsMklEnabled() or (onednn_enabled == 1)
    else:
        # TF 1.x exposes the check on the pywrap_tensorflow module instead
        mkl_enabled = tf.pywrap_tensorflow.IsMklEnabled()
    return mkl_enabled


print("We are using Tensorflow version", tf.__version__)
print("MKL enabled :", get_mkl_enabled_flag())
print(tf.config.list_physical_devices())
print(tf.test.is_gpu_available())
print(tf.test.gpu_device_name())
```
xhochy commented 2 years ago

@ngam This issue is not about using/not using the oneDNN ops, but whether we build oneDNN as part of the tensorflow build or whether we can just link against the existing oneDNN package.

ngam commented 2 years ago

@ngam This issue is not about using/not using the oneDNN ops, but whether we build oneDNN as part of the tensorflow build or whether we can just link against the existing oneDNN package.

Oooh… thanks for explaining. Well, the latter makes much more sense (linking against the existing one), especially in light of what I said about its actual utility. Then you can test whether the linking works…

ngam commented 2 years ago

I would like to revisit my statement above. I could finally see the advantage of the onednn optimization: it speeds up code by 50% for smaller loads, and by up to 400% for more sustained loads. Also, the AVX-512 targeting seems to play a key role; that alone can result in a significant speedup.

Interestingly, the tensorflow from conda-forge (2.7.0) prints out

2022-04-28 12:42:42.979985: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA

but usually tensorflow from PyPI or Intel prints out

2022-04-28 12:45:23.794234: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA

so... seemingly, we have better optimization here? This results in a bit of performance improvement (i.e. comparing v2.7.0 from conda-forge to PyPI or Intel). My current case is one with a lot of intermediate mapping (and dealing with very large files).

Edit: NGC containers from NVIDIA behave very similarly to conda-forge's tensorflow, including the small, but noticeable, speedup.

h-vetinari commented 2 years ago

Interestingly, the tensorflow from conda-forge (2.7.0) prints out

[...]:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA

Do we know how this is handled for older CPUs (e.g. without AVX512) - gracefully or do things crash?

Before we have something like https://github.com/conda-forge/conda-forge.github.io/issues/1261, we shouldn't be compiling things that hard-require more than SSE4 (I think...). Or perhaps AVX can be the new baseline?

As an example, faiss (both upstream and in conda-forge) does fat binaries that compile both against SSE4 and AVX2 and load the latter if possible at runtime, but that's not a scalable approach IMO. See https://github.com/conda-forge/faiss-split-feedstock/issues/23
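To make the runtime-dispatch idea concrete, here is a minimal sketch (Linux-only, and the library file names are hypothetical rather than anything faiss or TF actually ship) of detecting CPU flags once and then loading the most capable variant the machine supports:

```python
# Sketch of faiss-style fat-binary dispatch: read the CPU flags from
# /proc/cpuinfo (Linux-only assumption), then pick a build variant.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


flags = cpu_flags()
if "avx2" in flags:
    variant = "libops_avx2.so"   # preferred build; only safe if the CPU has AVX2
else:
    variant = "libops_sse4.so"   # fallback build for older CPUs
print("would load:", variant)
```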

ngam commented 2 years ago

Do we know how this is handled for older CPUs (e.g. without AVX512) - gracefully or do things crash?

Works perfectly fine!

Actually, not only does it work perfectly fine; for some reason it is still a bit faster (even on machines without AVX-512) than the Intel and PyPI tensorflows.

ngam commented 2 years ago

I don't know if this is mainly to do with my esoteric data pipeline (shuffling ~100s of GB of data, so CPU performance actually matters in my case even when running on GPUs). I did a few more tests to see if it is consistent across different machines (CPU-only, GPU-CPU, different architectures) and it held up.

h-vetinari commented 2 years ago

Works perfectly fine!

That's great news - seems that this is built with fall-backs in place, which isn't always the case (without guards, if you try to run AVX2 code on a non-AVX2 machine, you'll normally crash loudly).

Out of curiosity, what does cpu_feature_guard.cc print on a non-AVX512 machine?

ngam commented 2 years ago

Works perfectly fine!

That's great news - seems that this is built with fall-backs in place, which isn't always the case (without guards, if you try to run AVX2 code on a non-AVX2 machine, you'll normally crash loudly).

yep try pip install intel-tensorflow-avx512...

Out of curiosity, what does cpu_feature_guard.cc print on a non-AVX512 machine?

Same, just without the AVX512

```
2022-04-28 22:43:22.335803: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
```

```
$ mamba list | grep tensorflow
tensorflow                2.7.0           cuda112py310he87a039_0    conda-forge
tensorflow-base           2.7.0           cuda112py310h2bd284a_0    conda-forge
tensorflow-estimator      2.7.0           cuda112py310h922d117_0    conda-forge
$ lscpu | grep avx
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
```
h-vetinari commented 2 years ago

Out of curiosity, what does cpu_feature_guard.cc print on a non-AVX512 machine?

Same, just without the AVX512

That's kind of the point - it has some dynamic feature detection and presumably uses this behind the scenes to load the correct libs/symbols. 🙃

ngam commented 2 years ago

How do you make sense of the above with this line in build.sh?

https://github.com/conda-forge/tensorflow-feedstock/blob/c43f2820986d3eb74f63c23eadf85313dc54c635/recipe/build.sh#L134

ngam commented 2 years ago

Also, in the future, should we follow NVIDIA's build? Notable differences are:

```shell
export TF_NEED_CUDA=1
export TF_NEED_CUTENSOR=1
export TF_NEED_TENSORRT=1
export TF_CUDA_PATHS=/usr,/usr/local/cuda
export TF_CUDA_VERSION=$(echo "${CUDA_VERSION}" | cut -d . -f 1-2)
export TF_CUBLAS_VERSION=$(echo "${CUBLAS_VERSION}" | cut -d . -f 1)
export TF_CUDNN_VERSION=$(echo "${CUDNN_VERSION}" | cut -d . -f 1)
export TF_NCCL_VERSION=$(echo "${NCCL_VERSION}" | cut -d . -f 1)
export TF_TENSORRT_VERSION=$(echo "${TRT_VERSION}" | cut -d . -f 1)
export TF_ENABLE_XLA=1
export TF_NEED_HDFS=0
if [ "${TARGETARCH}" = "amd64" ] ; then export CC_OPT_FLAGS="-march=sandybridge -mtune=broadwell" ; fi
if [ "${TARGETARCH}" = "arm64" ] ; then export CC_OPT_FLAGS="-march=armv8-a" ; fi
export TF_USE_CCACHE
# This is the flags that Google use internally with clang.
# Adding them help make the compilation error more readable even if g++ doesn't do exactly as clang.
# Google use -Werror too. But we can't use it as the error from g++ isn't exactly the same and will trigger useless compilation error.
export CC_OPT_FLAGS="$CC_OPT_FLAGS -Wno-address-of-packed-member -Wno-defaulted-function-deleted -Wno-enum-compare-switch -Wno-expansion-to-defined -Wno-ignored-attributes -Wno-ignored-qualifiers -Wno-inconsistent-missing-override -Wno-int-in-bool-context -Wno-misleading-indentation -Wno-potentially-evaluated-expression -Wno-psabi -Wno-range-loop-analysis -Wno-return-std-move -Wno-sizeof-pointer-div -Wno-sizeof-array-div -Wno-string-concatenation -Wno-tautological-constant-compare -Wno-tautological-type-limit-compare -Wno-tautological-undefined-compare -Wno-tautological-unsigned-zero-compare -Wno-tautological-unsigned-enum-zero-compare -Wno-undefined-func-template -Wno-unused-lambda-capture -Wno-unused-local-typedef -Wno-void-pointer-to-int-cast -Wno-uninitialized-const-reference -Wno-compound-token-split -Wno-ambiguous-member-template -Wno-char-subscripts -Wno-error=deprecated-declarations -Wno-extern-c-compat -Wno-gnu-alignof-expression -Wno-gnu-variable-sized-type-not-at-end -Wno-implicit-int-float-conversion -Wno-invalid-source-encoding -Wno-mismatched-tags -Wno-pointer-sign -Wno-private-header -Wno-sign-compare -Wno-signed-unsigned-wchar -Wno-strict-overflow -Wno-trigraphs -Wno-unknown-pragmas -Wno-unused-const-variable -Wno-unused-function -Wno-unused-private-field -Wno-user-defined-warnings -Wvla -Wno-reserved-user-defined-literal -Wno-return-type-c-linkage -Wno-self-assign-overloaded -Woverloaded-virtual -Wnon-virtual-dtor -Wno-deprecated -Wno-invalid-offsetof -Wimplicit-fallthrough -Wno-final-dtor-non-final-class -Wno-c++20-designator -Wno-register -Wno-dynamic-exception-spec"
```