@szha
@ykim362 what's your recommendation on this? MKL2018 doesn't distribute .a, which means that I can't hide any of its symbols when loading.
@b0noI - how do we resolve the issue on DLAMI?
@szha I am discussing with the MKL team to figure out how to handle this situation. I will get back to you soon.
@fhieber Could you provide your numpy (conda) version?
@szha, Can we link libiomp5.a statically and then strip the symbols from the final libmxnet.so? I am theorizing (I think correctly) that two completely separate versions of openmp can coexist when they apply only to two different modules.
@cjolivier01 how can this work, when the data structures passed to the functions of the library change? This would only work if both versions of openmp guaranteed binary compatibility. Is that the case?
@larroy Not sure what you're trying to say here. If the symbols are stripped from libmxnet.so, then the libnumpy.so (or whatever it's called)'s omp.so and libmxnet.so (with its internal omp pool) won't see each other at all, much less call into each other.
What you'll get is numpy using a separate thread pool from libmxnet.so's, which probably doesn't matter much because they generally won't be running in parallel anyway.
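For illustration, hiding the statically linked OpenMP symbols could look roughly like the following. This is only a sketch under the assumption that a PIC-built libiomp5.a were available (which, per the comments below, Intel does not currently ship for this purpose); object and file names are made up:

```sh
# Link libiomp5.a into libmxnet.so but keep its symbols local, so an OpenMP
# runtime loaded later by numpy cannot resolve against them (GNU ld).
g++ -shared -o libmxnet.so ${MXNET_OBJS} libiomp5.a -Wl,--exclude-libs,libiomp5.a

# Verify that no OpenMP/KMP symbols are exported from the final library.
nm -D --defined-only libmxnet.so | grep -Ei 'omp_|__kmp' || echo "no OpenMP symbols exported"
```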
@cjolivier01 When I discussed this issue with the MKL team, it turned out not to be very feasible to make a static MKLML, given that it has lots of binary dependencies. Another aspect is that there is a possibility that two separate OpenMP runtimes (one from numpy and one from MXNet with MKLML) could degrade performance. We are working on several experiments. I would like to report the results next week.
@cjolivier01 I see. I thought the problem was linking with the same symbols. I found this related to the issue: https://software.intel.com/en-us/inference-engine-devguide-known-issues
@cjolivier01 From the experiments, the behavior varies case by case. Some combinations of versions work fine; some do not. So, the workaround @larroy pointed out (the LD_PRELOAD env variable) might be a good option. Also, it is recommended to use MKL-enabled MXNet with non-MKL python to get the best performance for neural network workloads.
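For anyone trying the LD_PRELOAD workaround mentioned above, it amounts to forcing a single OpenMP runtime to be loaded before anything else. A rough sketch only; the library path and script name below are placeholders:

```sh
# Placeholder path: point this at the libiomp5.so that both numpy and MXNet
# should share, so only one OpenMP runtime ends up in the process.
LD_PRELOAD=/path/to/libiomp5.so python your_training_script.py
```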
@ykim362 our goal is to be able to provide good out-of-the-box experience for MKL-enabled mxnet. Though the workaround might be good for experienced users, we cannot control the environment in which mxnet runs for pip users. Thus, I'm still leaning towards the way @cjolivier01 proposed above if possible.
Do you mind sharing more on the dependencies of mklml to help figure out the feasibility of static-linking?
@szha I totally agree with you regarding the OOB experience! Actually, our concern is that static linking doesn't solve the problem. How about trying the rpath compile option?
@fhieber Could you provide some more detailed information to reproduce this situation? The behavior varies, so I would like to test with exactly the same environment.
@ykim362 could you elaborate? Currently I'm using '${ORIGIN}' in rpath for mkl builds already, and shipping mkl shared objects along with libmxnet.so. I don't think this prevents other omp from being loaded first and hence doesn't solve our problem. Are you referring to a different solution that's based on rpath?
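For context, the '${ORIGIN}' approach mentioned here amounts to something like the following at link time. A rough sketch, not the exact MXNet build flags; the library names follow the MKLML distribution, and in a Makefile the rpath would need to be written as '$$ORIGIN':

```sh
# Link against the bundled MKLML/OpenMP shared objects and make libmxnet.so
# search its own directory first at load time.
g++ -shared -o libmxnet.so ${MXNET_OBJS} -L. -lmklml_intel -liomp5 -Wl,-rpath,'$ORIGIN'

# Inspect the result: RPATH/RUNPATH and the dynamic dependencies.
readelf -d libmxnet.so | grep -E 'RPATH|RUNPATH|NEEDED'
```

As noted above, this only controls where libmxnet.so finds its own copies; it does not stop another libiomp5.so from being loaded first by numpy.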
MKL dev here. Can somebody please verify that the pip packages are not linked to OpenMP (Intel's libiomp5) statically? If any one of them is, then this would be a problem. If every single package links to OpenMP dynamically, this should probably work (disclaimer: this needs to be verified; the libmkl_rt.so on Linux loads libiomp5.so via dlopen and I'm not 100% sure what would happen if the library is loaded multiple times for example using RTLD_LOCAL...)
@rsdubtso Hi. I can verify that the mxnet pip package is linked dynamically to libiomp5.so. We include libiomp5.so in the pip package using the rpath trick described above.
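One way to double-check this on an installed pip package (illustrative commands; the assumed location of libmxnet.so inside the package may differ across versions):

```sh
# Locate the installed libmxnet.so (assumed to live next to the mxnet package).
MXNET_LIB="$(python -c 'import mxnet, os; print(os.path.dirname(mxnet.__file__))')/libmxnet.so"

# Dynamic linkage shows libiomp5.so as a NEEDED entry / in the ldd output;
# statically linked OpenMP would instead show omp_/__kmp symbols defined in
# libmxnet.so itself.
readelf -d "$MXNET_LIB" | grep NEEDED
ldd "$MXNET_LIB" | grep -iE 'iomp|mkl'
```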
Hm. Seems like the numpy package also uses dynamic linking to libiomp5. Do you have a small repro that I could try? (Please do include all the necessary package installation steps -- I'm pretty new to all this stuff...)
PS. I also tried installing mxnet for python2.7 conda and it seems to be missing libmxnet.so completely... should I be using conda for python3?
@fhieber Could you provide the conda version you used or some more detailed environment when you got this problem? mxnet-mkl==0.12.0 and 0.12.1 work fine with conda in my tests.
@ykim362 I just tried this again in a Docker image with mxnet-cu90mkl==1.0.0.post4 and the following conda info output:
platform : linux-64
conda version : 4.3.30
conda is private : False
conda-env version : 4.3.30
conda-build version : not installed
python version : 3.6.3.final.0
requests version : 2.18.4
root environment : /opt/conda (writable)
default environment : /opt/conda
envs directories : /opt/conda/envs
/root/.conda/envs
package cache : /opt/conda/pkgs
/root/.conda/pkgs
channel URLs : https://repo.continuum.io/pkgs/main/linux-64
https://repo.continuum.io/pkgs/main/noarch
https://repo.continuum.io/pkgs/free/linux-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/linux-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/linux-64
https://repo.continuum.io/pkgs/pro/noarch
config file : None
netrc file : None
offline mode : False
user-agent : conda/4.3.30 requests/2.18.4 CPython/3.6.3 Linux/4.9.58-18.51.amzn1.x86_64 / glibc/2.17
UID:GID : 0:0
I still observe the same crash:
> OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
> OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
@fhieber @rsdubtso is correct. It is caused by multiple OpenMP libraries. The workaround is to set KMP_DUPLICATE_LIB_OK=TRUE.
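For completeness, the workaround is just an environment variable, set before the process starts (the script name is a placeholder):

```sh
# Unsafe, unsupported workaround per the OMP hint above: tolerate multiple
# OpenMP runtimes in one process.
export KMP_DUPLICATE_LIB_OK=TRUE
python your_training_script.py
```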
I am considering the final solution now. Will come back soon :)
Error log:
[INFO:sockeye.training] Using bucketing. Default max_seq_len=(60, 60)
[INFO:main] Optimizer: adam
[INFO:main] Optimizer Parameters: {'wd': 0.0, 'learning_rate': 0.0003, 'lr_scheduler': LearningRateSchedulerPlateauReduce(reduce_factor=0.50, reduce_num_not_improved=0), 'clip_gradient': 1.0, 'rescale_grad': 1.0}
[INFO:main] kvstore: device
[INFO:main] Gradient Compression: None
[INFO:main] Decode and Evaluate Device(s): cpu(0)
[INFO:sockeye.model] Saved config to "/home/chenxiny/newstest/wmt_model/config"
OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Aborted
After setting the environment variable, the problem is gone. New log:
[INFO:sockeye.utils] Sockeye version 1.16.2 commit 762ce78e4e49b9ba5d14eb0a48d97f19c8807707 [INFO:sockeye.utils] MXNet version 1.1.0 [INFO:sockeye.utils] Command: /home/chenxiny/anaconda3/lib/python3.6/site-packages/sockeye/train.py -s corpus.tc.BPE.de -t corpus.tc.BPE.en -vs newstest2016.tc.BPE.de -vt newstest2016.tc.BPE.en --num-embed 256 --rnn-num-hidden 512 --rnn-attention-type dot --max-seq-len 60 --decode-and-evaluate 500 -o wmt_model [INFO:sockeye.utils] Arguments: Namespace(allow_missing_params=False, batch_size=64, batch_type='sentence', bucket_width=10, checkpoint_frequency=1000, cnn_activation_type='glu', cnn_hidden_dropout=0.0, cnn_kernel_width=(3, 5), cnn_num_hidden=512, cnn_positional_embedding_type='learned', cnn_project_qkv=False, conv_embed_add_positional_encodings=False, conv_embed_dropout=0.0, conv_embed_max_filter_width=8, conv_embed_num_filters=(200, 200, 250, 250, 300, 300, 300, 300), conv_embed_num_highway_layers=4, conv_embed_output_dim=None, conv_embed_pool_stride=5, decode_and_evaluate=500, decode_and_evaluate_device_id=None, decode_and_evaluate_use_cpu=False, decoder='rnn', device_ids=[-1], disable_device_locking=False, embed_dropout=(0.0, 0.0), embed_weight_init='default', encoder='rnn', fill_up='replicate', gradient_clipping_threshold=1.0, gradient_clipping_type='abs', gradient_compression_threshold=0.5, gradient_compression_type=None, initial_learning_rate=0.0003, keep_last_params=-1, kvstore='device', label_smoothing=0.0, layer_normalization=False, learning_rate_decay_optimizer_states_reset='off', learning_rate_decay_param_reset=False, learning_rate_half_life=10, learning_rate_reduce_factor=0.5, learning_rate_reduce_num_not_improved=3, learning_rate_schedule=None, learning_rate_scheduler_type='plateau-reduce', learning_rate_warmup=0, lock_dir='/tmp', loss='cross-entropy', loss_normalization_type='valid', max_num_checkpoint_not_improved=8, max_num_epochs=None, max_seq_len=(60, 60), max_updates=None, metrics=['perplexity'], min_num_epochs=None, momentum=None, monitor_pattern=None, monitor_stat_func='mx_default', no_bucketing=False, num_embed=(256, 256), num_layers=(1, 1), num_words=(50000, 50000), optimized_metric='perplexity', optimizer='adam', optimizer_params=None, output='wmt_model', overwrite_output=False, params=None, prepared_data=None, quiet=False, rnn_attention_coverage_num_hidden=1, rnn_attention_coverage_type='count', rnn_attention_in_upper_layers=False, rnn_attention_mhdot_heads=None, rnn_attention_num_hidden=None, rnn_attention_type='dot', rnn_attention_use_prev_word=False, rnn_cell_type='lstm', rnn_context_gating=False, rnn_decoder_hidden_dropout=0.0, rnn_decoder_state_init='last', rnn_dropout_inputs=(0.0, 0.0), rnn_dropout_recurrent=(0.0, 0.0), rnn_dropout_states=(0.0, 0.0), rnn_encoder_reverse_input=False, rnn_first_residual_layer=2, rnn_forget_bias=0.0, rnn_h2h_init='orthogonal', rnn_num_hidden=512, rnn_residual_connections=False, seed=13, shared_vocab=False, source='corpus.tc.BPE.de', source_vocab=None, target='corpus.tc.BPE.en', target_vocab=None, transformer_activation_type='relu', transformer_attention_heads=8, transformer_dropout_act=0.0, transformer_dropout_attention=0.0, transformer_dropout_prepost=0.0, transformer_feed_forward_num_hidden=2048, transformer_model_size=512, transformer_positional_embedding_type='fixed', transformer_postprocess=('drn', 'drn'), transformer_preprocess=('', ''), use_cpu=False, use_tensorboard=False, validation_source='newstest2016.tc.BPE.de', validation_target='newstest2016.tc.BPE.en', weight_decay=0.0, 
weight_init='xavier', weight_init_scale=2.34, weight_init_xavier_factor_type='in', weight_init_xavier_rand_type='uniform', weight_normalization=False, weight_tying=False, weight_tying_type='trg_softmax', word_min_count=(1, 1)) [INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [-1] [INFO:sockeye.utils] Acquired GPU 0. [INFO:main] Training Device(s): GPU [0] [INFO:sockeye.vocab] Building vocabulary from dataset(s): ['corpus.tc.BPE.de'] [INFO:sockeye.vocab] Vocabulary: types: 22309/22309/22309/22313 (initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000] [INFO:sockeye.vocab] Building vocabulary from dataset(s): ['corpus.tc.BPE.en'] [INFO:sockeye.vocab] Vocabulary: types: 16757/16757/16757/16761 (initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000] [INFO:sockeye.data_io] =============================== [INFO:sockeye.data_io] Creating training data iterator [INFO:sockeye.data_io] =============================== [INFO:sockeye.data_io] 193369 sequences of maximum length (60, 60) in '/home/chenxiny/newstest/corpus.tc.BPE.de' and '/home/chenxiny/newstest/corpus.tc.BPE.en'. [INFO:sockeye.data_io] Mean training target/source length ratio: 0.99 (+-0.45) [INFO:sockeye.data_io] Tokens: source 4563838 target 4466824 [INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100% [INFO:sockeye.data_io] 193369 sequences across 6 buckets [INFO:sockeye.data_io] 6631 sequences did not fit into buckets and were discarded [INFO:sockeye.data_io] Bucket (10, 10): 32318 samples in 505 batches of 64, ~362.2 tokens/batch. [INFO:sockeye.data_io] Bucket (20, 20): 46870 samples in 733 batches of 64, ~949.5 tokens/batch. [INFO:sockeye.data_io] Bucket (30, 30): 49430 samples in 773 batches of 64, ~1511.9 tokens/batch. [INFO:sockeye.data_io] Bucket (40, 40): 35885 samples in 561 batches of 64, ~2070.4 tokens/batch. [INFO:sockeye.data_io] Bucket (50, 50): 19636 samples in 307 batches of 64, ~2618.3 tokens/batch.
@pengzhao-intel any update?
Unless Intel provides a static library for libiomp5 built with the -fPIC flag, or drops the dependency on it, I don't see anything we can do from the MXNet side. Suggestions are welcome.
@szha, @ashokei is talking with the MKL team about using libgomp in MXNet so it can avoid the dependency on libiomp.
@ashokei any update? Please correct me if my understanding is not correct.
@pengzhao-intel thanks for the update. @ashokei let us know how things progress and how this change impacts performance.
@zheng-da is your openmp linking issue related to this in any way? Does your solution/workaround fix this?
No, I don't have a workaround so far.
any updates on this?
@tdomhan after several investigations, I realized it's not easy to resolve this issue under the current setup. We have to change some of the build logic.
Do you still encounter this error? Does the workaround KMP_DUPLICATE_LIB_OK=TRUE make sense?
I will raise this in an internal discussion with the MKL team and get back soon.
"may cause crashes or silently produce incorrect results" in the message stated by Felix above made me not so eager to try this out, rather than using the non-MKL version of MXNet.
Understood, even though we haven't encountered crashes or incorrect results with the workaround. I am working on this item now and hope we can resolve this issue cleanly :)
thanks a lot for looking into this! :)
The numpy from conda includes the mkl package, which conflicts with MXNet.
The good news is that the latest mkl package, starting from mkl-2018.0.3, resolves this issue.
Thus, updating numpy in conda to 1.14 will resolve this issue.
@tdomhan @fhieber please try again in your environment. Feel free to let me know if you encounter any other issues.
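For example, something along these lines should pull in the fixed packages (exact versions and channels may differ in your environment):

```sh
conda install "numpy>=1.14" "mkl>=2018.0.3"
```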
The following packages will be downloaded:

    package            build
    mkl-2018.0.3       1                          198.7 MB

The following NEW packages will be INSTALLED:

    blas:              1.0-mkl
    mkl_fft:           1.0.1-py36h3010b51_0
    mkl_random:        1.0.1-py36h629b387_0
    numpy-base:        1.14.5-py36hdbf6ddf_0

The following packages will be UPDATED:

    ca-certificates:   2017.08.26-h1d4fec5_0      --> 2018.03.07-0
    certifi:           2017.7.27.1-py36h8b7b77e_0 --> 2018.4.16-py36_0
    conda:             4.4.11-py36_0              --> 4.5.4-py36_0
    llvmlite:          0.20.0-py36_0              --> 0.23.2-py36hdbcaa40_0
    mkl:               2018.0.0-hb491cac_4        --> 2018.0.3-1
    numba:             0.35.0-np113py36_10        --> 0.38.1-py36h04863e7_0
    numpy:             1.13.3-py36ha12f23b_0      --> 1.14.5-py36hcd700cb_0
    openssl:           1.0.2l-h077ae2c_5          --> 1.0.2o-h20670df_0
@fhieber @tdomhan could you help confirm the fix? If the problem is solved, we can close this bug finally ;)
I finally got back to this after a while. I no longer observe the libomp-related error mentioned in the original issue, but I am observing process deadlocks with the following numpy/mxnet configuration in Sockeye:
conda list | grep numpy
numpy 1.15.1 py36h6a91979_0
numpy-base 1.15.1 py36h8a80b8c_0
conda list | grep mkl
blas 1.0 mkl
mkl 2019.0 118
mkl_fft 1.0.4 py36h5d10147_1
mkl_random 1.0.1 py36h5d10147_1
mxnet-mkl 1.3.0.post0 <pip>
With MKL-optimized numpy installed via Anaconda (as shown above) and mxnet-mkl==1.3.0.post0 on a Mac laptop, the Sockeye subprocess spawned at a checkpoint (to decode the validation data set) is unable to start and the main process deterministically hangs. When debugging, it appears that the subprocess is never spawned. However, when using either mxnet==1.3.0.post0 (no MKL) or pip-installed numpy (no MKL), everything works just fine.
@fhieber thanks for the feedback. Could you give me a test case to reproduce this issue?
Thanks @pengzhao-intel, here is a minimal example to reproduce the issue. You can run this:
conda install mkl
conda install numpy
pip install mxnet-mkl
git clone https://github.com/awslabs/sockeye.git
cd sockeye
python -m sockeye.train --num-layers 1 -s setup.py -t setup.py -vs setup.py -vt setup.py -o test_model --batch-size 5 --batch-type sentence --num-embed 8 --transformer-model-size 8 --overwrite-output --use-cpu --transformer-attention-heads 1 --checkpoint-frequency 100 --decode-and-evaluate 2
(This will train a tiny model on the setup.py file, but will hang once it reaches 100 updates and spawns a CheckpointDecoder subprocess to decode 2 sentences of the validation data.) It hangs with the last log line being:
[INFO:sockeye.training] Starting process: Decoder-1
If you set --decode-and-evaluate 0, no decoder subprocess will be started at each checkpoint, and training runs fine.
If you run
conda uninstall mkl
conda uninstall numpy
pip install numpy
and run the same training with --decode-and-evaluate > 0, no hanging will occur.
Likewise, if you replace mxnet-mkl with mxnet:
pip uninstall mxnet-mkl
pip install mxnet
and run the same training, no hanging will occur.
@fhieber thanks for the detailed information.
Because this is a new issue, do you mind closing this one and opening a new one? BTW, since the issue was observed in sockeye first, I suggest we create an issue in the sockeye repo first and come back to the MXNet repo when we figure out the specific bug.
@pengzhao-intel this is not a Sockeye issue. I posted a minimal reproducible example in #12710
Possibly related https://github.com/apache/incubator-mxnet/pull/12160
The problem has been fixed with the latest numpy in conda.
Closing, feel free to re-open if there is any other issue.
We have observed crashes with any MKL-enabled pip package of mxnet-0.12.0 in combination with numpy when numpy is installed through conda (which by default also uses MKL).
In this case, MXNet training crashes with the following error message:
Numpy from conda links against the libmkl_rt.so, distributed through conda:
whereas MXNet links to its own .so:
This prevents people from using numpy w/ MKL in combination with mxnet-mkl==0.12.0.
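For reference, one way to see the two different runtimes being pulled in (illustrative commands; module and file names vary across numpy/mxnet versions):

```sh
# Which MKL/OpenMP libraries does conda's numpy resolve to?
ldd "$(python -c 'import numpy.core.multiarray as m; print(m.__file__)')" | grep -iE 'mkl|iomp'

# Which ones does the mxnet-mkl pip package ship and resolve to?
ldd "$(python -c 'import mxnet, os; print(os.path.join(os.path.dirname(mxnet.__file__), "libmxnet.so"))')" | grep -iE 'mkl|iomp'
```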