@szha
@ykim362 what's your recommendation on this? MKL2018 doesn't distribute .a, which means that I can't hide any of its symbols when loading.
@b0noI - how do we resolve the issue on DLAMI?
@szha I am discussing with the MKL team to figure out how to handle this situation. I will get back to you soon.
@fhieber Could you provide your numpy (conda) version?
@szha, Can we link libiomp5.a statically and then strip the symbols from the final libmxnet.so? I am theorizing (I think correctly) that two completely separate versions of openmp can coexist when they apply only to two different modules.
@cjolivier01 how can this work, when the data structures passed to the functions of the library change? This would only work if both versions of openmp guaranteed binary compatibility. Is that the case?
@larroy Not sure what you're trying to say here. If the symbols are stripped from libmxnet.so, then the libnumpy.so (or whatever it's called)'s omp.so and libmxnet.so (with its internal omp pool) won't see each other at all, much less call into each other.
What you'll get is numpy using a separate thread pool from libmxnet.so's, which probably doesn't matter much because they generally won't be running in parallel anyway.
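For illustration, hiding the statically linked OpenMP symbols could look roughly like the following. This is only a sketch under the assumption that a PIC-built libiomp5.a were available (which, per the comments below, Intel does not currently ship for this purpose); object and file names are made up:

```sh
# Link libiomp5.a into libmxnet.so but keep its symbols local, so an OpenMP
# runtime loaded later by numpy cannot resolve against them (GNU ld).
g++ -shared -o libmxnet.so ${MXNET_OBJS} libiomp5.a -Wl,--exclude-libs,libiomp5.a

# Verify that no OpenMP/KMP symbols are exported from the final library.
nm -D --defined-only libmxnet.so | grep -Ei 'omp_|__kmp' || echo "no OpenMP symbols exported"
```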
@cjolivier01 When I discussed this issue with the MKL team, it turned out not to be very feasible to make a static MKLML, given that it has lots of binary dependencies. Another aspect is that there is a possibility that two separate OpenMP runtimes (one from numpy and one from MXNet with MKLML) could degrade performance. We are working on several experiments. I would like to report the results next week.
@cjolivier01 I see. I thought the problem was linking with the same symbols. I found this related to the issue: https://software.intel.com/en-us/inference-engine-devguide-known-issues
@cjolivier01 From the experiments, the behavior varies case by case. Some combinations of versions work fine; some do not. So, the workaround @larroy pointed out (the LD_PRELOAD env variable) might be a good option. Also, it is recommended to use MKL-enabled MXNet with non-MKL python to get the best performance for neural network workloads.
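For anyone trying the LD_PRELOAD workaround mentioned above, it amounts to forcing a single OpenMP runtime to be loaded before anything else. A rough sketch only; the library path and script name below are placeholders:

```sh
# Placeholder path: point this at the libiomp5.so that both numpy and MXNet
# should share, so only one OpenMP runtime ends up in the process.
LD_PRELOAD=/path/to/libiomp5.so python your_training_script.py
```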
@ykim362 our goal is to be able to provide good out-of-the-box experience for MKL-enabled mxnet. Though the workaround might be good for experienced users, we cannot control the environment in which mxnet runs for pip users. Thus, I'm still leaning towards the way @cjolivier01 proposed above if possible.
Do you mind sharing more on the dependencies of mklml to help figure out the feasibility of static-linking?
@szha I totally agree with you regarding the OOB experience! Actually, our concern is that static linking doesn't solve the problem. How about trying the rpath compile option?
@fhieber Could you provide some more detailed information to reproduce this situation? The behavior varies, so I would like to test with exactly the same environment.
@ykim362 could you elaborate? Currently I'm using '${ORIGIN}' in rpath for mkl builds already, and shipping mkl shared objects along with libmxnet.so. I don't think this prevents other omp from being loaded first and hence doesn't solve our problem. Are you referring to a different solution that's based on rpath?
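For context, the '${ORIGIN}' approach mentioned here amounts to something like the following at link time. A rough sketch, not the exact MXNet build flags; the library names follow the MKLML distribution, and in a Makefile the rpath would need to be written as '$$ORIGIN':

```sh
# Link against the bundled MKLML/OpenMP shared objects and make libmxnet.so
# search its own directory first at load time.
g++ -shared -o libmxnet.so ${MXNET_OBJS} -L. -lmklml_intel -liomp5 -Wl,-rpath,'$ORIGIN'

# Inspect the result: RPATH/RUNPATH and the dynamic dependencies.
readelf -d libmxnet.so | grep -E 'RPATH|RUNPATH|NEEDED'
```

As noted above, this only controls where libmxnet.so finds its own copies; it does not stop another libiomp5.so from being loaded first by numpy.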
MKL dev here. Can somebody please verify that the pip packages are not linked to OpenMP (Intel's libiomp5) statically? If any one of them is, then this would be a problem. If every single package links to OpenMP dynamically, this should probably work (disclaimer: this needs to be verified; the libmkl_rt.so on Linux loads libiomp5.so via dlopen and I'm not 100% sure what would happen if the library is loaded multiple times for example using RTLD_LOCAL...)
@rsdubtso Hi. I can verify that the mxnet pip package is linked dynamically to libiomp5.so. We include libiomp5.so in the pip package using the rpath trick described above.
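One way to double-check this on an installed pip package (illustrative commands; the assumed location of libmxnet.so inside the package may differ across versions):

```sh
# Locate the installed libmxnet.so (assumed to live next to the mxnet package).
MXNET_LIB="$(python -c 'import mxnet, os; print(os.path.dirname(mxnet.__file__))')/libmxnet.so"

# Dynamic linkage shows libiomp5.so as a NEEDED entry / in the ldd output;
# statically linked OpenMP would instead show omp_/__kmp symbols defined in
# libmxnet.so itself.
readelf -d "$MXNET_LIB" | grep NEEDED
ldd "$MXNET_LIB" | grep -iE 'iomp|mkl'
```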
Hm. Seems like the numpy package also uses dynamic linking to libiomp5. Do you have a small repro that I could try? (Please do include all the necessary package installation steps -- I'm pretty new to all this stuff...)
PS. I also tried installing mxnet for python2.7 conda and it seems to be missing libmxnet.so completely... should I be using conda for python3?
@fhieber Could you provide the conda version you used or some more detailed environment when you got this problem? mxnet-mkl==0.12.0 and 0.12.1 work fine with conda in my tests.
@ykim362 I just tried this again in a Docker image with mxnet-cu90mkl==1.0.0.post4 and the following conda info output:
platform : linux-64
conda version : 4.3.30
conda is private : False
conda-env version : 4.3.30
conda-build version : not installed
python version : 3.6.3.final.0
requests version : 2.18.4
root environment : /opt/conda (writable)
default environment : /opt/conda
envs directories : /opt/conda/envs
/root/.conda/envs
package cache : /opt/conda/pkgs
/root/.conda/pkgs
channel URLs : https://repo.continuum.io/pkgs/main/linux-64
https://repo.continuum.io/pkgs/main/noarch
https://repo.continuum.io/pkgs/free/linux-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/linux-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/linux-64
https://repo.continuum.io/pkgs/pro/noarch
config file : None
netrc file : None
offline mode : False
user-agent : conda/4.3.30 requests/2.18.4 CPython/3.6.3 Linux/4.9.58-18.51.amzn1.x86_64 / glibc/2.17
UID:GID : 0:0
I still observe the same crash:
> OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
> OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
@fhieber @rsdubtso is correct. It is caused by multiple OpenMP libraries. The workaround is to set KMP_DUPLICATE_LIB_OK=TRUE.
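For completeness, the workaround is just an environment variable, set before the process starts (the script name is a placeholder):

```sh
# Unsafe, unsupported workaround per the OMP hint above: tolerate multiple
# OpenMP runtimes in one process.
export KMP_DUPLICATE_LIB_OK=TRUE
python your_training_script.py
```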
I am considering the final solution now. Will come back soon :)
Error log:
[INFO:sockeye.training] Using bucketing. Default max_seq_len=(60, 60)
[INFO:main] Optimizer: adam
[INFO:main] Optimizer Parameters: {'wd': 0.0, 'learning_rate': 0.0003, 'lr_scheduler': LearningRateSchedulerPlateauReduce(reduce_factor=0.50, reduce_num_not_improved=0), 'clip_gradient': 1.0, 'rescale_grad': 1.0}
[INFO:main] kvstore: device
[INFO:main] Gradient Compression: None
[INFO:main] Decode and Evaluate Device(s): cpu(0)
[INFO:sockeye.model] Saved config to "/home/chenxiny/newstest/wmt_model/config"
OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Aborted
After setting the environment variable, the problem is gone. New log:
[INFO:sockeye.utils] Sockeye version 1.16.2 commit 762ce78e4e49b9ba5d14eb0a48d97f19c8807707 [INFO:sockeye.utils] MXNet version 1.1.0 [INFO:sockeye.utils] Command: /home/chenxiny/anaconda3/lib/python3.6/site-packages/sockeye/train.py -s corpus.tc.BPE.de -t corpus.tc.BPE.en -vs newstest2016.tc.BPE.de -vt newstest2016.tc.BPE.en --num-embed 256 --rnn-num-hidden 512 --rnn-attention-type dot --max-seq-len 60 --decode-and-evaluate 500 -o wmt_model [INFO:sockeye.utils] Arguments: Namespace(allow_missing_params=False, batch_size=64, batch_type='sentence', bucket_width=10, checkpoint_frequency=1000, cnn_activation_type='glu', cnn_hidden_dropout=0.0, cnn_kernel_width=(3, 5), cnn_num_hidden=512, cnn_positional_embedding_type='learned', cnn_project_qkv=False, conv_embed_add_positional_encodings=False, conv_embed_dropout=0.0, conv_embed_max_filter_width=8, conv_embed_num_filters=(200, 200, 250, 250, 300, 300, 300, 300), conv_embed_num_highway_layers=4, conv_embed_output_dim=None, conv_embed_pool_stride=5, decode_and_evaluate=500, decode_and_evaluate_device_id=None, decode_and_evaluate_use_cpu=False, decoder='rnn', device_ids=[-1], disable_device_locking=False, embed_dropout=(0.0, 0.0), embed_weight_init='default', encoder='rnn', fill_up='replicate', gradient_clipping_threshold=1.0, gradient_clipping_type='abs', gradient_compression_threshold=0.5, gradient_compression_type=None, initial_learning_rate=0.0003, keep_last_params=-1, kvstore='device', label_smoothing=0.0, layer_normalization=False, learning_rate_decay_optimizer_states_reset='off', learning_rate_decay_param_reset=False, learning_rate_half_life=10, learning_rate_reduce_factor=0.5, learning_rate_reduce_num_not_improved=3, learning_rate_schedule=None, learning_rate_scheduler_type='plateau-reduce', learning_rate_warmup=0, lock_dir='/tmp', loss='cross-entropy', loss_normalization_type='valid', max_num_checkpoint_not_improved=8, max_num_epochs=None, max_seq_len=(60, 60), max_updates=None, metrics=['perplexity'], min_num_epochs=None, momentum=None, monitor_pattern=None, monitor_stat_func='mx_default', no_bucketing=False, num_embed=(256, 256), num_layers=(1, 1), num_words=(50000, 50000), optimized_metric='perplexity', optimizer='adam', optimizer_params=None, output='wmt_model', overwrite_output=False, params=None, prepared_data=None, quiet=False, rnn_attention_coverage_num_hidden=1, rnn_attention_coverage_type='count', rnn_attention_in_upper_layers=False, rnn_attention_mhdot_heads=None, rnn_attention_num_hidden=None, rnn_attention_type='dot', rnn_attention_use_prev_word=False, rnn_cell_type='lstm', rnn_context_gating=False, rnn_decoder_hidden_dropout=0.0, rnn_decoder_state_init='last', rnn_dropout_inputs=(0.0, 0.0), rnn_dropout_recurrent=(0.0, 0.0), rnn_dropout_states=(0.0, 0.0), rnn_encoder_reverse_input=False, rnn_first_residual_layer=2, rnn_forget_bias=0.0, rnn_h2h_init='orthogonal', rnn_num_hidden=512, rnn_residual_connections=False, seed=13, shared_vocab=False, source='corpus.tc.BPE.de', source_vocab=None, target='corpus.tc.BPE.en', target_vocab=None, transformer_activation_type='relu', transformer_attention_heads=8, transformer_dropout_act=0.0, transformer_dropout_attention=0.0, transformer_dropout_prepost=0.0, transformer_feed_forward_num_hidden=2048, transformer_model_size=512, transformer_positional_embedding_type='fixed', transformer_postprocess=('drn', 'drn'), transformer_preprocess=('', ''), use_cpu=False, use_tensorboard=False, validation_source='newstest2016.tc.BPE.de', validation_target='newstest2016.tc.BPE.en', weight_decay=0.0, 
weight_init='xavier', weight_init_scale=2.34, weight_init_xavier_factor_type='in', weight_init_xavier_rand_type='uniform', weight_normalization=False, weight_tying=False, weight_tying_type='trg_softmax', word_min_count=(1, 1)) [INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [-1] [INFO:sockeye.utils] Acquired GPU 0. [INFO:main] Training Device(s): GPU [0] [INFO:sockeye.vocab] Building vocabulary from dataset(s): ['corpus.tc.BPE.de'] [INFO:sockeye.vocab] Vocabulary: types: 22309/22309/22309/22313 (initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000] [INFO:sockeye.vocab] Building vocabulary from dataset(s): ['corpus.tc.BPE.en'] [INFO:sockeye.vocab] Vocabulary: types: 16757/16757/16757/16761 (initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000] [INFO:sockeye.data_io] =============================== [INFO:sockeye.data_io] Creating training data iterator [INFO:sockeye.data_io] =============================== [INFO:sockeye.data_io] 193369 sequences of maximum length (60, 60) in '/home/chenxiny/newstest/corpus.tc.BPE.de' and '/home/chenxiny/newstest/corpus.tc.BPE.en'. [INFO:sockeye.data_io] Mean training target/source length ratio: 0.99 (+-0.45) [INFO:sockeye.data_io] Tokens: source 4563838 target 4466824 [INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100% [INFO:sockeye.data_io] 193369 sequences across 6 buckets [INFO:sockeye.data_io] 6631 sequences did not fit into buckets and were discarded [INFO:sockeye.data_io] Bucket (10, 10): 32318 samples in 505 batches of 64, ~362.2 tokens/batch. [INFO:sockeye.data_io] Bucket (20, 20): 46870 samples in 733 batches of 64, ~949.5 tokens/batch. [INFO:sockeye.data_io] Bucket (30, 30): 49430 samples in 773 batches of 64, ~1511.9 tokens/batch. [INFO:sockeye.data_io] Bucket (40, 40): 35885 samples in 561 batches of 64, ~2070.4 tokens/batch. [INFO:sockeye.data_io] Bucket (50, 50): 19636 samples in 307 batches of 64, ~2618.3 tokens/batch.
@pengzhao-intel any update?
Unless Intel provides a static library for libiomp5 built with the -fPIC flag, or drops the dependency on it, I don't see anything we can do from the MXNet side. Suggestions are welcome.
@szha, @ashokei is talking with the MKL team about using libgomp in MXNet so it can avoid the dependency on libiomp.
@ashokei any update? Please correct me if my understanding is not correct.
@pengzhao-intel thanks for the update. @ashokei let us know how things progress and how this change impacts performance.
@zheng-da is your openmp linking issue related to this in any way? Does your solution/workaround fix this?
No, I don't have a workaround so far.
any updates on this?
@tdomhan after several investigations, I realized it's not easy to resolve this issue under the current setup. We have to change some of the build logic.
Do you still encounter this error? Does the workaround KMP_DUPLICATE_LIB_OK=TRUE make sense?
I will raise this in an internal discussion with the MKL team and get back soon.
"may cause crashes or silently produce incorrect results" in the message stated by Felix above made me not so eager to try this out, rather than using the non-MKL version of MXNet.
Understood, even though we haven't encountered crashes or incorrect results with the workaround. I am working on this item now and hope we can resolve this issue cleanly :)
thanks a lot for looking into this! :)
The numpy from conda includes the mkl package, which conflicts with MXNet.
The good news is that the latest mkl package, starting from mkl-2018.0.3, resolves this issue.
Thus, updating numpy in conda to 1.14 will resolve this issue.
@tdomhan @fhieber please try again in your environment. Feel free to let me know if you encounter any other issues.
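For example, something along these lines should pull in the fixed packages (exact versions and channels may differ in your environment):

```sh
conda install "numpy>=1.14" "mkl>=2018.0.3"
```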
The following packages will be downloaded:

    package            build
    mkl-2018.0.3       1                          198.7 MB

The following NEW packages will be INSTALLED:

    blas:              1.0-mkl
    mkl_fft:           1.0.1-py36h3010b51_0
    mkl_random:        1.0.1-py36h629b387_0
    numpy-base:        1.14.5-py36hdbf6ddf_0

The following packages will be UPDATED:

    ca-certificates:   2017.08.26-h1d4fec5_0      --> 2018.03.07-0
    certifi:           2017.7.27.1-py36h8b7b77e_0 --> 2018.4.16-py36_0
    conda:             4.4.11-py36_0              --> 4.5.4-py36_0
    llvmlite:          0.20.0-py36_0              --> 0.23.2-py36hdbcaa40_0
    mkl:               2018.0.0-hb491cac_4        --> 2018.0.3-1
    numba:             0.35.0-np113py36_10        --> 0.38.1-py36h04863e7_0
    numpy:             1.13.3-py36ha12f23b_0      --> 1.14.5-py36hcd700cb_0
    openssl:           1.0.2l-h077ae2c_5          --> 1.0.2o-h20670df_0
@fhieber @tdomhan could you help confirm the fix? If the problem is solved, we can close this bug finally ;)
I finally got back to this after a while. I no longer observe the libomp-related error mentioned in the original issue, but I am observing process deadlocks with the following numpy/mxnet configuration in Sockeye:
conda list | grep numpy
numpy 1.15.1 py36h6a91979_0
numpy-base 1.15.1 py36h8a80b8c_0
conda list | grep mkl
blas 1.0 mkl
mkl 2019.0 118
mkl_fft 1.0.4 py36h5d10147_1
mkl_random 1.0.1 py36h5d10147_1
mxnet-mkl 1.3.0.post0 <pip>
With MKL-optimized numpy installed via Anaconda (as shown above) and mxnet-mkl==1.3.0.post0 on a Mac laptop, the Sockeye subprocess spawned at a checkpoint (to decode the validation data set) is unable to start and the main process deterministically hangs. When debugging, it appears that the subprocess is never spawned. However, when using either mxnet==1.3.0.post0 (no MKL) or pip-installed numpy (no MKL), everything works just fine.
@fhieber thanks for the feedback. Could you give me a test case to reproduce this issue?
Thanks @pengzhao-intel, here is a minimal example to reproduce the issue. You can run this:
conda install mkl
conda install numpy
pip install mxnet-mkl
git clone https://github.com/awslabs/sockeye.git
cd sockeye
python -m sockeye.train --num-layers 1 -s setup.py -t setup.py -vs setup.py -vt setup.py -o test_model --batch-size 5 --batch-type sentence --num-embed 8 --transformer-model-size 8 --overwrite-output --use-cpu --transformer-attention-heads 1 --checkpoint-frequency 100 --decode-and-evaluate 2
(This will train a tiny model on the setup.py file, but will hang once it reaches 100 updates and spawns a CheckpointDecoder subprocess to decode 2 sentences of the validation data.) It hangs with the last log line being:
[INFO:sockeye.training] Starting process: Decoder-1
If you set --decode-and-evaluate 0, no decoder subprocess will be started at each checkpoint, and training runs fine.
If you run
conda uninstall mkl
conda uninstall numpy
pip install numpy
and run the same training with --decode-and-evaluate > 0, no hanging will occur.
Likewise, if you replace mxnet-mkl with mxnet:
pip uninstall mxnet-mkl
pip install mxnet
and run the same training, no hanging will occur.
@fhieber thanks for the detailed information.
Because this is a new issue, do you mind closing this one and opening a new one? BTW, since the issue was observed in sockeye first, I suggest we create an issue in the sockeye repo first and come back to the MXNet repo when we figure out the specific bug.
@pengzhao-intel this is not a Sockeye issue. I posted a minimal reproducible example in #12710
Possibly related https://github.com/apache/incubator-mxnet/pull/12160
The problem has been fixed with the latest numpy in conda.
Closing, feel free to re-open if there is any other issue.
We have observed crashes with any MKL-enabled pip package of mxnet-0.12.0 in combination with numpy when numpy is installed through conda (which by default also uses MKL).
In this case, MXNet training crashes with the following error message:
Numpy from conda links against the libmkl_rt.so, distributed through conda:
whereas MXNet links to its own .so:
This prevents people from using numpy w/ MKL in combination with mxnet-mkl==0.12.0.
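For reference, one way to see the two different runtimes being pulled in (illustrative commands; module and file names vary across numpy/mxnet versions):

```sh
# Which MKL/OpenMP libraries does conda's numpy resolve to?
ldd "$(python -c 'import numpy.core.multiarray as m; print(m.__file__)')" | grep -iE 'mkl|iomp'

# Which ones does the mxnet-mkl pip package ship and resolve to?
ldd "$(python -c 'import mxnet, os; print(os.path.join(os.path.dirname(mxnet.__file__), "libmxnet.so"))')" | grep -iE 'mkl|iomp'
```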