apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[Sockeye] Cannot acquire GPU 0 #20469

Open ekdnam opened 3 years ago

ekdnam commented 3 years ago

Description

(Note: original issue filed on Sockeye.) I am currently following this tutorial on Zero-Shot Translation; the accompanying notebook (on Google Colab) can be viewed here.

In the training step, Sockeye is unable to acquire a GPU for some reason.

Error Message

The following output repeats indefinitely:

[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [0]
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 62s.
[INFO:sockeye.utils] Releasing GPU 0.
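
For context: these messages come from sockeye.utils.acquire_gpus, which coordinates exclusive GPU ownership across processes through lock files in lock_dir (/tmp here, per the Arguments line in the full log below). The sketch below is a hypothetical simplification of that pattern, not Sockeye's actual code; it shows why a lock that is never released (for example, one left behind by an earlier interrupted run in the same session) makes every later acquisition spin with randomized waits exactly like this.

```python
import fcntl
import os
import random
import time

def try_acquire_gpu(gpu_id: int, lock_dir: str = "/tmp"):
    """Attempt a non-blocking exclusive lock on a per-GPU lock file.
    Returns the open file on success (keeping it open holds the lock),
    or None if some other holder already has it.
    The file name here is made up for illustration."""
    f = open(os.path.join(lock_dir, "sockeye.gpu{}.lock".format(gpu_id)), "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None  # "Could not acquire GPU 0. It's currently locked."

def acquire_gpu_with_retry(gpu_id: int):
    """Retry with a randomized back-off until the lock frees up -- this
    loop corresponds to the repeating 'will try again in Ns' lines."""
    while True:
        lock = try_acquire_gpu(gpu_id)
        if lock is not None:
            return lock
        wait = random.randint(10, 70)
        print("Not enough GPUs available will try again in {}s.".format(wait))
        time.sleep(wait)
```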

The full output is below (I had to interrupt the kernel to stop it):

[INFO:sockeye.utils] Sockeye version 2.3.17, commit ef908e3c5751ef072b2554f327f8081e935d9731, path /usr/local/lib/python3.7/dist-packages/sockeye/__init__.py
[INFO:sockeye.utils] MXNet version 1.8.0, path /usr/local/lib/python3.7/dist-packages/mxnet/__init__.py
[INFO:sockeye.utils] Command: /usr/local/lib/python3.7/dist-packages/sockeye/train.py -d train_data -vs data/valid.tag.src -vt data/valid.tag.trg --shared-vocab --weight-tying-type src_trg_softmax --device-ids 0 --decode-and-evaluate-device-id 0 -o iwslt_model --max-num-epochs 50
[INFO:sockeye.utils] Arguments: Namespace(allow_missing_params=False, amp=False, amp_scale_interval=2000, batch_sentences_multiple_of=8, batch_size=4096, batch_type='word', bucket_scaling=False, bucket_width=8, cache_last_best_params=0, cache_metric='perplexity', cache_strategy='best', checkpoint_improvement_threshold=0.0, checkpoint_interval=4000, config=None, decode_and_evaluate=500, decode_and_evaluate_device_id=0, decoder='transformer', device_ids=[0], disable_device_locking=False, dry_run=False, dtype='float32', embed_dropout=(0.0, 0.0), encoder='transformer', env=None, fixed_param_names=[], fixed_param_strategy=None, gradient_clipping_threshold=1.0, gradient_clipping_type='none', horovod=False, ignore_extra_params=False, initial_learning_rate=0.0002, keep_initializations=False, keep_last_params=-1, kvstore='device', label_smoothing=0.1, learning_rate_reduce_factor=0.9, learning_rate_reduce_num_not_improved=8, learning_rate_scheduler_type='plateau-reduce', learning_rate_t_scale=1.0, learning_rate_warmup=0, length_task=None, length_task_layers=1, length_task_weight=1.0, lhuc=None, lock_dir='/tmp', loglevel='INFO', loglevel_secondary_workers='INFO', loss='cross-entropy-without-softmax-output', max_checkpoints=None, max_num_checkpoint_not_improved=None, max_num_epochs=50, max_samples=None, max_seconds=None, max_seq_len=(95, 95), max_updates=None, min_num_epochs=None, min_samples=None, min_updates=None, momentum=None, monitor_pattern=None, monitor_stat_func='mx_default', no_bucket_scaling=None, no_bucketing=False, no_hybridization=False, no_logfile=False, num_embed=(None, None), num_layers=(6, 6), num_words=(0, 0), omp_num_threads=None, optimized_metric='perplexity', optimizer='adam', optimizer_params=None, output='iwslt_model', overwrite_output=False, pad_vocab_to_multiple_of=None, params=None, prepared_data='train_data', quiet=False, quiet_secondary_workers=False, round_batch_sizes_to_multiple_of=None, seed=1, shared_vocab=True, source=None, source_factor_vocabs=[], source_factors=[], source_factors_combine=[], source_factors_num_embed=[], source_factors_share_embedding=[], source_factors_use_source_vocab=[], source_vocab=None, stop_training_on_decoder_failure=False, target=None, target_factor_vocabs=[], target_factors=[], target_factors_combine=[], target_factors_num_embed=[], target_factors_share_embedding=[], target_factors_use_target_vocab=[], target_factors_weight=[1.0], target_vocab=None, transformer_activation_type=('relu', 'relu'), transformer_attention_heads=(8, 8), transformer_dropout_act=(0.1, 0.1), transformer_dropout_attention=(0.1, 0.1), transformer_dropout_prepost=(0.1, 0.1), transformer_feed_forward_num_hidden=(2048, 2048), transformer_feed_forward_use_glu=False, transformer_model_size=(512, 512), transformer_positional_embedding_type='fixed', transformer_postprocess=('dr', 'dr'), transformer_preprocess=('n', 'n'), update_interval=1, use_cpu=False, validation_source='data/valid.tag.src', validation_source_factors=[], validation_target='data/valid.tag.trg', validation_target_factors=[], weight_decay=0.0, weight_init='xavier', weight_init_scale=3.0, weight_init_xavier_factor_type='avg', weight_init_xavier_rand_type='uniform', weight_tying_type='src_trg_softmax', word_min_count=(1, 1))
[INFO:__main__] Adjusting maximum length to reserve space for a BOS/EOS marker. New maximum length: (96, 96)
[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [0]
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Acquired GPU 0.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:__main__] Training Device(s): gpu(0)
[INFO:sockeye.utils] Random seed: 1
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.data_io] Creating training data iterator
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.vocab] Vocabulary (29226 words) loaded from "train_data/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary (29226 words) loaded from "train_data/vocab.trg.0.json"
[INFO:sockeye.data_io] Tokens: source 10662257 target 10504428
[INFO:sockeye.data_io] Number of <unk> tokens: source 0 target 0
[INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
[INFO:sockeye.data_io] 442550 sequences across 12 buckets
[INFO:sockeye.data_io] 2395 sequences did not fit into buckets and were discarded
[INFO:sockeye.data_io] Bucket (8, 8): 19596 samples in 33 batches of 600, ~4073.8 target tokens/batch, trg/src length ratio: 1.06 (+-0.17)
[INFO:sockeye.data_io] Bucket (16, 16): 130903 samples in 381 batches of 344, ~4140.7 target tokens/batch, trg/src length ratio: 1.03 (+-0.19)
[INFO:sockeye.data_io] Bucket (24, 24): 112306 samples in 520 batches of 216, ~4083.3 target tokens/batch, trg/src length ratio: 1.01 (+-0.19)
[INFO:sockeye.data_io] Bucket (32, 32): 71702 samples in 472 batches of 152, ~4011.4 target tokens/batch, trg/src length ratio: 1.00 (+-0.17)
[INFO:sockeye.data_io] Bucket (40, 40): 44421 samples in 371 batches of 120, ~4060.2 target tokens/batch, trg/src length ratio: 0.99 (+-0.16)
[INFO:sockeye.data_io] Bucket (48, 48): 25989 samples in 271 batches of 96, ~3964.7 target tokens/batch, trg/src length ratio: 0.99 (+-0.15)
[INFO:sockeye.data_io] Bucket (56, 56): 15214 samples in 173 batches of 88, ~4277.7 target tokens/batch, trg/src length ratio: 0.98 (+-0.15)
[INFO:sockeye.data_io] Bucket (64, 64): 8995 samples in 125 batches of 72, ~4035.0 target tokens/batch, trg/src length ratio: 0.97 (+-0.14)
[INFO:sockeye.data_io] Bucket (72, 72): 5432 samples in 85 batches of 64, ~4055.4 target tokens/batch, trg/src length ratio: 0.97 (+-0.14)
[INFO:sockeye.data_io] Bucket (80, 80): 3432 samples in 62 batches of 56, ~3954.0 target tokens/batch, trg/src length ratio: 0.96 (+-0.14)
[INFO:sockeye.data_io] Bucket (88, 88): 2552 samples in 46 batches of 56, ~4399.1 target tokens/batch, trg/src length ratio: 0.97 (+-0.14)
[INFO:sockeye.data_io] Bucket (96, 96): 2008 samples in 32 batches of 64, ~5529.5 target tokens/batch, trg/src length ratio: 0.98 (+-0.13)
[INFO:sockeye.data_io] Loading shard train_data/shard.00000.
[INFO:sockeye.data_io] =================================
[INFO:sockeye.data_io] Creating validation data iterator
[INFO:sockeye.data_io] =================================
[INFO:sockeye.data_io] 1802 sequences of maximum length (96, 96) in '/content/data/valid.tag.src' and '/content/data/valid.tag.trg'.
[INFO:sockeye.data_io] Mean training target/source length ratio: 1.02 (+-0.18)
[INFO:sockeye.data_io] Tokens: source 46171 target 45973
[INFO:sockeye.data_io] Number of <unk> tokens: source 2 target 2
[INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
[INFO:sockeye.data_io] 1802 sequences across 12 buckets
[INFO:sockeye.data_io] 15 sequences did not fit into buckets and were discarded
[INFO:sockeye.data_io] Bucket (8, 8): 61 samples in 1 batches of 600, ~4073.8 target tokens/batch, trg/src length ratio: 1.10 (+-0.19)
[INFO:sockeye.data_io] Bucket (16, 16): 448 samples in 2 batches of 344, ~4140.7 target tokens/batch, trg/src length ratio: 1.02 (+-0.19)
[INFO:sockeye.data_io] Bucket (24, 24): 451 samples in 3 batches of 216, ~4083.3 target tokens/batch, trg/src length ratio: 1.01 (+-0.19)
[INFO:sockeye.data_io] Bucket (32, 32): 359 samples in 3 batches of 152, ~4011.4 target tokens/batch, trg/src length ratio: 1.01 (+-0.17)
[INFO:sockeye.data_io] Bucket (40, 40): 178 samples in 2 batches of 120, ~4060.2 target tokens/batch, trg/src length ratio: 1.02 (+-0.19)
[INFO:sockeye.data_io] Bucket (48, 48): 124 samples in 2 batches of 96, ~3964.7 target tokens/batch, trg/src length ratio: 1.04 (+-0.17)
[INFO:sockeye.data_io] Bucket (56, 56): 66 samples in 1 batches of 88, ~4277.7 target tokens/batch, trg/src length ratio: 0.99 (+-0.16)
[INFO:sockeye.data_io] Bucket (64, 64): 46 samples in 1 batches of 72, ~4035.0 target tokens/batch, trg/src length ratio: 0.98 (+-0.14)
[INFO:sockeye.data_io] Bucket (72, 72): 23 samples in 1 batches of 64, ~4055.4 target tokens/batch, trg/src length ratio: 0.99 (+-0.11)
[INFO:sockeye.data_io] Bucket (80, 80): 25 samples in 1 batches of 56, ~3954.0 target tokens/batch, trg/src length ratio: 0.99 (+-0.12)
[INFO:sockeye.data_io] Bucket (88, 88): 11 samples in 1 batches of 56, ~4399.1 target tokens/batch, trg/src length ratio: 0.91 (+-0.08)
[INFO:sockeye.data_io] Bucket (96, 96): 10 samples in 1 batches of 64, ~5529.5 target tokens/batch, trg/src length ratio: 1.02 (+-0.10)
[INFO:sockeye.data_io] Created bucketed parallel data set. Introduced padding: source=16.5% target=16.8%)
[INFO:sockeye.vocab] Vocabulary saved to "/content/iwslt_model/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary saved to "/content/iwslt_model/vocab.trg.0.json"
[INFO:__main__] Vocabulary sizes: source=[29226] target=[29226]
[INFO:__main__] Source embedding size was not set it will automatically be adjusted to match the Transformer source model size (512).
[INFO:__main__] Target embedding size was not set it will automatically be adjusted to match the Transformer target model size (512).
[INFO:sockeye.model] ModelConfig(config_data=DataConfig(data_statistics=DataStatistics(num_sents=442550, num_discarded=2395, num_tokens_source=10662257, num_tokens_target=10504428, num_unks_source=0, num_unks_target=0, max_observed_len_source=96, max_observed_len_target=96, size_vocab_source=29226, size_vocab_target=29226, length_ratio_mean=1.0102479170496759, length_ratio_std=0.1793994964134372, buckets=[(8, 8), (16, 16), (24, 24), (32, 32), (40, 40), (48, 48), (56, 56), (64, 64), (72, 72), (80, 80), (88, 88), (96, 96)], num_sents_per_bucket=[19596, 130903, 112306, 71702, 44421, 25989, 15214, 8995, 5432, 3432, 2552, 2008], average_len_target_per_bucket=[6.7895999183506675, 12.036798239918186, 18.90396773102061, 26.39061671919884, 33.83516805114739, 41.298857208818845, 48.609767319573955, 56.0414674819342, 63.36616347569953, 70.6069347319347, 78.55446708463948, 86.39840637450203], length_ratio_stats_per_bucket=[(1.0629691798131742, 0.17423598623300557), (1.031108907820037, 0.1930931812097123), (1.0103493851348708, 0.1895598677461141), (0.9976084216183412, 0.1697442200761039), (0.9910226967134623, 0.15734516229462056), (0.985247167251298, 0.1495865401607151), (0.9798646721353192, 0.1515138948656466), (0.9724385239790109, 0.13814620473774203), (0.9676546696165079, 0.1372053441005369), (0.963085252277705, 0.14170991805183172), (0.9701350918651122, 0.14106125840309341), (0.9767222098806323, 0.13310340059289394)]), max_seq_len_source=96, max_seq_len_target=96, num_source_factors=1, num_target_factors=1), vocab_source_size=29226, vocab_target_size=29226, config_embed_source=EmbeddingConfig(vocab_size=29226, num_embed=512, dropout=0.0, num_factors=1, factor_configs=None, allow_sparse_grad=True), config_embed_target=EmbeddingConfig(vocab_size=29226, num_embed=512, dropout=0.0, num_factors=1, factor_configs=None, allow_sparse_grad=True), config_encoder=TransformerConfig(model_size=512, attention_heads=8, feed_forward_num_hidden=2048, act_type='relu', num_layers=6, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=96, max_seq_len_target=96, decoder_type='transformer', use_lhuc=False, depth_key_value=0, use_glu=False), config_decoder=TransformerConfig(model_size=512, attention_heads=8, feed_forward_num_hidden=2048, act_type='relu', num_layers=6, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=96, max_seq_len_target=96, decoder_type='transformer', use_lhuc=False, depth_key_value=512, use_glu=False), config_length_task=None, weight_tying_type='src_trg_softmax', lhuc=False, dtype='float32', intgemm_custom_lib='/usr/local/lib/python3.7/dist-packages/sockeye/libintgemm.so')
[INFO:sockeye.lr_scheduler] Will reduce the learning rate by a factor of 0.90 whenever the validation score doesn't improve 8 times.
[INFO:__main__] Optimizer: adam | kvstore=device | params={'wd': 0.0, 'learning_rate': 0.0002, 'rescale_grad': 1.0, 'lr_scheduler': LearningRateSchedulerPlateauReduce(reduce_factor=0.90, reduce_num_not_improved=8, num_not_improved=0, base_lr=None, lr=None, warmup=0, warmed_up=True)} | initializer=<mxnet.initializer.Xavier object at 0x7f7d08674cd0>
[INFO:__main__] Gradient accumulation over 1 batch(es) by 1 worker(s). Effective batch size: 4096
[INFO:sockeye.utils] # of parameters: 59194922 | trainable: 59096618 (99.83%) | fixed: 98304 (0.17%)
[INFO:sockeye.utils] Trainable parameters: 
['decoder_transformer_0_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_0_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_0_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_0_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_0_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_0_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_0_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_0_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_0_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_0_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_0_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_0_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_0_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_0_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_0_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_1_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_1_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_1_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_1_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_1_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_1_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_1_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_1_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_1_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_1_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_1_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_1_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_1_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_1_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_1_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_2_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_2_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_2_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_2_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_2_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_2_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_2_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_2_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_2_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_2_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_2_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_2_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_2_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_2_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_2_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_3_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_3_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_3_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_3_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_3_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_3_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_3_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_3_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_3_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_3_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_3_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_3_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_3_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_3_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_3_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_4_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_4_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_4_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_4_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_4_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_4_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_4_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_4_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_4_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_4_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_4_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_4_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_4_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_4_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_4_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_5_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_5_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_5_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_5_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_5_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_5_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_5_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_5_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_5_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_5_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_5_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_5_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_5_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_5_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_5_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_final_process_norm_beta [(512,), float32]',
 'decoder_transformer_final_process_norm_gamma [(512,), float32]',
 'encoder_transformer_0_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_0_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_0_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_0_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_0_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_0_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_0_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_0_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_0_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_0_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_1_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_1_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_1_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_1_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_1_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_1_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_1_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_1_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_1_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_1_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_2_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_2_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_2_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_2_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_2_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_2_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_2_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_2_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_2_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_2_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_3_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_3_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_3_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_3_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_3_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_3_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_3_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_3_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_3_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_3_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_4_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_4_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_4_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_4_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_4_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_4_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_4_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_4_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_4_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_4_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_5_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_5_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_5_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_5_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_5_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_5_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_5_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_5_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_5_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_5_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_final_process_norm_beta [(512,), float32]',
 'encoder_transformer_final_process_norm_gamma [(512,), float32]',
 'source_target_embed_weight [(29226, 512), float32]',
 'target_output_bias [(29226,), float32]']
[INFO:sockeye.utils] Fixed parameters:
['decoder_transformer_target_pos_embed_weight [(96, 512), float32]',
 'encoder_transformer_source_pos_embed_weight [(96, 512), float32]']
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
[INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: perplexity (ppl) | output_name: 'logits' | label_name: 'target_label'
[INFO:sockeye.training] Logging training events for Tensorboard at 'iwslt_model/tensorboard'
[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [0]
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 62s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 37s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 17s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 69s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 13s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 32s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 51s.
[INFO:sockeye.utils] Releasing GPU 0.
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 1149, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 906, in main
    train(args)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 1142, in train
    training_model, source_vocabs, target_vocabs, hybridize=hybridize)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 225, in create_checkpoint_decoder
    exit_stack=exit_stack)[0]
  File "/usr/local/lib/python3.7/dist-packages/sockeye/utils.py", line 335, in determine_context
    context = exit_stack.enter_context(acquire_gpus(device_ids, lock_dir=lock_dir))
  File "/usr/lib/python3.7/contextlib.py", line 427, in enter_context
    result = _cm_type.__enter__(cm)
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/utils.py", line 470, in acquire_gpus
    time.sleep(retry_wait_actual)
KeyboardInterrupt
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 587, in _notify_shutdown
    check_call(_LIB.MXNotifyShutdown())
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../src/common/random_generator.cu", line 58
Name: Check failed: err == cudaSuccess (209 vs. 0) : rand_generator_seed_kernel ErrStr:no kernel image is available for execution on the device
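
As an aside, the C++ part of that MXNetError is truncated by default. MXNet reads DMLC_LOG_STACK_TRACE_DEPTH from the process environment (the issue template suggests a value of 100), so it has to be set before the failing code runs; a minimal sketch:

```python
import os

# Must be in the environment before MXNet formats the error, so set it
# before anything imports mxnet (or export it in the shell instead).
os.environ["DMLC_LOG_STACK_TRACE_DEPTH"] = "100"

import mxnet as mx  # noqa: E402 -- deliberately imported after the env setup
```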

To Reproduce

Execute the Colab notebook, which can be viewed here.
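
Independent of the notebook, the shutdown error above (cudaSuccess (209 vs. 0), i.e. "no kernel image is available for execution on the device") typically means the installed MXNet binary was not compiled for this GPU's compute capability. If that is what is happening here, a tiny GPU op should fail the same way without Sockeye involved; a minimal sketch, assuming a CUDA-enabled mxnet wheel:

```python
import mxnet as mx

print("GPUs visible to MXNet:", mx.context.num_gpus())

# Any small operation on gpu(0) forces a kernel launch; on a wheel that
# lacks a kernel image for this device it should raise the same error.
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
mx.nd.waitall()  # block until the asynchronous op completes (or raises)
print("GPU op succeeded:", float(x.sum().asscalar()))
```

If this snippet fails the same way, the fix is likely a different mxnet-cu* build matching the runtime's GPU rather than anything Sockeye-specific.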

Environment

Diagnostic information was collected with the recommended script:

curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3

Environment Information

----------Python Info----------
Version : 3.7.11
Compiler : GCC 7.5.0
Build : ('default', 'Jul 3 2021 18:01:19')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 21.1.3
Directory : /usr/local/lib/python3.7/dist-packages/pip
----------MXNet Info-----------
Version : 1.8.0
Directory : /usr/local/lib/python3.7/dist-packages/mxnet
Commit hash file "/usr/local/lib/python3.7/dist-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library : ['/usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✖ CPU_SSE4_1
✖ CPU_SSE4_2
✖ CPU_SSE4A
✖ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✖ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform : Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
system : Linux
node : aa12ac7e34fe
release : 5.4.104+
version : #1 SMP Sat Jun 5 09:50:34 PDT 2021
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping: 0
CPU MHz: 2299.998
BogoMIPS: 4599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0022 sec, LOAD: 0.4217 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0237 sec, LOAD: 0.0693 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, , DNS finished in 0.029986143112182617 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0177 sec, LOAD: 0.6124 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0109 sec, LOAD: 0.0909 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.01166844367980957 sec.
----------Environment----------
KMP_DUPLICATE_LIB_OK="True"
KMP_INIT_AT_FORK="FALSE"

The files mentioned in the notebook (taken from the aforementioned tutorial) can be viewed here.

  1. installation.sh
  2. download_and_move.sh
  3. preprocess.sh
  4. prepare_data.sh
  5. train.sh

ekdnam commented 3 years ago

cc @tdomhan

TristonC commented 3 years ago

@blchu Maybe you can help take a look too.

lusalini commented 3 years ago

Hey there! I have the same problem. Strangely, it worked three months ago on the same Docker image, and now I'm getting this error. I have mxnet-cu101==1.7.0 installed and am running Sockeye inside an Ubuntu Docker container. I tried reinstalling the dependencies, rebuilding the image, and rebooting the PC, but nothing seems to work. Is there a solution for this issue?