awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Cannot acquire GPU 0 #955

Closed · ekdnam closed this issue 3 years ago

ekdnam commented 3 years ago

I am currently following this tutorial on Zero-Shot Translation; the notebook (on Google Colab) can be viewed here

During the training step, Sockeye is unable to acquire the GPU:

[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [0]
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 62s.
[INFO:sockeye.utils] Releasing GPU 0.
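These messages come from Sockeye's device-locking mechanism: a process claims a GPU by taking an exclusive lock on a per-device lock file under --lock-dir (here /tmp, per the argument dump below) and retries while another process holds it. A minimal sketch of that file-lock pattern follows; the lock-file name and details are assumptions for illustration, not Sockeye's actual implementation:

import os
import time
from contextlib import contextmanager

@contextmanager
def acquire_gpu(device_id, lock_dir="/tmp", retry_wait=60):
    # Hypothetical lock-file name; Sockeye's naming scheme may differ.
    lock_path = os.path.join(lock_dir, "sockeye.gpu{}.lock".format(device_id))
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, i.e. if
            # another process currently holds this GPU.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            break  # "Acquired GPU <id>."
        except FileExistsError:
            # "Could not acquire GPU <id>. It's currently locked."
            time.sleep(retry_wait)
    try:
        yield device_id
    finally:
        os.remove(lock_path)  # "Releasing GPU <id>."

In the full log below, training acquires the GPU 0 lock and never releases it, so the checkpoint decoder (started because of --decode-and-evaluate-device-id 0; see the traceback at the end) appears to wait forever on the same device.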

The entire output is shown below (I have to interrupt the kernel execution):

[INFO:sockeye.utils] Sockeye version 2.3.17, commit ef908e3c5751ef072b2554f327f8081e935d9731, path /usr/local/lib/python3.7/dist-packages/sockeye/__init__.py
[INFO:sockeye.utils] MXNet version 1.8.0, path /usr/local/lib/python3.7/dist-packages/mxnet/__init__.py
[INFO:sockeye.utils] Command: /usr/local/lib/python3.7/dist-packages/sockeye/train.py -d train_data -vs data/valid.tag.src -vt data/valid.tag.trg --shared-vocab --weight-tying-type src_trg_softmax --device-ids 0 --decode-and-evaluate-device-id 0 -o iwslt_model --max-num-epochs 50
[INFO:sockeye.utils] Arguments: Namespace(allow_missing_params=False, amp=False, amp_scale_interval=2000, batch_sentences_multiple_of=8, batch_size=4096, batch_type='word', bucket_scaling=False, bucket_width=8, cache_last_best_params=0, cache_metric='perplexity', cache_strategy='best', checkpoint_improvement_threshold=0.0, checkpoint_interval=4000, config=None, decode_and_evaluate=500, decode_and_evaluate_device_id=0, decoder='transformer', device_ids=[0], disable_device_locking=False, dry_run=False, dtype='float32', embed_dropout=(0.0, 0.0), encoder='transformer', env=None, fixed_param_names=[], fixed_param_strategy=None, gradient_clipping_threshold=1.0, gradient_clipping_type='none', horovod=False, ignore_extra_params=False, initial_learning_rate=0.0002, keep_initializations=False, keep_last_params=-1, kvstore='device', label_smoothing=0.1, learning_rate_reduce_factor=0.9, learning_rate_reduce_num_not_improved=8, learning_rate_scheduler_type='plateau-reduce', learning_rate_t_scale=1.0, learning_rate_warmup=0, length_task=None, length_task_layers=1, length_task_weight=1.0, lhuc=None, lock_dir='/tmp', loglevel='INFO', loglevel_secondary_workers='INFO', loss='cross-entropy-without-softmax-output', max_checkpoints=None, max_num_checkpoint_not_improved=None, max_num_epochs=50, max_samples=None, max_seconds=None, max_seq_len=(95, 95), max_updates=None, min_num_epochs=None, min_samples=None, min_updates=None, momentum=None, monitor_pattern=None, monitor_stat_func='mx_default', no_bucket_scaling=None, no_bucketing=False, no_hybridization=False, no_logfile=False, num_embed=(None, None), num_layers=(6, 6), num_words=(0, 0), omp_num_threads=None, optimized_metric='perplexity', optimizer='adam', optimizer_params=None, output='iwslt_model', overwrite_output=False, pad_vocab_to_multiple_of=None, params=None, prepared_data='train_data', quiet=False, quiet_secondary_workers=False, round_batch_sizes_to_multiple_of=None, seed=1, shared_vocab=True, source=None, source_factor_vocabs=[], source_factors=[], source_factors_combine=[], source_factors_num_embed=[], source_factors_share_embedding=[], source_factors_use_source_vocab=[], source_vocab=None, stop_training_on_decoder_failure=False, target=None, target_factor_vocabs=[], target_factors=[], target_factors_combine=[], target_factors_num_embed=[], target_factors_share_embedding=[], target_factors_use_target_vocab=[], target_factors_weight=[1.0], target_vocab=None, transformer_activation_type=('relu', 'relu'), transformer_attention_heads=(8, 8), transformer_dropout_act=(0.1, 0.1), transformer_dropout_attention=(0.1, 0.1), transformer_dropout_prepost=(0.1, 0.1), transformer_feed_forward_num_hidden=(2048, 2048), transformer_feed_forward_use_glu=False, transformer_model_size=(512, 512), transformer_positional_embedding_type='fixed', transformer_postprocess=('dr', 'dr'), transformer_preprocess=('n', 'n'), update_interval=1, use_cpu=False, validation_source='data/valid.tag.src', validation_source_factors=[], validation_target='data/valid.tag.trg', validation_target_factors=[], weight_decay=0.0, weight_init='xavier', weight_init_scale=3.0, weight_init_xavier_factor_type='avg', weight_init_xavier_rand_type='uniform', weight_tying_type='src_trg_softmax', word_min_count=(1, 1))
[INFO:__main__] Adjusting maximum length to reserve space for a BOS/EOS marker. New maximum length: (96, 96)
[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [0]
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Acquired GPU 0.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:__main__] Training Device(s): gpu(0)
[INFO:sockeye.utils] Random seed: 1
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.data_io] Creating training data iterator
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.vocab] Vocabulary (29226 words) loaded from "train_data/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary (29226 words) loaded from "train_data/vocab.trg.0.json"
[INFO:sockeye.data_io] Tokens: source 10662257 target 10504428
[INFO:sockeye.data_io] Number of <unk> tokens: source 0 target 0
[INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
[INFO:sockeye.data_io] 442550 sequences across 12 buckets
[INFO:sockeye.data_io] 2395 sequences did not fit into buckets and were discarded
[INFO:sockeye.data_io] Bucket (8, 8): 19596 samples in 33 batches of 600, ~4073.8 target tokens/batch, trg/src length ratio: 1.06 (+-0.17)
[INFO:sockeye.data_io] Bucket (16, 16): 130903 samples in 381 batches of 344, ~4140.7 target tokens/batch, trg/src length ratio: 1.03 (+-0.19)
[INFO:sockeye.data_io] Bucket (24, 24): 112306 samples in 520 batches of 216, ~4083.3 target tokens/batch, trg/src length ratio: 1.01 (+-0.19)
[INFO:sockeye.data_io] Bucket (32, 32): 71702 samples in 472 batches of 152, ~4011.4 target tokens/batch, trg/src length ratio: 1.00 (+-0.17)
[INFO:sockeye.data_io] Bucket (40, 40): 44421 samples in 371 batches of 120, ~4060.2 target tokens/batch, trg/src length ratio: 0.99 (+-0.16)
[INFO:sockeye.data_io] Bucket (48, 48): 25989 samples in 271 batches of 96, ~3964.7 target tokens/batch, trg/src length ratio: 0.99 (+-0.15)
[INFO:sockeye.data_io] Bucket (56, 56): 15214 samples in 173 batches of 88, ~4277.7 target tokens/batch, trg/src length ratio: 0.98 (+-0.15)
[INFO:sockeye.data_io] Bucket (64, 64): 8995 samples in 125 batches of 72, ~4035.0 target tokens/batch, trg/src length ratio: 0.97 (+-0.14)
[INFO:sockeye.data_io] Bucket (72, 72): 5432 samples in 85 batches of 64, ~4055.4 target tokens/batch, trg/src length ratio: 0.97 (+-0.14)
[INFO:sockeye.data_io] Bucket (80, 80): 3432 samples in 62 batches of 56, ~3954.0 target tokens/batch, trg/src length ratio: 0.96 (+-0.14)
[INFO:sockeye.data_io] Bucket (88, 88): 2552 samples in 46 batches of 56, ~4399.1 target tokens/batch, trg/src length ratio: 0.97 (+-0.14)
[INFO:sockeye.data_io] Bucket (96, 96): 2008 samples in 32 batches of 64, ~5529.5 target tokens/batch, trg/src length ratio: 0.98 (+-0.13)
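These per-bucket batch sizes are consistent with word-based batching (batch_type='word', batch_size=4096 in the arguments above): the number of sentences per batch is chosen per bucket so that the average target-token count stays near 4096. For the (8, 8) bucket, for instance, 600 sentences × ~6.79 average target tokens ≈ 4073.8 tokens/batch, exactly the figure logged.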
[INFO:sockeye.data_io] Loading shard train_data/shard.00000.
[INFO:sockeye.data_io] =================================
[INFO:sockeye.data_io] Creating validation data iterator
[INFO:sockeye.data_io] =================================
[INFO:sockeye.data_io] 1802 sequences of maximum length (96, 96) in '/content/data/valid.tag.src' and '/content/data/valid.tag.trg'.
[INFO:sockeye.data_io] Mean training target/source length ratio: 1.02 (+-0.18)
[INFO:sockeye.data_io] Tokens: source 46171 target 45973
[INFO:sockeye.data_io] Number of <unk> tokens: source 2 target 2
[INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
[INFO:sockeye.data_io] 1802 sequences across 12 buckets
[INFO:sockeye.data_io] 15 sequences did not fit into buckets and were discarded
[INFO:sockeye.data_io] Bucket (8, 8): 61 samples in 1 batches of 600, ~4073.8 target tokens/batch, trg/src length ratio: 1.10 (+-0.19)
[INFO:sockeye.data_io] Bucket (16, 16): 448 samples in 2 batches of 344, ~4140.7 target tokens/batch, trg/src length ratio: 1.02 (+-0.19)
[INFO:sockeye.data_io] Bucket (24, 24): 451 samples in 3 batches of 216, ~4083.3 target tokens/batch, trg/src length ratio: 1.01 (+-0.19)
[INFO:sockeye.data_io] Bucket (32, 32): 359 samples in 3 batches of 152, ~4011.4 target tokens/batch, trg/src length ratio: 1.01 (+-0.17)
[INFO:sockeye.data_io] Bucket (40, 40): 178 samples in 2 batches of 120, ~4060.2 target tokens/batch, trg/src length ratio: 1.02 (+-0.19)
[INFO:sockeye.data_io] Bucket (48, 48): 124 samples in 2 batches of 96, ~3964.7 target tokens/batch, trg/src length ratio: 1.04 (+-0.17)
[INFO:sockeye.data_io] Bucket (56, 56): 66 samples in 1 batches of 88, ~4277.7 target tokens/batch, trg/src length ratio: 0.99 (+-0.16)
[INFO:sockeye.data_io] Bucket (64, 64): 46 samples in 1 batches of 72, ~4035.0 target tokens/batch, trg/src length ratio: 0.98 (+-0.14)
[INFO:sockeye.data_io] Bucket (72, 72): 23 samples in 1 batches of 64, ~4055.4 target tokens/batch, trg/src length ratio: 0.99 (+-0.11)
[INFO:sockeye.data_io] Bucket (80, 80): 25 samples in 1 batches of 56, ~3954.0 target tokens/batch, trg/src length ratio: 0.99 (+-0.12)
[INFO:sockeye.data_io] Bucket (88, 88): 11 samples in 1 batches of 56, ~4399.1 target tokens/batch, trg/src length ratio: 0.91 (+-0.08)
[INFO:sockeye.data_io] Bucket (96, 96): 10 samples in 1 batches of 64, ~5529.5 target tokens/batch, trg/src length ratio: 1.02 (+-0.10)
[INFO:sockeye.data_io] Created bucketed parallel data set. Introduced padding: source=16.5% target=16.8%)
[INFO:sockeye.vocab] Vocabulary saved to "/content/iwslt_model/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary saved to "/content/iwslt_model/vocab.trg.0.json"
[INFO:__main__] Vocabulary sizes: source=[29226] target=[29226]
[INFO:__main__] Source embedding size was not set it will automatically be adjusted to match the Transformer source model size (512).
[INFO:__main__] Target embedding size was not set it will automatically be adjusted to match the Transformer target model size (512).
[INFO:sockeye.model] ModelConfig(config_data=DataConfig(data_statistics=DataStatistics(num_sents=442550, num_discarded=2395, num_tokens_source=10662257, num_tokens_target=10504428, num_unks_source=0, num_unks_target=0, max_observed_len_source=96, max_observed_len_target=96, size_vocab_source=29226, size_vocab_target=29226, length_ratio_mean=1.0102479170496759, length_ratio_std=0.1793994964134372, buckets=[(8, 8), (16, 16), (24, 24), (32, 32), (40, 40), (48, 48), (56, 56), (64, 64), (72, 72), (80, 80), (88, 88), (96, 96)], num_sents_per_bucket=[19596, 130903, 112306, 71702, 44421, 25989, 15214, 8995, 5432, 3432, 2552, 2008], average_len_target_per_bucket=[6.7895999183506675, 12.036798239918186, 18.90396773102061, 26.39061671919884, 33.83516805114739, 41.298857208818845, 48.609767319573955, 56.0414674819342, 63.36616347569953, 70.6069347319347, 78.55446708463948, 86.39840637450203], length_ratio_stats_per_bucket=[(1.0629691798131742, 0.17423598623300557), (1.031108907820037, 0.1930931812097123), (1.0103493851348708, 0.1895598677461141), (0.9976084216183412, 0.1697442200761039), (0.9910226967134623, 0.15734516229462056), (0.985247167251298, 0.1495865401607151), (0.9798646721353192, 0.1515138948656466), (0.9724385239790109, 0.13814620473774203), (0.9676546696165079, 0.1372053441005369), (0.963085252277705, 0.14170991805183172), (0.9701350918651122, 0.14106125840309341), (0.9767222098806323, 0.13310340059289394)]), max_seq_len_source=96, max_seq_len_target=96, num_source_factors=1, num_target_factors=1), vocab_source_size=29226, vocab_target_size=29226, config_embed_source=EmbeddingConfig(vocab_size=29226, num_embed=512, dropout=0.0, num_factors=1, factor_configs=None, allow_sparse_grad=True), config_embed_target=EmbeddingConfig(vocab_size=29226, num_embed=512, dropout=0.0, num_factors=1, factor_configs=None, allow_sparse_grad=True), config_encoder=TransformerConfig(model_size=512, attention_heads=8, feed_forward_num_hidden=2048, act_type='relu', num_layers=6, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=96, max_seq_len_target=96, decoder_type='transformer', use_lhuc=False, depth_key_value=0, use_glu=False), config_decoder=TransformerConfig(model_size=512, attention_heads=8, feed_forward_num_hidden=2048, act_type='relu', num_layers=6, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=96, max_seq_len_target=96, decoder_type='transformer', use_lhuc=False, depth_key_value=512, use_glu=False), config_length_task=None, weight_tying_type='src_trg_softmax', lhuc=False, dtype='float32', intgemm_custom_lib='/usr/local/lib/python3.7/dist-packages/sockeye/libintgemm.so')
[INFO:sockeye.lr_scheduler] Will reduce the learning rate by a factor of 0.90 whenever the validation score doesn't improve 8 times.
[INFO:__main__] Optimizer: adam | kvstore=device | params={'wd': 0.0, 'learning_rate': 0.0002, 'rescale_grad': 1.0, 'lr_scheduler': LearningRateSchedulerPlateauReduce(reduce_factor=0.90, reduce_num_not_improved=8, num_not_improved=0, base_lr=None, lr=None, warmup=0, warmed_up=True)} | initializer=<mxnet.initializer.Xavier object at 0x7f7d08674cd0>
[INFO:__main__] Gradient accumulation over 1 batch(es) by 1 worker(s). Effective batch size: 4096
[INFO:sockeye.utils] # of parameters: 59194922 | trainable: 59096618 (99.83%) | fixed: 98304 (0.17%)
[INFO:sockeye.utils] Trainable parameters: 
['decoder_transformer_0_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_0_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_0_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_0_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_0_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_0_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_0_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_0_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_0_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_0_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_0_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_0_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_0_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_0_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_0_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_1_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_1_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_1_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_1_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_1_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_1_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_1_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_1_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_1_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_1_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_1_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_1_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_1_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_1_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_1_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_2_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_2_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_2_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_2_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_2_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_2_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_2_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_2_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_2_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_2_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_2_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_2_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_2_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_2_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_2_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_3_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_3_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_3_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_3_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_3_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_3_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_3_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_3_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_3_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_3_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_3_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_3_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_3_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_3_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_3_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_4_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_4_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_4_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_4_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_4_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_4_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_4_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_4_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_4_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_4_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_4_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_4_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_4_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_4_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_4_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_5_att_enc_h2o_weight [(512, 512), float32]',
 'decoder_transformer_5_att_enc_kv2h_weight [(1024, 512), float32]',
 'decoder_transformer_5_att_enc_pre_norm_beta [(512,), float32]',
 'decoder_transformer_5_att_enc_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_5_att_enc_q2h_weight [(512, 512), float32]',
 'decoder_transformer_5_att_self_h2o_weight [(512, 512), float32]',
 'decoder_transformer_5_att_self_i2h_weight [(1536, 512), float32]',
 'decoder_transformer_5_att_self_pre_norm_beta [(512,), float32]',
 'decoder_transformer_5_att_self_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_5_ff_h2o_bias [(512,), float32]',
 'decoder_transformer_5_ff_h2o_weight [(512, 2048), float32]',
 'decoder_transformer_5_ff_i2h_bias [(2048,), float32]',
 'decoder_transformer_5_ff_i2h_weight [(2048, 512), float32]',
 'decoder_transformer_5_ff_pre_norm_beta [(512,), float32]',
 'decoder_transformer_5_ff_pre_norm_gamma [(512,), float32]',
 'decoder_transformer_final_process_norm_beta [(512,), float32]',
 'decoder_transformer_final_process_norm_gamma [(512,), float32]',
 'encoder_transformer_0_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_0_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_0_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_0_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_0_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_0_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_0_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_0_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_0_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_0_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_1_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_1_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_1_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_1_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_1_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_1_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_1_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_1_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_1_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_1_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_2_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_2_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_2_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_2_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_2_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_2_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_2_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_2_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_2_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_2_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_3_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_3_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_3_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_3_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_3_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_3_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_3_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_3_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_3_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_3_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_4_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_4_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_4_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_4_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_4_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_4_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_4_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_4_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_4_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_4_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_5_att_self_h2o_weight [(512, 512), float32]',
 'encoder_transformer_5_att_self_i2h_weight [(1536, 512), float32]',
 'encoder_transformer_5_att_self_pre_norm_beta [(512,), float32]',
 'encoder_transformer_5_att_self_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_5_ff_h2o_bias [(512,), float32]',
 'encoder_transformer_5_ff_h2o_weight [(512, 2048), float32]',
 'encoder_transformer_5_ff_i2h_bias [(2048,), float32]',
 'encoder_transformer_5_ff_i2h_weight [(2048, 512), float32]',
 'encoder_transformer_5_ff_pre_norm_beta [(512,), float32]',
 'encoder_transformer_5_ff_pre_norm_gamma [(512,), float32]',
 'encoder_transformer_final_process_norm_beta [(512,), float32]',
 'encoder_transformer_final_process_norm_gamma [(512,), float32]',
 'source_target_embed_weight [(29226, 512), float32]',
 'target_output_bias [(29226,), float32]']
[INFO:sockeye.utils] Fixed parameters:
['decoder_transformer_target_pos_embed_weight [(96, 512), float32]',
 'encoder_transformer_source_pos_embed_weight [(96, 512), float32]']
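As a sanity check, the fixed-parameter count adds up: the two fixed positional-embedding tables are each of shape (96, 512), giving 2 × 96 × 512 = 98,304 parameters, which matches the "fixed: 98304 (0.17%)" figure above.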
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
[INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: perplexity (ppl) | output_name: 'logits' | label_name: 'target_label'
[INFO:sockeye.training] Logging training events for Tensorboard at 'iwslt_model/tensorboard'
[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [0]
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 62s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 37s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 17s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 69s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 13s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 32s.
[INFO:sockeye.utils] Acquired GPU master_lock.
[INFO:sockeye.utils] Could not acquire GPU 0. It's currently locked.
[INFO:sockeye.utils] Releasing GPU master_lock.
[INFO:sockeye.utils] Not enough GPUs available will try again in 51s.
[INFO:sockeye.utils] Releasing GPU 0.
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 1149, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 906, in main
    train(args)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 1142, in train
    training_model, source_vocabs, target_vocabs, hybridize=hybridize)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/train.py", line 225, in create_checkpoint_decoder
    exit_stack=exit_stack)[0]
  File "/usr/local/lib/python3.7/dist-packages/sockeye/utils.py", line 335, in determine_context
    context = exit_stack.enter_context(acquire_gpus(device_ids, lock_dir=lock_dir))
  File "/usr/lib/python3.7/contextlib.py", line 427, in enter_context
    result = _cm_type.__enter__(cm)
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/dist-packages/sockeye/utils.py", line 470, in acquire_gpus
    time.sleep(retry_wait_actual)
KeyboardInterrupt
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 587, in _notify_shutdown
    check_call(_LIB.MXNotifyShutdown())
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../src/common/random_generator.cu", line 58
Name: Check failed: err == cudaSuccess (209 vs. 0) : rand_generator_seed_kernel ErrStr:no kernel image is available for execution on the device

How can I resolve this?
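For reference, the failure can be reproduced outside of Sockeye. The following is a minimal check (a sketch I would run in the same Colab runtime; it is not part of the tutorial):

import mxnet as mx

print("MXNet", mx.__version__, "| GPUs visible:", mx.context.num_gpus())
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
x *= 2           # schedules a CUDA kernel on GPU 0
mx.nd.waitall()  # blocks until execution; on an unsupported GPU
                 # architecture this raises the same
                 # "no kernel image is available" MXNetError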

Note: The files mentioned in the notebook (taken from the aforementioned tutorial) can be viewed here.

  1. installation.sh
  2. download_and_move.sh
  3. preprocess.sh
  4. prepare_data.sh
  5. train.sh
tdomhan commented 3 years ago

Hi! How did you install MXNet, and which version? This looks like an issue related to MXNet and the specific GPU available on Google Colab.

ekdnam commented 3 years ago

How did you install MXNet, and which version?

I installed MXNet as specified in requirements.gpu-cu110.txt (i.e. mxnet-cu110==1.8.0.post0) by executing

pip install sockeye --no-deps -r requirements.gpu-cu110.txt

This looks like an issue related to MXNet and the specific GPU available on Google Colab.

What could the issue with MXNet be?

Personally, I have never faced any problems with GPUs on Google Colab. Could you give some more information about how the error might be related to the GPU?

tdomhan commented 3 years ago

Thanks! So MXNet has kernels for different device types built into its binaries. It seems there is a mismatch between the architectures the binary was built for and what the GPU on Google Colab requires. Unfortunately, this is an MXNet issue. Could you open an issue on the MXNet repository? The MXNet developers should be able to point out a potential way forward: https://github.com/apache/incubator-mxnet
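One way to see which GPU Colab assigned, and hence which compute architecture the MXNet wheel would need to have been compiled for, is a sketch like:

import subprocess

# List the GPU in the current runtime; the model name maps to a compute
# architecture (e.g. Tesla T4 -> sm_75, Tesla K80 -> sm_37) that the
# installed mxnet-cu110 wheel must include kernels for.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)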

ekdnam commented 3 years ago

Okay. Thanks for your response.

I have created an issue on the MXNet repo (apache/incubator-mxnet/issues/20469); let's see how it goes.

tdomhan commented 3 years ago

Thanks! 🤞 I will close the issue here and we can continue on the MXNet issue.