OctoberChang / X-Transformer

X-Transformer: Taming Pretrained Transformers for eXtreme Multi-label Text Classification
BSD 3-Clause "New" or "Revised" License

No space left on device #17

Closed Khalid-Usman closed 2 years ago

Khalid-Usman commented 2 years ago

While running the Matcher step, I hit the following error:

```
08/07/2021 02:16:27 - INFO - __main__ - Running training
08/07/2021 02:16:27 - INFO - __main__ - Num examples = 15449
08/07/2021 02:16:27 - INFO - __main__ - Num Epochs = 3
08/07/2021 02:16:27 - INFO - __main__ - Instantaneous batch size per GPU = 8
08/07/2021 02:16:27 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 32
08/07/2021 02:16:27 - INFO - __main__ - Gradient Accumulation steps = 4
08/07/2021 02:16:27 - INFO - __main__ - Total optimization steps = 1000
08/07/2021 02:18:56 - INFO - __main__ - | [ 1/ 3][ 100/ 1000] | 399/1932 batches | ms/batch 4.5376 | train_loss 5.790506e-01 | lr 5.000000e-05
08/07/2021 02:21:27 - INFO - __main__ - | [ 1/ 3][ 200/ 1000] | 799/1932 batches | ms/batch 4.6409 | train_loss 2.755803e-01 | lr 4.444444e-05
08/07/2021 02:23:57 - INFO - __main__ - | [ 1/ 3][ 300/ 1000] | 1199/1932 batches | ms/batch 4.5217 | train_loss 2.105729e-01 | lr 3.888889e-05
08/07/2021 02:26:27 - INFO - __main__ - | [ 1/ 3][ 400/ 1000] | 1599/1932 batches | ms/batch 4.5165 | train_loss 1.729266e-01 | lr 3.333333e-05
08/07/2021 02:28:58 - INFO - __main__ - | [ 2/ 3][ 500/ 1000] | 67/1932 batches | ms/batch 4.5452 | train_loss 1.577764e-01 | lr 2.777778e-05
08/07/2021 02:31:29 - INFO - __main__ - | [ 2/ 3][ 600/ 1000] | 467/1932 batches | ms/batch 4.5317 | train_loss 1.491586e-01 | lr 2.222222e-05
08/07/2021 02:33:59 - INFO - __main__ - | [ 2/ 3][ 700/ 1000] | 867/1932 batches | ms/batch 4.4462 | train_loss 1.405306e-01 | lr 1.666667e-05
08/07/2021 02:36:30 - INFO - __main__ - | [ 2/ 3][ 800/ 1000] | 1267/1932 batches | ms/batch 4.5738 | train_loss 1.316148e-01 | lr 1.111111e-05
08/07/2021 02:39:00 - INFO - __main__ - | [ 2/ 3][ 900/ 1000] | 1667/1932 batches | ms/batch 4.5304 | train_loss 1.185597e-01 | lr 5.555556e-06
08/07/2021 02:41:31 - INFO - __main__ - | [ 3/ 3][ 1000/ 1000] | 135/1932 batches | ms/batch 4.4576 | train_loss 1.158848e-01 | lr 0.000000e+00
08/07/2021 02:41:33 - INFO - transformers.configuration_utils - Configuration saved in ./save_models/Eurlex-4K/pifa-tfidf-s0/matcher/bert-large-cased-whole-word-masking/config.json
Traceback (most recent call last):
  File "xbert/transformer.py", line 678, in <module>
    main()
  File "xbert/transformer.py", line 626, in main
    matcher.save_model(args)
  File "xbert/transformer.py", line 335, in save_model
    model_to_save.save_pretrained(args.output_dir)
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/site-packages/transformers/modeling_utils.py", line 249, in save_pretrained
    torch.save(model_to_save.state_dict(), output_model_file)
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/site-packages/torch/serialization.py", line 224, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/site-packages/torch/serialization.py", line 149, in _with_file_like
    return body(f)
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/site-packages/torch/serialization.py", line 224, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/site-packages/torch/serialization.py", line 302, in _save
    serialized_storages[key]._write_file(f, _should_read_directly(f))
RuntimeError: write(): fd 39 failed with No space left on device
Traceback (most recent call last):
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/khalid/anaconda3/envs/pt1.2_xmlc_transformer/bin/python', '-u', 'xbert/transformer.py', '--local_rank=0', '-m', 'bert', '-n', 'bert-large-cased-whole-word-masking', '--do_train', '-x_trn', './save_models/Eurlex-4K/proc_data/X.trn.bert.128.pkl', '-c_trn', './save_models/Eurlex-4K/proc_data/C.trn.pifa-tfidf-s0.npz', '-o', './save_models/Eurlex-4K/pifa-tfidf-s0/matcher/bert-large-cased-whole-word-masking', '--overwrite_output_dir', '--per_device_train_batch_size', '8', '--gradient_accumulation_steps', '4', '--max_steps', '1000', '--warmup_steps', '100', '--learning_rate', '5e-5', '--logging_steps', '100']' returned non-zero exit status 1.
```
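The `RuntimeError: write(): fd 39 failed with No space left on device` comes from `torch.save()` inside `save_pretrained()`: the filesystem holding the output directory filled up while writing the checkpoint, so training completed but the model was lost. Since a `bert-large` checkpoint is on the order of a gigabyte, one way to avoid losing a run like this is a pre-flight free-space check before launching training. Below is a minimal sketch using the standard library's `shutil.disk_usage`; the helper name `has_free_space` and the 5 GiB threshold are my own illustrative choices, not part of X-Transformer:

```python
import shutil

def has_free_space(path, required_gb):
    """Return True if the filesystem containing `path` has at least
    `required_gb` GiB free.

    Hypothetical helper (not part of the X-Transformer codebase): run it
    against the checkpoint output directory before training so that
    torch.save() does not fail at the very end of a long run.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3

# Example: require roughly 5 GiB of headroom in the current directory
# before starting a run that will save a bert-large checkpoint.
if not has_free_space(".", 5):
    raise SystemExit("Not enough disk space for the model checkpoint.")
```

Checking the launch directory is only a heuristic: if the output directory (`-o ./save_models/...`) lives on a different mount, pass that path instead.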