Closed: StrongCalvin closed this issue 1 year ago.
Please add the wikitext dataset to .gitignore
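A minimal sketch of the requested change, assuming the example downloads wikitext into directories named like `wikitext-2/` or `wikitext-103/` at the repo root (the actual download path isn't stated in this issue, so adjust the patterns to match):

```shell
# Append hypothetical wikitext dataset paths to .gitignore so the
# downloaded data is never committed. The directory names below are
# assumptions; verify them against where the example actually saves data.
cat >> .gitignore <<'EOF'
# Ignore downloaded wikitext datasets
wikitext-2/
wikitext-103/
EOF
```

After this, `git status` should no longer list the dataset directories as untracked.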
```
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Namespace(lr=0.1, batch_size=20, save_dir=PosixPath('/mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277'), save_freq=5, epochs=50)
  (Namespace printed once per process; "[rank: N] Global seed set to 42" printed for ranks 0-5)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /mnt/Client/strongcompute_chris/isc-demos/lightning-examples/lightning_logs
  (repeated on each rank)
You are using a CUDA device ('NVIDIA GeForce RTX 3090 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/mnt/Client/strongcompute_chris/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:626: UserWarning: Checkpoint directory /mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277 exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]
  (repeated for LOCAL_RANK 1-5)

  | Name  | Type        | Params
--------------------------------------
0 | model | Transformer | 15.6 M
--------------------------------------
14.6 M    Trainable params
1.0 M     Non-trainable params
15.6 M    Total params
62.543    Total estimated model params size (MB)

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00, ?it/s]
/mnt/Client/strongcompute_chris/.venv/lib/python3.10/site-packages/torch/nn/modules/activation.py:1160: UserWarning: Converting mask without torch.bool dtype to bool; this will negatively affect performance. Prefer to use a boolean mask directly. (Triggered internally at ../aten/src/ATen/native/transformers/attention.cpp:150.)
  return torch._native_multi_head_attention(
  (warning repeated on each rank)
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 2.05it/s]
/mnt/Client/strongcompute_chris/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:426: PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
Training: 0it [00:00, ?it/s]
Epoch 0:   0%|          | 0/49 [00:00, ?it/s]

--- second launch: the same startup log repeats, then ---

Sanity Checking: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/pymp-hfo4o_3l'

--- the sanity check and warnings then repeat, and the run again stops at ---

Epoch 0:   0%|          | 0/49 [00:00, ?it/s]

--- a third launch follows with an identical log, also ending at Epoch 0: 0% ---
```
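Separately from the stall, one warning in the log is directly actionable: PyTorch suggests enabling TF32 matmuls on this Ampere-class GPU (RTX 3090 Ti). A minimal sketch, assuming slightly reduced float32 precision is acceptable for this example:

```python
import torch

# The log's Tensor Cores warning recommends this one-time, process-wide
# setting. "high" trades less precision than "medium"; both speed up
# float32 matmuls on GPUs with Tensor Cores.
torch.set_float32_matmul_precision("high")
```

Placed once near the top of the training script (before the Trainer runs), this silences the warning and uses the Tensor Cores.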
How can I tell whether the job is making any progress at all? Also, any ideas about the traceback?
Removed _set_epoch(). Everything now uses set_epoch(). All tests pass.
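For reference, the fix described above can be sketched as follows. This is an illustrative example of the public `DistributedSampler.set_epoch()` API, not the repo's actual training loop; the two-replica sampler configuration here is an assumption so the snippet runs without a process group:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset; in real use each rank constructs the same sampler and
# torch.distributed supplies num_replicas/rank automatically.
dataset = TensorDataset(torch.arange(10))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(3):
    # Public API (not a private _set_epoch helper): reseeds the shuffle so
    # every rank draws a different, but mutually consistent, permutation
    # each epoch instead of repeating epoch 0's order.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass
```

Calling `set_epoch` before iterating is what makes shuffling vary across epochs in distributed training; without it every epoch reuses the same ordering.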
The Lightning example works and has trained 10 of 50 epochs so far.