StrongResearch / isc-demos

Deep learning examples for the Instant Super Computer

Lightning language model example #15

Closed StrongCalvin closed 1 year ago

StrongCalvin commented 1 year ago

Removed _set_epoch(). Everything now uses set_epoch(). All tests pass.

The Lightning example works and has trained 10 of 50 epochs so far.

StrongChris commented 1 year ago

Please add the wikitext dataset to .gitignore.

StrongChris commented 1 year ago
rank0.txt

```
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Namespace(lr=0.1, batch_size=20, save_dir=PosixPath('/mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277'), save_freq=5, epochs=50)
Namespace(lr=0.1, batch_size=20, save_dir=PosixPath('/mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277'), save_freq=5, epochs=50)
[rank: 4] Global seed set to 42
Namespace(lr=0.1, batch_size=20, save_dir=PosixPath('/mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277'), save_freq=5, epochs=50)
[rank: 2] Global seed set to 42
Namespace(lr=0.1, batch_size=20, save_dir=PosixPath('/mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277'), save_freq=5, epochs=50)
Namespace(lr=0.1, batch_size=20, save_dir=PosixPath('/mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277'), save_freq=5, epochs=50)
Namespace(lr=0.1, batch_size=20, save_dir=PosixPath('/mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277'), save_freq=5, epochs=50)
[rank: 1] Global seed set to 42
[rank: 5] Global seed set to 42
[rank: 0] Global seed set to 42
[rank: 3] Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 5] Global seed set to 42
Missing logger folder: /mnt/Client/strongcompute_chris/isc-demos/lightning-examples/lightning_logs
[rank: 2] Global seed set to 42
[rank: 1] Global seed set to 42
Missing logger folder: /mnt/Client/strongcompute_chris/isc-demos/lightning-examples/lightning_logs
Missing logger folder: /mnt/Client/strongcompute_chris/isc-demos/lightning-examples/lightning_logs
[rank: 4] Global seed set to 42
[rank: 0] Global seed set to 42
Missing logger folder: /mnt/Client/strongcompute_chris/isc-demos/lightning-examples/lightning_logs
[rank: 3] Global seed set to 42
You are using a CUDA device ('NVIDIA GeForce RTX 3090 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Missing logger folder: /mnt/Client/strongcompute_chris/isc-demos/lightning-examples/lightning_logs
Missing logger folder: /mnt/Client/strongcompute_chris/isc-demos/lightning-examples/lightning_logs
/mnt/Client/strongcompute_chris/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:626: UserWarning: Checkpoint directory /mnt/Client/strongcompute_chris/.output_lightning_example/exp_1277 exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]

  | Name  | Type        | Params
--------------------------------------
0 | model | Transformer | 15.6 M
--------------------------------------
14.6 M    Trainable params
1.0 M     Non-trainable params
15.6 M    Total params
62.543    Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
Sanity Checking: 0%| | 0/2 [00:00
```

How can I tell whether the job is making any progress? Also, any ideas about the traceback?
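One rough way to check progress from a rank log such as rank0.txt is to look for the most recent progress-bar fragment and compare it between two reads. The sketch below assumes the log contains tqdm-style fragments shaped like the `Sanity Checking: 0%| | 0/2 [00:00` line above. The function name `last_progress` and the regular expression are illustrative assumptions, not part of isc-demos or Lightning.

```python
import re

# Assumed fragment shape (tqdm-style), e.g. "Sanity Checking: 0%| | 0/2 [00:00"
# or "Epoch 3:  45%|####5     | 9/20 [00:12<00:15".
PROGRESS_RE = re.compile(
    r"(?P<stage>[A-Za-z][A-Za-z ]*?(?: \d+)?):\s+"  # stage label, optional number
    r"(?P<pct>\d+)%\|[^|\n]*\|\s*"                  # percentage and the bar itself
    r"(?P<done>\d+)/(?P<total>\d+)"                 # completed / total steps
)


def last_progress(log_text):
    """Return (stage, done, total) for the last progress fragment, or None."""
    match = None
    for match in PROGRESS_RE.finditer(log_text):
        pass  # keep only the final match in the log
    if match is None:
        return None
    return match.group("stage"), int(match.group("done")), int(match.group("total"))
```

Calling `last_progress` on the contents of the log a few minutes apart and getting the same tuple both times would suggest the job is stalled, though it cannot say why. A `None` result only means no fragment matched the assumed shape.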