ConnorJL / GPT2

An implementation of training for GPT2, supports TPUs
MIT License

Retraining a new model, only GPU 0 is used #32

Closed (yds1024 closed this issue 3 years ago)

yds1024 commented 3 years ago

My batch size is "train_batch_size": 4. Output of nvidia-smi during training:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   62C    P0   349W / 350W |  30630MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   28C    P0    70W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  Off  | 00000000:39:00.0 Off |                    0 |
| N/A   37C    P0    71W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   57C    P0    75W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM3...  Off  | 00000000:57:00.0 Off |                    0 |
| N/A   27C    P0    68W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM3...  Off  | 00000000:59:00.0 Off |                    0 |
| N/A   36C    P0    67W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM3...  Off  | 00000000:5C:00.0 Off |                    0 |
| N/A   30C    P0    66W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM3...  Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   38C    P0    69W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla V100-SXM3...  Off  | 00000000:B7:00.0 Off |                    0 |
| N/A   30C    P0    66W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla V100-SXM3...  Off  | 00000000:B9:00.0 Off |                    0 |
| N/A   30C    P0    66W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla V100-SXM3...  Off  | 00000000:BC:00.0 Off |                    0 |
| N/A   36C    P0    68W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla V100-SXM3...  Off  | 00000000:BE:00.0 Off |                    0 |
| N/A   38C    P0    68W / 350W |    428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla V100-SXM3...  Off  | 00000000:E0:00.0 Off |                    0 |
| N/A   30C    P0    66W / 350W |    428MiB / 32480MiB |      0%      Default |

ConnorJL commented 3 years ago

This repo isn't built for multi-GPU training, I'm afraid (and it is also deprecated). I would recommend using Hugging Face's transformers instead.
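
For anyone checking whether a training run really is confined to one device, here is a minimal Python sketch that lists the idle GPUs from nvidia-smi's CSV query output. It is not part of this repo. The helper name `idle_gpus` and the 5% threshold are assumptions made for illustration, and the sample rows simply reuse the GPU 0 to 3 readings posted above.

```python
import csv
import io

# Sample text in the shape produced by:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits
# The values for GPUs 0-3 are copied from the table posted in this issue.
SAMPLE = """0, 100, 30630
1, 0, 428
2, 0, 428
3, 0, 428
"""


def idle_gpus(query_output, util_threshold=5):
    """Return indices of GPUs whose utilization is below util_threshold percent.

    Hypothetical helper; the threshold is an arbitrary illustrative choice.
    """
    idle = []
    reader = csv.reader(io.StringIO(query_output), skipinitialspace=True)
    for row in reader:
        if not row:  # skip blank lines
            continue
        index, util, _mem_used = (int(field) for field in row)
        if util < util_threshold:
            idle.append(index)
    return idle


if __name__ == "__main__":
    print(idle_gpus(SAMPLE))
```

On a live machine, the same function could be fed the output of the nvidia-smi command shown in the comment, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.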