Open: Fizzbb opened this issue 2 years ago
Chose NVIDIA's DeepLearningExamples repo, which is better maintained; SageMaker also uses this repo for some of its examples. Currently picked: 1) PyTorch, image segmentation, Mask R-CNN; COCO dataset, 40 GB. 2) PyTorch, translation, Transformer; WMT English-German dataset, 14 GB. 3) PyTorch, forecasting, TFT; electricity and traffic benchmark datasets.
Issue 1: the Transformer image was built against NVIDIA driver 465, but the machine had driver 460 installed, so the container could not run. Upgraded the machine's driver to 470. In the process, accidentally upgraded kubeadm from 1.21.1-00 to 1.23, after which the node could not join the cluster (version 1.21); removed kubeadm, kubelet, and kubectl and reinstalled them pinned to the cluster's version.
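The reinstall step might look like the following apt commands (a sketch, not the exact commands run; package names and version pins follow the standard Kubernetes apt repository layout):

```shell
# Remove the accidentally upgraded 1.23 packages
sudo apt-get remove -y kubeadm kubelet kubectl

# Reinstall pinned to the cluster's version (1.21.1-00)
sudo apt-get update
sudo apt-get install -y --allow-downgrades \
    kubeadm=1.21.1-00 kubelet=1.21.1-00 kubectl=1.21.1-00

# Hold the packages so unattended upgrades don't bump them again
sudo apt-mark hold kubeadm kubelet kubectl
```

Holding the packages avoids a repeat of the version-skew join failure.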
Copied the training data to /nfs_3/alnair. Limited training time (max epochs/iterations, loss threshold) so runs end early for testing purposes; performance data is reported after training ends. Default training uses 8 GPUs, but Titan34 only has 7 working ones.
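The early-stop cap described above can be sketched as a small wrapper around the training loop; `step_fn`, `max_iters`, and `loss_target` are hypothetical names, not flags from the DeepLearningExamples repo:

```python
# Minimal sketch of capping a training run for smoke testing:
# stop after `max_iters` iterations or once the loss falls below
# `loss_target`, whichever comes first.
def run_capped_training(step_fn, max_iters=100, loss_target=0.05):
    """step_fn() performs one training step and returns the loss."""
    history = []
    for _ in range(max_iters):
        loss = step_fn()
        history.append(loss)
        if loss <= loss_target:
            break
    return history

# Toy usage: a fake step whose loss decays geometrically.
losses = run_capped_training(
    step_fn=iter(0.5 * 0.9 ** i for i in range(1000)).__next__,
    max_iters=20,
    loss_target=0.1,
)
```

For the 7-GPU limit on Titan34, one common approach (assuming the repo's launch scripts respect standard PyTorch conventions) is to restrict the visible devices, e.g. `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6` together with `--nproc_per_node=7`.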
Issue 2: in the Transformer image's validation stage, computing the BLEU score against the 'valid.raw.de' dataset triggers an OOM; changing the dataset in different ways instead causes a stream-length mismatch. Root cause is not yet clear, so the BLEU score function is commented out for now.
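One possible direction for the OOM, sketched below: score the validation set in fixed-size chunks instead of materializing every hypothesis/reference pair at once. `score_pairs` stands in for whatever BLEU routine the image uses; it and the chunk size are assumptions, not the actual Transformer code:

```python
def batched(pairs, size):
    """Yield lists of at most `size` (hypothesis, reference) pairs."""
    batch = []
    for p in pairs:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final partial chunk so no pairs are dropped

def chunked_score(pairs, score_pairs, size=512):
    """Average the per-chunk scores, weighted by chunk length."""
    total_score, total = 0.0, 0
    for batch in batched(pairs, size):
        total_score += score_pairs(batch) * len(batch)
        total += len(batch)
    return total_score / total if total else 0.0
```

Emitting the final partial chunk (rather than padding or truncating) is also the kind of detail that, if mishandled, could produce a hypothesis/reference stream-length mismatch like the one observed.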
Issue 3: the Transformer image saves a checkpoint, and a single file is 2.7 GB! Also, after one successful run has saved a checkpoint, training will not rerun the next time the job is launched: the checkpoint file is found at the reference path, so no re-computation happens.
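The skip-if-checkpoint-exists behavior can be sketched as follows; the path and the `train`/`resume` callables are hypothetical, not the repo's actual launcher code:

```python
import os

def maybe_train(ckpt_path, train, resume):
    """Mimic the observed launcher behavior: if a checkpoint already
    exists at the reference path, resume/skip instead of training."""
    if os.path.exists(ckpt_path):
        return resume(ckpt_path)  # checkpoint found: no re-computation
    return train()
```

So to force a clean rerun for benchmarking, each job should point at a fresh checkpoint directory (or delete the old file first). On size: if the 2.7 GB file bundles optimizer and scheduler state, saving only the model's `state_dict()` (standard PyTorch practice) would typically shrink it, at the cost of not being able to resume mid-training.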
Next steps: check out the MLPerf training reference implementations and companies' submission results. Find the ones that can be used in the Alnair cluster (no specific hardware requirements, like A100, TPU...). Download the datasets to the NFS drive. Wrap them into a Kubernetes-compatible format and test them out in the clusters. Note: prepare at least 3 different types of training jobs (object detection, reinforcement learning, ...).
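Wrapping a training job into a Kubernetes-compatible format might look like the manifest below; the image name, NFS server, and GPU count are placeholders, and it assumes the NVIDIA device plugin is installed so `nvidia.com/gpu` is schedulable:

```yaml
# Hypothetical Job manifest for one training workload.
apiVersion: batch/v1
kind: Job
metadata:
  name: maskrcnn-coco-train
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: example.registry/maskrcnn:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 7    # Titan34 only has 7 working GPUs
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        nfs:
          server: nfs.example.internal   # placeholder NFS server
          path: /nfs_3/alnair
```

Mounting the shared NFS path keeps the large datasets (40 GB COCO, 14 GB WMT) out of the container image itself.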