
Wrap MLPerf training scripts into Unified Job yaml format and create testing script #39

Open Fizzbb opened 2 years ago

Fizzbb commented 2 years ago

Check out the MLPerf training reference implementations and companies' submission results. Find the ones that can be used in the Alnair cluster (no specific hardware requirements such as A100, TPU, ...). Download the datasets to the NFS drive. Wrap the scripts into a Kubernetes-compatible job format and test them out in the clusters. Note: prepare at least 3 different types of training jobs (object detection, reinforcement learning, ...).
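For reference, a minimal sketch of what a wrapped job could look like, assuming a plain Kubernetes `Job` as the unified format; the image name, NFS server address, and training flags are hypothetical placeholders:

```bash
# Sketch only: wraps an MLPerf-style training script as a Kubernetes Job.
# Image name, NFS server, and training flags are hypothetical placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: maskrcnn-training
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: example.io/alnair/maskrcnn:latest          # hypothetical image
        command: ["bash", "-c", "python train.py --max_epochs 1"]  # cap epochs for testing
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        nfs:
          server: 10.0.0.10                               # hypothetical NFS server
          path: /nfs_3/alnair
EOF
```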

Fizzbb commented 2 years ago

Chose Nvidia's DeepLearningExamples repo, which is better maintained; SageMaker also uses it for some of its examples. Currently picked:
1) PyTorch, image segmentation, Mask R-CNN; COCO dataset, 40 GB
2) PyTorch, translation, Transformer; WMT English-German dataset, 14 GB
3) PyTorch, forecasting, TFT; electricity and traffic benchmark datasets

Fizzbb commented 2 years ago

Issue 1: the Transformer image was built against NVIDIA driver 465, but the machine had driver 460 installed, so the container could not run. Upgraded the machine's driver to 470. In the process, kubeadm was accidentally upgraded from 1.21.1-00 to 1.23 and the node could no longer join the cluster (version 1.21). Fix: remove kubeadm, kubelet, and kubectl and reinstall at the pinned version, as sketched below.
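A sketch of the pinned reinstall, assuming an Ubuntu node with the Kubernetes apt repository already configured:

```bash
# Sketch: pin kubeadm/kubelet/kubectl back to the cluster's 1.21 series
# after the accidental upgrade to 1.23 (Ubuntu/apt assumed).
sudo apt-get remove -y kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm=1.21.1-00 kubelet=1.21.1-00 kubectl=1.21.1-00
sudo apt-mark hold kubeadm kubelet kubectl   # prevent unintended upgrades
```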

Fizzbb commented 2 years ago

Copied the training data to /nfs_3/alnair. Capped the training length (max epochs/iterations, target loss) to end early for testing purposes; performance data is reported after training ends. Default training uses 8 cards, but Titan34 only has 7 working ones, so the launcher must be told to use 7, as sketched below.
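One way to run on 7 cards is to lower the process count of the PyTorch distributed launcher; the entry point and its data/iteration flags below are hypothetical placeholders:

```bash
# Sketch: limit the launcher to the 7 GPUs available on Titan34 instead of
# the default 8; train.py and its flags stand in for the actual
# DeepLearningExamples entry point.
python -m torch.distributed.launch --nproc_per_node=7 train.py \
    --data /nfs_3/alnair/coco \
    --max_iter 100    # end early for testing (hypothetical flag)
```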

Fizzbb commented 2 years ago

In the Transformer image's validation stage, computing the BLEU score against the 'valid.raw.de' dataset triggers an OOM; changing the dataset in different ways instead causes a stream-length mismatch. Root cause unclear; commenting out the BLEU score function for now.

Fizzbb commented 2 years ago

The Transformer image saves checkpoints, and a single file is 2.7 GB! Also, after one successful run with a saved checkpoint, training will not rerun the next time the job is launched: the checkpoint file is found at the referenced path, so no re-computation happens.
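Since the script resumes whenever it finds an existing checkpoint, a rerun can be forced by clearing the checkpoint directory before relaunching the job; the path below is a hypothetical example:

```bash
# Sketch: force retraining by removing the saved checkpoint before the next
# run; the checkpoint path and file extension are hypothetical placeholders.
CKPT_DIR=/nfs_3/alnair/checkpoints/transformer
rm -f "${CKPT_DIR}"/*.pt
```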