Open daniel-z-kaplan opened 9 months ago
Rework the scripts folder completely. Have folders for llava_v1, llava_v1.5, robin_v1, robin_v2, and evals. In robin_v2, have a folder for each cluster (including cedar and frontier), each with install, pretrain, and finetune scripts.
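To make the proposed layout concrete, here is a minimal sketch that scaffolds it. The folder and script names come from the description above; the `.sh` extension, the empty placeholder files, and the function name `create_scripts_layout` are assumptions for illustration only.

```python
from pathlib import Path

# Names taken from the issue text.
VERSION_DIRS = ["llava_v1", "llava_v1.5", "robin_v1", "robin_v2", "evals"]
CLUSTERS = ["cedar", "frontier"]  # the issue asks for "each cluster"; only these two are named
STAGES = ["install", "pretrain", "finetune"]


def create_scripts_layout(root):
    """Create the proposed folder tree under `root` and return the placeholder scripts."""
    root = Path(root)
    for name in VERSION_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)

    created = []
    for cluster in CLUSTERS:
        cluster_dir = root / "robin_v2" / cluster
        cluster_dir.mkdir(parents=True, exist_ok=True)
        for stage in STAGES:
            # Assumption: shell scripts with a .sh extension, created empty.
            script = cluster_dir / f"{stage}.sh"
            script.touch(exist_ok=True)
            created.append(script)
    return created
```

Calling `create_scripts_layout("scripts")` would produce, for example, `scripts/robin_v2/cedar/install.sh` and `scripts/robin_v2/frontier/finetune.sh`, alongside the empty top-level version folders.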
Use of train_mem.py: in multinode training, the launch script does not set environment variables properly (it sets them on the main node but not on the others). Because train_mem.py runs on every node, setting the variables there applies them on all nodes.
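A minimal sketch of that idea, assuming a per-cluster table of defaults applied at the top of the entry script before training starts. The variable name `EXAMPLE_CACHE_DIR`, its values, and the helper `apply_cluster_env` are hypothetical; the issue does not say which variables are involved.

```python
import os

# Hypothetical per-cluster settings; the real variables are not listed in this issue.
CLUSTER_ENV = {
    "cedar": {"EXAMPLE_CACHE_DIR": "/tmp/cedar_cache"},
    "frontier": {"EXAMPLE_CACHE_DIR": "/tmp/frontier_cache"},
}


def apply_cluster_env(cluster, environ=None):
    """Set the cluster's variables in `environ` (default: os.environ).

    Because the entry script runs on every node, calling this first
    gives each node the same settings, not just the main node.
    Values that are already set are left untouched.
    """
    env = os.environ if environ is None else environ
    for key, value in CLUSTER_ENV[cluster].items():
        env.setdefault(key, value)
    return env
```

Using `setdefault` keeps anything the launch script did export on the main node, so the two mechanisms do not fight each other; overwriting unconditionally would be the alternative if the script's values should always win.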
Once the above reorganization is done, split train_mem into a separate file for each cluster and put it in that cluster's folder.
Set up the code for individual clusters more cleanly.