CERC-AAI / Robin

Apache License 2.0

Train mem overhaul #23

Open daniel-z-kaplan opened 10 months ago

daniel-z-kaplan commented 10 months ago

Setup code for individual clusters more cleanly

Alexis-BX commented 10 months ago

Rework the scripts folder completely:
- Have folders for llava_v1, llava_v1.5, robin_v1, robin_v2, and evals.
- In robin_v2, have a folder for each cluster with install, pretrain, and finetune scripts (include cedar and frontier folders).
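Under that plan, the layout might look like the sketch below. The `.sh` script names and the exact file set per cluster are assumptions for illustration; only the folder names come from the comment above.

```
scripts/
├── llava_v1/
├── llava_v1.5/
├── robin_v1/
├── robin_v2/
│   ├── cedar/
│   │   ├── install.sh
│   │   ├── pretrain.sh
│   │   └── finetune.sh
│   └── frontier/
│       ├── install.sh
│       ├── pretrain.sh
│       └── finetune.sh
└── evals/
```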

Use of train_mem.py: when doing multinode training, environment variables are not properly set by the launch script (they are set on the main node but not on the others). Since train_mem.py is run on every node, it sets the variables properly on each one.
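A minimal sketch of that per-node fix: a helper that runs at the top of each node's entry point and fills in any distributed-training variables the launcher failed to propagate. The variable names follow the usual `torch.distributed` conventions; the default address and port values here are hypothetical placeholders, not values from the repo.

```python
import os

def ensure_dist_env(master_addr="10.0.0.1", master_port="29500"):
    """Set distributed-training env vars if the launcher did not propagate them.

    setdefault leaves values alone on nodes where the launcher DID set them,
    so this is safe to run unconditionally on every node.
    The defaults are illustrative placeholders.
    """
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)
    # Commonly needed on every node regardless of launcher behavior.
    os.environ.setdefault("OMP_NUM_THREADS", "1")
    return {k: os.environ[k]
            for k in ("MASTER_ADDR", "MASTER_PORT", "OMP_NUM_THREADS")}
```

Because `setdefault` is idempotent, calling this from a script that runs on all nodes gives every rank a consistent environment without clobbering the main node's values.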

Once the above reorganization is done: split train_mem.py into a separate file for each cluster and put it in that cluster's folder.
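One way the split could be wired up, as a hedged sketch: a small dispatcher that picks the cluster-specific train_mem module from the node's hostname. The module paths and cluster substrings below are illustrative assumptions, not names from the repo.

```python
import socket

# Hypothetical per-cluster entry points; module paths are illustrative.
CLUSTER_TRAINERS = {
    "cedar": "robin_v2.cedar.train_mem",
    "frontier": "robin_v2.frontier.train_mem",
}

def pick_trainer(hostname=None):
    """Choose the cluster-specific train_mem module for this node.

    Matches a known cluster name as a substring of the hostname,
    so the same launch command works on every cluster.
    """
    hostname = hostname or socket.gethostname()
    for cluster, module in CLUSTER_TRAINERS.items():
        if cluster in hostname:
            return module
    raise RuntimeError(f"unknown cluster for host {hostname!r}")
```

The selected module path could then be imported with `importlib.import_module` and its entry point called, keeping each cluster's quirks isolated in its own file.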