OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

Continue Pretraining Llama7B with Huggingface trainer #310

Closed · boss-chanon closed this issue 11 months ago

boss-chanon commented 11 months ago

Requirement details: https://docs.google.com/document/d/1DQ5a56gFv2ZzRKn9PliNlOc-d-CI5vGAUUsdPihX4OI/edit#heading=h.5bb1u3i0p4rh

Additional Steps:

  1. Understand the Lanta multinode training guidelines first: https://app.gitbook.com/o/ygzlt6vZbi4mM0I2X5ko/s/rXqu9ENRkozaiYy0LTZK/lanta/multinode-training
  2. Try to integrate multinode training with the Huggingface Trainer script
  3. Integrate our own Huggingface dataset (V5_555) into the training code (see the sketch after this list)
  4. Run all checklist items
  5. Notify @kwan, @boat, or @new before running full training
  6. Run training. Note: this step is blocked by the task "Integrate Data Pipeline into OSCAR Colossal"
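
For the dataset-integration step, a minimal sketch of loading a Huggingface dataset and packing it into fixed-length causal-LM blocks could look like the following; the dataset path, tokenizer checkpoint, and block size are placeholders, not the project's actual values:

```python
# Sketch only: load a Huggingface dataset and pack it into fixed-length
# blocks for causal-LM continued pretraining. DATASET_PATH, TOKENIZER_NAME,
# and BLOCK_SIZE are illustrative placeholders, not the project's values.
from datasets import load_dataset
from transformers import AutoTokenizer

DATASET_PATH = "path/to/v5_555"              # placeholder for the V5_555 dataset
TOKENIZER_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder Llama tokenizer
BLOCK_SIZE = 2048

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
raw = load_dataset(DATASET_PATH, split="train")

def tokenize(batch):
    # Assumes the dataset exposes a "text" column.
    return tokenizer(batch["text"])

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

def group_texts(examples):
    # Concatenate all token ids, then cut into BLOCK_SIZE chunks; labels equal
    # input_ids because causal-LM models shift them internally.
    concatenated = sum(examples["input_ids"], [])
    total = (len(concatenated) // BLOCK_SIZE) * BLOCK_SIZE
    blocks = [concatenated[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

lm_dataset = tokenized.map(group_texts, batched=True,
                           remove_columns=tokenized.column_names)
```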

Important Note: There is no need for a fancy config script or codebase; a single train.py (200-400 lines) is enough for this task. We recommend modifying the Stanford Alpaca training code from the multinode training guidelines; a rough skeleton of that single-file structure is sketched below.
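
A single-file train.py in the spirit of the Stanford Alpaca structure (argument dataclasses parsed with HfArgumentParser, plus the Huggingface Trainer) might look roughly like this; the argument classes, defaults, and paths are illustrative assumptions, not the project's actual values:

```python
# Sketch only: an Alpaca-style single-file train.py for continued pretraining.
# Argument classes, defaults, and paths are illustrative assumptions.
from dataclasses import dataclass, field

import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, LlamaForCausalLM, Trainer, default_data_collator

@dataclass
class ModelArguments:
    model_name_or_path: str = field(default="meta-llama/Llama-2-7b-hf")  # placeholder

@dataclass
class DataArguments:
    dataset_path: str = field(default="path/to/v5_555")  # placeholder
    block_size: int = field(default=2048)

def build_lm_dataset(tokenizer, data_args):
    # Same packing approach as the earlier sketch, wrapped in a helper.
    raw = load_dataset(data_args.dataset_path, split="train")
    tokenized = raw.map(lambda b: tokenizer(b["text"]), batched=True,
                        remove_columns=raw.column_names)

    def group(examples):
        ids = sum(examples["input_ids"], [])
        total = (len(ids) // data_args.block_size) * data_args.block_size
        blocks = [ids[i:i + data_args.block_size]
                  for i in range(0, total, data_args.block_size)]
        return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

    return tokenized.map(group, batched=True, remove_columns=tokenized.column_names)

def main():
    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, transformers.TrainingArguments)
    )
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
    model = LlamaForCausalLM.from_pretrained(model_args.model_name_or_path)
    train_dataset = build_lm_dataset(tokenizer, data_args)

    # Trainer reads the distributed environment set by the launcher,
    # so the same script serves single-GPU and multinode runs.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=default_data_collator,
    )
    trainer.train()
    trainer.save_state()
    trainer.save_model(output_dir=training_args.output_dir)

if __name__ == "__main__":
    main()
```

Such a script would be launched with torchrun (or whatever launcher the Lanta guidelines prescribe), passing the usual TrainingArguments flags such as --output_dir on the command line.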

Recommended people to do the task (pick 1):

  1. @New
  2. @Boss
  3. @Boat
  4. @Bank
  5. @Tae