EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

Migrate tensor parallelism code to use OSLO #578

Open · sdtblck opened this issue 2 years ago

sdtblck commented 2 years ago

Is your feature request related to a problem? Please describe.
It would be good to remove the Megatron tensor parallelism code from NeoX. OSLO already has support for tensor parallelism, with a slightly nicer interface.
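For context, the Megatron-style tensor parallelism being replaced shards each linear layer's weight across the tensor-parallel ranks. A minimal, illustrative sketch of a column-parallel layer follows; this is an assumption-laden sketch, not the NeoX or OSLO implementation, and process-group setup is omitted.

```python
# Minimal sketch of a Megatron-style column-parallel linear layer.
# Illustrative only; not the NeoX or OSLO implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Shards the output dimension of a linear layer across tensor-parallel ranks."""

    def __init__(self, in_features: int, out_features: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        world_size = dist.get_world_size(tp_group)
        assert out_features % world_size == 0, "out_features must divide evenly"
        # Each rank holds only its shard of the full weight matrix.
        self.weight = nn.Parameter(torch.empty(out_features // world_size, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank produces its slice of the output; a following row-parallel
        # layer (or an all-gather) recombines the shards.
        return F.linear(x, self.weight)
```

A matching row-parallel layer plus an all-reduce of its output completes the usual Megatron pattern.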

Describe the solution you'd like

Steps:

hyunwoongko commented 2 years ago

I will actively support this work.

hyunwoongko commented 2 years ago

The main problem is that currently the model is loaded on the CPU and then moved to the GPU. OSLO was originally designed for transformers, and there was no way to pass downloaded checkpoints directly to the GPU in transformers. (At least that was the case while I was developing it, so I didn't care about this.) But we need to implement something like deepspeed.zero.Init internally so that parameters are allocated on the GPU from the start. I will try this starting tomorrow.
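For reference, deepspeed.zero.Init is DeepSpeed's context manager for exactly this: parameters are partitioned across ranks as each submodule is constructed, so the full model is never materialized on the CPU. A minimal sketch of the pattern, with a placeholder model class rather than NeoX code:

```python
# Sketch of the deepspeed.zero.Init pattern mentioned above (ZeRO stage 3).
# `TinyModel` is a placeholder, not the NeoX or OSLO model class.
import deepspeed
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1024, 1024)

# Parameters are partitioned across ranks as each submodule is built,
# instead of the whole model being allocated on the CPU first.
with deepspeed.zero.Init():
    model = TinyModel()
```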

sdtblck commented 2 years ago

@hyunwoongko Actually, in NeoX we also load onto the CPU and then move to the GPU, so I'm not sure this is a problem.
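To make the two loading patterns being compared concrete, here is a rough sketch; the module is a placeholder, and the direct-on-device variant assumes the torch.device context manager available in newer PyTorch releases.

```python
# Illustrative comparison of the two loading patterns discussed above.
# `TinyModel` is a placeholder, not the actual NeoX model.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1024, 1024)

# Current NeoX-style pattern: construct on the CPU, then move to the GPU.
model = TinyModel().to(torch.cuda.current_device())

# Direct-on-GPU construction: newer PyTorch lets a device context manager
# place freshly created parameters on the GPU; ZeRO-style init goes further
# by also sharding them across ranks.
with torch.device("cuda"):
    model = TinyModel()
```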

StellaAthena commented 2 years ago

> The main problem is that currently the model is loaded on the CPU and then moved to the GPU. OSLO was originally designed for transformers, and there was no way to pass downloaded checkpoints directly to the GPU in transformers. (At least that was the case while I was developing it, so I didn't care about this.) But we need to implement something like deepspeed.zero.Init internally so that parameters are allocated on the GPU from the start. I will try this starting tomorrow.

This is actually something we have a work-around for. I don't know if Transformers ever got around to merging it, though.

hyunwoongko commented 2 years ago

@sdtblck Please check my branch: https://github.com/EleutherAI/gpt-neox/tree/kevin_new. I am restructuring our code based on plain PyTorch.

hyunwoongko commented 2 years ago

@sdtblck Did you check my branch?

Quentin-Anthony commented 1 year ago

@hyunwoongko -- Would you like to restart this effort?