facebookresearch / metaseq

Repo for external large-scale work
MIT License

Add OPT to huggingface conversion guidelines/scripts #98

Open · tbmihailov opened this issue 2 years ago

tbmihailov commented 2 years ago

🚀 Feature Request

Add documentation for converting fine-tuned OPT models to HuggingFace

Motivation

HuggingFace added OPT to their suite, which is a great win for the community! It would be very helpful to be able to easily convert OPT models fine-tuned inside metaseq into HF models so they can be used for inference with the same HF API.

stephenroller commented 2 years ago

cc @patrickvonplaten could y'all share the scripts used on your side to do the singleton->HF conversion?

patrickvonplaten commented 2 years ago

Yes definitely!

Assuming you have a fairseq singleton PyTorch file called restored.pt, you then need to define an OPTConfig from transformers. It should be as simple as running the following snippet:

from transformers import OPTConfig

# Fill these in from your metaseq/fairseq training config
# (e.g. decoder_layers, decoder_attention_heads, decoder_embed_dim).
num_layers = <fill-in>
num_heads = <fill-in>
d_model = <fill-in>

config = OPTConfig(hidden_size=d_model, num_attention_heads=num_heads, num_hidden_layers=num_layers, ffn_dim=4*d_model)
config.save_pretrained("./")  # <- this will create a `config.json` in your current folder
Once you have the config, you can run the following conversion script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/convert_opt_original_pytorch_checkpoint_to_pytorch.py

E.g.:

python convert_opt_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path <path/to/dump/hf/model> --hf_config config.json --fairseq_path <path/to/restored.pt>
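
To sanity-check the converted checkpoint afterwards, something along these lines should work (the facebook/opt-350m tokenizer and the prompt are only illustrative; OPT reuses the GPT-2 BPE vocabulary):

# Rough sanity check, not part of the conversion script itself: load the
# converted folder with transformers and generate a few tokens.
from transformers import GPT2Tokenizer, OPTForCausalLM

model = OPTForCausalLM.from_pretrained("<path/to/dump/hf/model>")
tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m")  # OPT uses the GPT-2 BPE

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
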
patrickvonplaten commented 2 years ago

Note that you can also find all the fairseq singleton files here: https://huggingface.co/models?other=opt_metasq

GabrielPereyra commented 2 years ago

Happy to add this documentation - let me know if there is any additional context I might need.

StellaAthena commented 2 years ago

Has the MetaSeq -> HuggingFace conversion been tested on the 175B parameter model? I would be interested in converting the 175B parameter model to HF if possible since my codebase is compatible with their API.

patrickvonplaten commented 2 years ago

Hey @StellaAthena - haven't looked into it yet! We've downloaded the weights and merged all the downloaded files into 8 shards. In my opinion, it should work to merge those 8 shards into one file with more or less the same script we used before, but we're not yet sure how we can distribute the model via the Hub from the legal side of things, so I thought it doesn't make too much sense to work on it yet.

But happy to look into it next week if it helps!

StellaAthena commented 2 years ago

@patrickvonplaten I had missed the fact that the OPT -> HF process is documented upthread. I anticipate being able to do that without much issue. With that and DeepSpeed I should be able to get the model running through HF, and I will let you know how it goes.

Something else to put on your radar is the lack of accelerate support for OPT. I assume I’ll have no issue using DeepSpeed, but it’s probably worth prioritizing accelerate given its more user-friendly API :)

patrickvonplaten commented 2 years ago

Thanks for the message @StellaAthena

Also cc @sgugger @LysandreJik

LysandreJik commented 2 years ago

Hey @StellaAthena, accelerate does have support for OPT! We mentioned it in this tweet, and we have this Colab which demonstrates how to run the 30B model in Colab Pro directly (please be aware it's super slow, since most of the model is offloaded to disk; it will be faster locally).

Or did you mean something else?

StellaAthena commented 2 years ago

@LysandreJik I see, I seem to have gotten my frameworks mixed up. OPT doesn't support parallelize but it does appear to support accelerate. When I try using it with parallelize I get AttributeError: 'OPTForCausalLM' object has no attribute 'parallelize'

sgugger commented 2 years ago

The parallelize API is going to be deprecated in the coming days. The way to parallelize the model is now:

model = AutoModelForXxx.from_pretrained(checkpoint, device_map="auto")

or passing an explicit device_map like in the Colab example (on the latest main branch of Transformers). It supports CUDA devices as well as offloading to CPU/disk if there is not enough GPU RAM.
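
For OPT concretely, a minimal sketch looks like the following (the checkpoint, dtype, and offload folder are illustrative; accelerate needs to be installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Spread OPT-30B across the visible GPUs, spilling leftover weights to CPU RAM
# and then to the "offload" folder on disk if they don't fit.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b",
    device_map="auto",
    torch_dtype="auto",
    offload_folder="offload",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))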

xhluca commented 2 years ago

Please correct me if I'm wrong, but isn't the parallelize() method in HuggingFace doing layer-wise parallelization? That's my understanding from reading the source code, though I might be misunderstanding: https://github.com/huggingface/transformers/blob/56b35ce3ebeb1edb53ef98b3ad3557f79ce788e2/src/transformers/models/t5/modeling_t5.py#L860-L878

If that's the case, only 1 GPU is being used at any given time (so the remaining ones are idling), based on this blog post:

the main deficiency and why this one is called “naive” MP, is that all but one GPU is idle at any given moment. So if 4 GPUs are used, it’s almost identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware. Plus there is the overhead of copying the data between devices. So 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, except the latter will complete the training faster, since it doesn’t have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states)

I believe this repo uses either fairscale or DeepSpeed, which would also perform tensor parallelism and thus utilize all GPUs (again, I'm inferring from the blog post).

So my question is: if I use AutoModelForXxx.from_pretrained(checkpoint, device_map="auto"), which kind of parallelism will be supported?

sgugger commented 2 years ago

The same naive model parallelism, and this is all for inference only, where the speed gain is going to be minimal.

For training we recommend the use of DeepSpeed.
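
For context, a rough sketch of what that looks like through the transformers Trainer is below. The checkpoint, toy dataset, and ds_config.json (a DeepSpeed ZeRO config you would write yourself) are placeholder assumptions, and a real run goes through the deepspeed launcher rather than plain python.

# Hypothetical sketch of DeepSpeed fine-tuning via the transformers Trainer.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# Toy dataset: for causal LM fine-tuning, the labels are just the input_ids.
enc = tokenizer(["Hello world.", "OPT is a decoder-only transformer."],
                padding=True, return_tensors="pt")
train_dataset = [{"input_ids": ids, "attention_mask": mask, "labels": ids}
                 for ids, mask in zip(enc["input_ids"], enc["attention_mask"])]

args = TrainingArguments(
    output_dir="opt-finetuned",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_config.json",  # e.g. a ZeRO stage-3 config you supply
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()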

xhluca commented 2 years ago

Thanks, that makes sense. I wonder if there's a way to push inference utilization above 13%, considering the cost of such hardware. Or maybe the only option is to wait for GPUs with 400GB of VRAM.

stephenroller commented 2 years ago

While this is outside the scope of the metaseq repo, I'll refer to the PyTorch pipeline parallelism docs for how to use micro-batching to improve the throughput and utilization of pipeline parallelism.

There's also an implementation in ParlAI that may be of interest to some, though it may be more distracting than useful; it predates the official PyTorch pipeline support: https://github.com/facebookresearch/ParlAI/blob/dfcfba0a77e9d96ddfd97209cd955b989ca9c010/parlai/utils/torch.py#L309-L321
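
For readers who want a concrete picture of the micro-batching idea, here is an illustrative sketch using torch.distributed.pipeline.sync.Pipe as it existed at the time; the two-stage toy model, device split, and chunk count are assumptions, not metaseq code, and it requires two CUDA devices.

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe uses the RPC framework internally, even within a single process.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Toy two-stage model: first stage on cuda:0, second on cuda:1.
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

# chunks=8 splits each batch into 8 micro-batches so both GPUs stay busy,
# instead of one idling while the other works (the "naive MP" problem above).
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)

x = torch.randn(64, 1024, device="cuda:0")
out = model(x).local_value()  # forward returns an RRef; fetch the local tensor
print(out.shape)  # torch.Size([64, 1024])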

xhluca commented 2 years ago

Just an update: I got the API up and running, and when I run nvidia-smi I see >70% utilization on all devices, which is pretty good. I'm not sure what is used behind the scenes (DeepSpeed maybe?), but it might be a good option for the HuggingFace implementation.

patrickvonplaten commented 2 years ago

Also see: https://github.com/facebookresearch/metaseq/pull/164