cc @patrickvonplaten could y'all share the scripts used on your side to do the singleton->HF conversion?
Yes definitely!
Assuming you have a fairseq singleton PyTorch file called `restored.pt`, you then need to define an `OPTConfig` class from `transformers`. It should be as simple as running the following:
```python
from transformers import OPTConfig

num_layers = <fill-in>
num_heads = <fill-in>
d_model = <fill-in>

config = OPTConfig(hidden_size=d_model, num_attention_heads=num_heads, num_hidden_layers=num_layers, ffn_dim=4*d_model)
config.save_pretrained("./")  # <- this will create a `config.json` in your current folder
```
Then you can run the conversion script, e.g.:

```bash
python convert_opt_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path <path/to/dump/hf/model> --hf_config config.json --fairseq_path <path/to/restored.pt>
```
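As a quick sanity check (just a sketch, not part of the conversion script itself), the dumped folder should load with the regular HF API. The tokenizer line below assumes the standard OPT GPT-2 BPE tokenizer from `facebook/opt-125m`, since the converted folder may not contain tokenizer files:

```python
from transformers import OPTForCausalLM, AutoTokenizer

# <path/to/dump/hf/model> is the folder written by the conversion command above
model = OPTForCausalLM.from_pretrained("<path/to/dump/hf/model>")
# OPT models share the GPT-2 BPE tokenizer, so reuse the one shipped with facebook/opt-125m
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```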
Note that you can also find all the fairseq singleton files here: https://huggingface.co/models?other=opt_metasq
Happy to add this documentation - let me know if there is any additional context I might need.
Has the MetaSeq -> HuggingFace conversion been tested on the 175B parameter model? I would be interested in converting the 175B parameter model to HF if possible since my codebase is compatible with their API.
Hey @StellaAthena - haven't looked into it yet! We've downloaded the weights and merged all the downloaded files into 8 shards. In my opinion, it should work to merge those 8 shards into one file with more or less the same script that we used before, but we're not yet sure how we can distribute the model via the Hub from the legal side of things, so I thought it doesn't make too much sense to work on it yet.
But happy to look into it next week if it helps!
@patrickvonplaten I had missed the fact that the OPT -> HF process is documented upthread. I anticipate being able to do that without much issue. With that and DeepSpeed I anticipate being able to get the model running through HF and will let you know how it goes.
Something else to put on your radar is the lack of `accelerate` support for OPT. I assume I’ll have no issue using DeepSpeed, but it’s probably worth prioritizing `accelerate` given the more user-friendly API :)
Thanks for the message @StellaAthena
Also cc @sgugger @LysandreJik
@LysandreJik I see, I seem to have gotten my frameworks mixed up. OPT doesn't support `parallelize` but it does appear to support `accelerate`. When I try using it with `parallelize` I get `AttributeError: 'OPTForCausalLM' object has no attribute 'parallelize'`
The `parallelize` API is going to be deprecated in the coming days. The way to parallelize the model is now:

```python
model = AutoModelForXxx.from_pretrained(checkpoint, device_map="auto")
```

or passing an explicit `device_map` like in the Colab example (on the latest main branch of Transformers). It supports CUDA devices as well as offloading to CPU/disk if there is not enough GPU RAM.
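For reference, here is a rough sketch of what an explicit `device_map` could look like; the module names are taken from the OPT implementation in Transformers, and `facebook/opt-1.3b` (24 decoder layers) is only used as an example, so adjust both to your own checkpoint:

```python
from transformers import AutoModelForCausalLM

# Illustrative two-GPU split: embeddings + first 12 decoder layers on GPU 0,
# remaining layers and the final layer norm on GPU 1.
device_map = {
    "model.decoder.embed_tokens": 0,
    "model.decoder.embed_positions": 0,
    # lm_head shares its weights with embed_tokens, so keep it on the same device
    "lm_head": 0,
    **{f"model.decoder.layers.{i}": 0 for i in range(12)},
    **{f"model.decoder.layers.{i}": 1 for i in range(12, 24)},
    "model.decoder.final_layer_norm": 1,
}

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", device_map=device_map)
```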
Please correct me if I'm wrong, but isn't the `parallelize()` method in huggingface doing a layer-wise parallelization? That's my understanding from reading the source code, though I might be misunderstanding: https://github.com/huggingface/transformers/blob/56b35ce3ebeb1edb53ef98b3ad3557f79ce788e2/src/transformers/models/t5/modeling_t5.py#L860-L878
If that's the case, there's only 1 GPU being used at any time (so the remaining ones are idling), based on this blog post:

> the main deficiency and why this one is called “naive” MP, is that all but one GPU is idle at any given moment. So if 4 GPUs are used, it’s almost identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware. Plus there is the overhead of copying the data between devices. So 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, except the latter will complete the training faster, since it doesn’t have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states)
I believe that this repo uses either fairscale or deepspeed which would also perform tensor parallelism, which utilizes all GPUs (again, I'm inferring from the blog post).
So my question is: if I use `AutoModelForXxx.from_pretrained(checkpoint, device_map="auto")`, which parallelism will be supported?
Same naive model parallelism, and this is all for inference only, where the speed gain is going to be minimal.
For training we recommend the use of DeepSpeed.
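For completeness, here is a bare-bones sketch of the DeepSpeed route via the Trainer integration (the ZeRO-3 config is only illustrative, the toy dataset stands in for your own tokenized data, and `facebook/opt-1.3b` is just an example checkpoint; launch it with the `deepspeed` launcher or `torchrun`):

```python
from torch.utils.data import Dataset
from transformers import AutoTokenizer, OPTForCausalLM, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """Tiny stand-in dataset so the sketch is self-contained; use your own tokenized data."""
    def __init__(self, tokenizer):
        self.input_ids = tokenizer("Hello world, this is a toy example.", return_tensors="pt")["input_ids"][0]

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {"input_ids": self.input_ids, "labels": self.input_ids.clone()}

# Minimal ZeRO-3 config; the "auto" values are filled in from TrainingArguments by the integration.
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 3},
}

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=False)
model = OPTForCausalLM.from_pretrained("facebook/opt-1.3b")

args = TrainingArguments(
    output_dir="opt-finetuned",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a ds_config.json
)

Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer)).train()
```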
Thanks, that makes sense. I wonder if there's a way to increase efficiency during inference above ~13% utilization, considering the cost of such hardware. Or maybe the only option is to wait for GPUs with 400GB of VRAM.
While outside the scope of the metaseq repo, I will refer you to the PyTorch pipeline parallelism docs on how to use minibatching to improve the throughput and utilization of pipeline parallelism.
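For illustration, here is a minimal sketch of the PyTorch `Pipe` API with micro-batching via `chunks`, using a toy two-stage model on two GPUs rather than OPT:

```python
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework to be initialized, even in a single process.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Toy two-stage pipeline: first stage on GPU 0, second stage on GPU 1.
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage2 = nn.Sequential(nn.Linear(1024, 1024)).to("cuda:1")

# `chunks` splits each batch into micro-batches so both GPUs can work
# concurrently instead of idling as in naive model parallelism.
pipe = Pipe(nn.Sequential(stage1, stage2), chunks=8)

x = torch.randn(64, 1024, device="cuda:0")
out = pipe(x).local_value()  # forward returns an RRef; fetch the tensor locally
print(out.shape, out.device)
```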
There's an implementation in ParlAI, which may be more distracting than useful, that may be of interest to some. It predates the official PyTorch pipeline support: https://github.com/facebookresearch/ParlAI/blob/dfcfba0a77e9d96ddfd97209cd955b989ca9c010/parlai/utils/torch.py#L309-L321
Just an update: I got the API up and running, and when I run `nvidia-smi` it shows >70% utilization on all devices, which is pretty good. I'm not sure what is used behind the scenes (DeepSpeed maybe?) but it might be a good option for the huggingface implementation.
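On the Hugging Face side, if you load through the `device_map` path discussed above, one way to see how the model was split across your GPUs is to inspect `hf_device_map`, which Transformers sets when a `device_map` is used at load time; a small sketch (`facebook/opt-1.3b` is only an example checkpoint):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", device_map="auto")

# Mapping of submodule names to the device accelerate placed them on
print(model.hf_device_map)
```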
🚀 Feature Request
Add documentation for converting fine-tuned OPT models to huggingface
Motivation
HuggingFace added OPT to their suite, which is a great win for the community! It would be very helpful to be able to easily convert OPT models fine-tuned inside metaseq to HF models so they can be used for inference with the same HF API.