sachit-menon opened this issue 2 years ago
I have not personally tried, but I believe you might be able to do it for inference, since the idling VRAM usage is <50GB as far as I remember. I don't think you need to make any changes to constants.py since it's 8 devices by default.
@sachit-menon @xhluca
Hi, we recently integrated OPT-175B with the Berkeley-backed system Alpa.
You can try Alpa, which allows you to train/serve big models like OPT-175B on more heterogeneous GPU setups, rather than requiring 8x 80GB A100s.
See this guide for more details!
Just as a side note, we've converted the 175B checkpoint internally and it runs very well with accelerate on 8 A100s (as fast as or faster than DeepSpeed for single-input queries).
You'll just need to follow the conversion scripts to get the opt-175B checkpoint into HF format :-)
@patrickvonplaten sorry if this is obvious, but where can I find those scripts? Also great to hear about accelerate, do you know where we can find example code?
I tried running the official tutorial with one of the smaller models and it seems that only a single GPU was used, so I'm not sure whether I did something wrong or whether there are some special steps needed for those large models.
Hey @xhluca,
Sorry to answer so late - we're working with the meta team to make the HF weights directly available with a short code snippet on how to run them :-)
I'll keep you updated here
Thanks!
We've also got model parallel 16 working now, which is useful for running across two nodes (with some perf penalty)
> Just as a side note, we've converted the 175B checkpoint internally and it runs very well with accelerate on 8 A100s (as fast as or faster than DeepSpeed for single-input queries).
8 A100s with 40GB or 80GB memory?
@samuelstevens Based on personal experiments, it takes ~50GB of VRAM per device, so it was likely done on 80GB A100s.
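As a rough sanity check on that number, here is a back-of-envelope estimate (assuming fp16 weights sharded evenly across 8 GPUs, ignoring activations and the KV cache):

```python
# Back-of-envelope: per-GPU weight memory for OPT-175B in fp16,
# sharded evenly across 8 devices (activations / KV cache not counted).
n_params = 175e9
bytes_per_param = 2  # fp16
n_gpus = 8

per_gpu_gib = n_params * bytes_per_param / n_gpus / 1024**3
print(f"~{per_gpu_gib:.0f} GiB of weights per GPU")  # ~41 GiB, roughly matching the ~50GB observed
```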
@patrickvonplaten @stephenroller any idea what kind of timeframe that might happen in? How involved is it to use the conversion script/what are the steps involved in that, if it would be faster to do it that way? Thanks for your efforts on this!
In metaseq we now have support for MP16, which lets it work on 16x 32GB GPUs across two nodes. @punitkoura also has a PR for converting the weights for use with HF accelerate, which supports much smaller machines via offloading (at a speed cost).
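For anyone wanting to try the offloading route once they have an HF-format checkpoint, a minimal sketch of what that looks like with transformers + accelerate (the checkpoint path below is a placeholder; the same pattern works with the smaller public checkpoints today):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: point this at a converted HF-format OPT checkpoint (or e.g. "facebook/opt-30b").
checkpoint = "path/to/opt-175b-hf"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map="auto",         # fill GPUs first, then CPU RAM
    offload_folder="offload",  # spill any remaining weights to disk (slow, but fits smaller machines)
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
output = model.generate(inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```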
@punitkoura would it be possible to have a brief example/explanation of how to use that with the OPT 175B model? Digging into these commits, I found https://github.com/facebookresearch/metaseq/blob/main/gpu_tests/test_hf_compatibility.py with associated test setup download_and_configure_125m_with_hf_dependencies in https://github.com/facebookresearch/metaseq/blob/main/.circleci/config.yml. However, I'm not sure if 175B has the dependencies or config files that 125M has available, so I don't know if following the exact same steps will work as-is.
Perhaps @patrickvonplaten has used this updated conversion script to put together that code snippet? 😄
> Hey @xhluca,
> Sorry to answer so late - we're working with the meta team to make the HF weights directly available with a short code snippet on how to run them :-)
@patrickvonplaten Since it might take some time to make it available, in the meantime is it possible to have a code snippet for the publicly available large models (30B and 66B)? This way, when 175B is made available, the only thing needed is to change a single line of code.
@patrickvonplaten Bump on @xhluca's comment above? Sorry to ping you again; it's been about a month and I have some experiments that I'd really like to run!
Hey!
The 30B and 66B models can very easily be run if you have enough GPU memory available. You can just follow the code snippets here: https://huggingface.co/facebook/opt-30b#how-to-use and https://huggingface.co/facebook/opt-66b#how-to-use
Note that this assumes you have an 80GB A100 available. If you instead have multiple smaller GPUs available, you can replace:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-66b", torch_dtype=torch.float16).cuda()
by:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-66b", torch_dtype=torch.float16, device_map="auto")
which will automatically place layers on the different devices in a smart way. Also see: https://huggingface.co/docs/transformers/v4.21.2/en/main_classes/model#large-model-loading
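Putting that together, a minimal multi-GPU inference sketch for the public 66B checkpoint (following the model-card snippet, with device_map="auto" swapped in as described above) would look roughly like:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-66b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-66b",
    torch_dtype=torch.float16,
    device_map="auto",  # shard layers across all visible GPUs
)

prompt = "Hello, I am conscious and"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(0)

generated = model.generate(input_ids, do_sample=True, max_new_tokens=30)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```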
Is an easy integration with 175B still planned or shelved for now?
@stephenroller - could we provide you with the HF 175B checkpoint and you provide it somehow on your website (upon request)?
Yes Patrick, that would significantly help unblock this.
Awesome - send you an email :-)
❓ Questions and Help
What is your question?
How can I get the 175B model running for inference on a hardware setup as described below? Is it possible on one node with 8 A6000s with 51GB each, perhaps with DeepSpeed or similar? I know there are multiple other similar issues, but I'm wondering if the requirements can be somewhat relaxed for inference only (and my hardware setup is a bit different), so I thought I'd throw my question into the ring :).
What's your environment?
How you installed metaseq (pip, source): per instructions in https://github.com/facebookresearch/metaseq/blob/main/docs/setup.md