This is still WIP @stas00
@stas00, I believe the code is in good shape. You can start reviewing. Also, I'll start updating the README for instructions tomorrow
Oh wow, you developed a whole package around this. Impressive work, @mayank31398
Let's just discuss the user-facing APIs before implementing to minimize wasting time.
Originally, these were just scripts to benchmark and demo how the various solutions can be used, but now I'm not so sure how demo-friendly these are with so many separate files.
Some design questions:
- should these scripts also take the nvme path as well and flags for cpu/nvme offload? I just had those hardcoded - but there are 3 ways to invoke this one: gpu / cpu-offload / nvme-offload
- cli.py isn't debug friendly - can we activate it only with an explicit flag, but by default just use pre-made inputs - or perhaps 2 separate scripts? I'm not even sure who would ever want to manually type text - perhaps having a flag to feed it a file with prompts instead? not sure.
- cli.py is very ambiguous since it tells me nothing of what it does. Please see my suggestions at https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328#pullrequestreview-1070168477, but I suppose that the way it's going it's no longer bloom-specific either, and can probably work with any other model.
I'm also no longer sure what to do with this package. Originally the plan was to stash the original demo scripts under transformers' research files, but now it's growing towards an independent package. Not a bad thing, I'm just thinking aloud.
Let me know your thoughts on this. Thanks for the review :)
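For context on the gpu / cpu-offload / nvme-offload modes mentioned in the design questions above: with DeepSpeed ZeRO-Inference they mostly come down to the offload_param section of the DeepSpeed config. A rough sketch of the relevant fragments, with placeholder paths, and not the actual scripts' config:

ds_config_gpu = {"zero_optimization": {"stage": 3}}

ds_config_cpu_offload = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}

ds_config_nvme_offload = {
    "zero_optimization": {
        "stage": 3,
        # nvme_path is a placeholder for a fast local SSD mount
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    }
}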
Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM. I can open up another PR to support other models. Adding so much functionality in a single PR might not be the best thing to do.
- I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I'm still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII also lets me avoid having to broadcast the input. Also, for code modification (if required down the line), this might be simpler.
so you are not benchmarking the same thing then.
you don't need to broadcast anything for non-server usage. It sounds like you built cli.py
as a client to the server, but that's too complicated for someone who just wants to process a few inputs.
The intention of the cli was the way it was in the original code - tokenize + generate + detokenize - done.
Perhaps we have miscommunicated here.
2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking though.
I answered earlier in the code comment.
3. I think adding an --input_file flag to the cli script might be useful. If not provided, we can launch in interactive mode. I will try to make this debug friendly. Currently, it crashes.
I meant the script should work out of the box and require no user input. I think we have a conflict of use cases here: I wanted an easy-to-understand-and-replicate demo, you want a standalone system.
Perhaps the right solution is to just have the original scripts as they are for the demo/benchmarking purpose and then your package that is for the production use. What do you think? There will be a bit of duplication, but as you're saying you don't care for solutions that are not fast enough, which is fair for your needs.
What crashes?
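On the --input_file idea in point 3 above, a minimal sketch of how such a flag could fall back to interactive mode (hypothetical flag name and structure, not the PR's cli.py):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", type=str, default=None,
                    help="file with one prompt per line; if omitted, run interactively")
args = parser.parse_args()

if args.input_file is not None:
    with open(args.input_file) as f:
        prompts = [line.strip() for line in f if line.strip()]
else:
    # interactive mode: ask the user for a single prompt
    prompts = [input("Input text: ")]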
4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via command line for now).
5. I was having trouble with the naming convention (because of hyphens in the name, I couldn't import packages). We can rename cli to interactive or something.
for end user scripts there should be no need to import them.
Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM. I can open up another PR to support other models. Adding so much functionality in a single PR might not be the best thing to do.
agreed, let's stick to bloom for now. And that's why I suggested that the naming of the scripts should indicate that it's bloom - your version doesn't.
you don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.
cli works this way. The user only needs to provide input_text. Also, I need to call the generate method on all 8 GPU processes (in non-server usage), and since the input text is only requested on 1 process, you need to broadcast it, as far as I understand. Yes, cli for ds-inference is built as a client to the server. I can change it though if we want. In the original ds-inference script @stas00, you didn't need to broadcast since each process was creating the list.
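For reference, the broadcast described above is the usual torch.distributed pattern for getting a prompt read on rank 0 to all ranks before calling generate. A minimal sketch (assumes the process group is already initialized; not the PR's actual code):

import torch.distributed as dist

def read_and_broadcast_prompt() -> str:
    # only rank 0 talks to the user, but every rank must reach the broadcast
    obj = [input("Input text: ")] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]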
I will incorporate your suggestions.
So, just to summarize for cli.py: You want the option for user to provide an input text file?
Also, @stas00, I am fine with both. If we want, we can keep this branch as a separate branch for users who want to leverage a standalone system, and maybe just add an instruction to the README in the main branch to use this branch for deployment purposes. What do you think?
A branch, no, as branches are hard to keep in sync - and especially since we will move it out of Meg-DS anyway once you're done. I propose we create 2 sub-folders:
bloom-inference-scripts
bloom-inference-server
I think that would allow meeting different use cases w/o getting in the way of each other.
I know I was the one who proposed to refactor to support your server work, but now I can see that keeping both versions is probably the simplest design choice.
So, just to summarize for cli.py: You want the option for user to provide an input text file?
If we keep the original scripts as they are, I think we can keep your version as it is for now w/o any cmd args changes. If someone asks for it, it can be added later.
Thanks for the comments Stas. Will incorporate the changes
I have added back the older scripts in scripts/inference folder (original path). Will update README.md
I have added back the older scripts in scripts/inference folder (original path).
and let's move the 2 solutions to their respective sub-dirs:
bloom-inference-scripts
bloom-inference-server
we don't want them to be together, as it'd be difficult to understand quickly.
thank you, @mayank31398
I assume these 2 folders should be located inside scripts/ ? or in the root folder?
I assume these 2 folders should be located inside scripts/
yes, please.
Done. Added README too @stas00
I want to remove the temporary JSON created while caching. However, the new commit on the DeepSpeed master branch is not working for me. https://github.com/microsoft/DeepSpeed/pull/2132#issuecomment-1221592051 Waiting for a fix @RezaYazdaniAminabadi
Looks great, @mayank31398 - thank you!
I need to think about how I could test the functionality. It might work from a local client on the same host (as the node I have access to has no outside access).
I see @stas00. I have a script that does that for you. Are you saying the node you have can only be accessed from a login node or something? If that is how your setup is, I have a similar setup. Our compute nodes have no outside access, only the login node. If I am misunderstanding the problem, ignore this :trollface: I use the following script to test from my local machine:
from sshtunnel import SSHTunnelForwarder
import requests
import paramiko

class BloomServer:
    def __init__(self, bloom_ip="<IP HERE>", bloom_port=<PORT HERE>, relative_path="/generate/",
                 private_key_path="<PATH TO ID_RSA>"):
        pkey = paramiko.RSAKey.from_private_key_file(private_key_path)

        # login node reachable from the outside; the compute node is only reachable from it
        self.remote_host = "<SSH IP LOGIN NODE>"
        self.remote_user = "<USERNAME>"
        self.remote_port = <PORT FOR LOGIN NODE>

        self.bloom_ip = bloom_ip
        self.bloom_port = bloom_port

        # forward a local port through the login node to the BLOOM server
        self.server = SSHTunnelForwarder(
            ssh_address_or_host=self.remote_host,
            ssh_username=self.remote_user,
            ssh_private_key=pkey,
            remote_bind_address=(self.bloom_ip, self.bloom_port),
        )
        self.server.skip_tunnel_checkup = False
        self.server.start()
        self.server.check_tunnels()
        print(self.server.tunnel_is_up, flush=True)

        server_local_url = "127.0.0.1:" + str(self.server.local_bind_address[1])
        self.local_url = "http://" + server_local_url + relative_path

    def get_response(self, prompt):
        min_length = 1
        top_k = 5
        top_p = 1
        temperature = 1
        max_new_tokens = 10

        request_body = {
            "text": prompt,
            "min_length": min_length,
            "top_p": top_p,
            "top_k": top_k,
            "temperature": temperature,
            "max_new_tokens": max_new_tokens
        }

        try:
            response = requests.post(self.local_url, json=request_body, verify=False)
            return response.json(), "OK"
        except Exception:
            return {}, "failed"

if __name__ == "__main__":
    server = BloomServer()
    prompt = "I need some help with the"
    output = server.get_response(prompt)
    print("Prompt:", prompt)
    print("Output", output)
Also, switching from Flask to Starlette + FastAPI fixed the memory leak.
so is there a memory leak still or not? You shared there still was but then removed the comment.
If there is, we should try to isolate it - by first identifying whether it's in accelerate or the transformers modeling code by simply repeating the generate call.
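One way to do that isolation, repeating generate on a fixed input and watching allocated GPU memory, could look roughly like this (a sketch using a small model for illustration, not a claim about where the leak is):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-560m"  # small model, just for the leak check
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
for i in range(10):
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100)
    # allocated memory should stay roughly flat across iterations if nothing leaks
    print(i, torch.cuda.memory_allocated() / 2**20, "MiB")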
So, I don't really think it's a memory leak @stas00: https://github.com/huggingface/accelerate/issues/614 I want to look more into this.
This is what I plan to do: https://github.com/huggingface/accelerate/issues/614#issuecomment-1224285433 It should give us an idea of whether things are good to go or not.
For now, I will investigate this further. Let's not merge this PR yet :)
It looks like the transformers internal APIs have changed and the old hack I used for get_checkpoint_files no longer works.
Here is the new drop-in replacement, and the good thing about it - it's no longer a hack and should continue working:
from pathlib import Path
from transformers.utils import is_offline_mode
from huggingface_hub import snapshot_download

def get_checkpoint_files(model_name_or_path, revision=None):
    # checks if online or not
    if is_offline_mode():
        print("Offline mode: forcing local_files_only=True")
        local_files_only = True
    else:
        local_files_only = False

    # loads files from hub
    cached_repo_dir = snapshot_download(model_name_or_path, allow_patterns=["*"], local_files_only=local_files_only, revision=revision)

    # creates a list of paths from all downloaded files in cache dir, matching the glob *.pt
    file_list = [str(entry) for entry in Path(cached_repo_dir).rglob('*.pt') if entry.is_file()]
    return file_list
you should be able to use it if you update your transformers clone to the latest.
credits: @philschmid
p.s. we should probably add a minimal version requirement for huggingface_hub>=0.9.0 when this is ready to merge. It's already released so you can just add it now as well.
from transformers.utils.versions import require_version
require_version("huggingface_hub>=0.9.0")
p.s. note to myself: it doesn't seem to currently respect the TRANSFORMERS_CACHE env var, as it's internal to transformers - need to figure it out.
when I run:
python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
the first input went ok, but when it returned it messed things up:
Loading model...
Model loaded
Input text: whoah!
change generate_kwargs? [y/n] y
Generate kwargs: {"min_length": 100, "max_new_tokens": 100, "do_sample": false}
Output text: whoah!"
"What's the matter?" asked the other.
"Matter?" repeated the first. "Why, I thought I heard a shot."
"So did I," said the other. "But it was only a gun."
"Only a gun!" exclaimed the first. "Then it must have been a gun."
"It was," said the other. "But it was only a gun."
"Only a gun!" repeated
Generated tokens: 100
Input text: change generate_kwargs? [y/n]
as you can see the last line it mixed 2 inputs together
I think it's because I was impatient and pressed Enter at some point.
but it shouldn't allow an empty prompt, should it?
Perhaps it should consume all inputs before generating a new prompt, so that only the response to the prompt is used?
Also it'd be nice if cli.py exited cleanly on Ctrl-C instead of continuing to run and requiring multiple breaks (see the sketch after the traceback below):
Input text: change generate_kwargs? [y/n] ^CTraceback (most recent call last):
File "scripts/bloom-inference-server/cli.py", line 80, in <module>
main()
File "scripts/bloom-inference-server/cli.py", line 61, in main
if (input("change generate_kwargs? [y/n] ") == "y"):
KeyboardInterrupt
^C
^C
^C
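A minimal pattern for the clean exit (and for rejecting empty prompts), illustrative only and not the PR's cli.py:

import sys

while True:
    try:
        text = input("Input text: ")
    except (KeyboardInterrupt, EOFError):
        print()  # move past the ^C before exiting
        sys.exit(0)
    if not text.strip():
        continue  # skip empty prompts instead of generating from them
    # ... tokenize, generate, print the output ...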
The DS-inference server crashes for me w/o the cached tp checkpoint
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[...]
Traceback (most recent call last):
File "scripts/bloom-inference-server/benchmark.py", line 188, in <module>
main()
File "scripts/bloom-inference-server/benchmark.py", line 179, in main
benchmark_end_to_end(args, DSInferenceModel)
File "scripts/bloom-inference-server/benchmark.py", line 75, in benchmark_end_to_end
model, initialization_time = run_and_log_time(
File "scripts/bloom-inference-server/benchmark.py", line 37, in run_and_log_time
results = execs()
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/utils/utils.py", line 30, in __call__
return self.func(**self.kwargs)
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/model.py", line 57, in __init__
self.model = deepspeed.init_inference(
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/DeepSpeed/deepspeed/__init__.py", line 293, in init_inference
engine = InferenceEngine(model,
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/DeepSpeed/deepspeed/inference/engine.py", line 154, in __init__
self.module.to(device)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
The DS-inference server crashes for me w/o the cached tp checkpoint
Already on it: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328#issuecomment-1221609552 This is a bug in deepspeed: https://github.com/microsoft/DeepSpeed/pull/2132#issuecomment-1221592051
@RezaYazdaniAminabadi @jeffra
I feel like an empty prompt is fine. This is basically unconditional generation, right? I will try to fix the keyboard interrupt.
I feel like an empty prompt is fine. This is basically unconditional generation, right?
Good question - I have never tried it - I guess it should work.
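For what it's worth, an empty prompt with transformers usually amounts to conditioning on just the BOS token, roughly like this (a sketch with a small model, not the PR's code):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# "empty prompt": start generation from the BOS token only
input_ids = torch.tensor([[tokenizer.bos_token_id]])
output = model.generate(input_ids, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))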
Also, @stas00 I have been meaning to ask: is accelerate using DeepSpeed ZeRO as its backend? Because if that is the case, the generation time per batch for both of them differs greatly, right?
Is accelerate using DeepSpeed ZeRO as its backend?
It can but it's not using it in our scripts here. It uses a naive pipeline parallelism, assigning different layers to different devices and switching between those.
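Concretely, that accelerate path is the device_map flow of from_pretrained, roughly like this (a sketch, assuming enough combined GPU memory; not the exact script):

import torch
from transformers import AutoModelForCausalLM

# accelerate assigns different layers to different visible GPUs and moves the
# activations between them as generation proceeds - no ZeRO involved
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)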
@stas00, while working on this, I found a really interesting bug. Could be a severe one. I have provided a minimal working example.
from huggingface_hub import snapshot_download
- I am still using the old method (it still works with >=0.9.0) if the user inputs the model_name as bigscience/bloom in ds-inference. I am doing this because snapshot_download is not respecting TRANSFORMERS_CACHE and I don't want it to download duplicate checkpoints (for bigscience/bloom). However, this is slow (it reshards the 72 files).
- For the newer weights provided by Microsoft, I am using snapshot_download. Also, I am adding support for int8 in this PR. This path, however, is much faster.
I feel like this is the best approach for now. If we want to change this we can do that in a separate PR in the future.
@stas00
because snapshot_download is not respecting the TRANSFORMERS_CACHE
If you use HUGGINGFACE_HUB_CACHE instead of TRANSFORMERS_CACHE it would work for either.
Alternatively you can do:
HUGGINGFACE_HUB_CACHE = os.getenv("HUGGINGFACE_HUB_CACHE", None)
cache_dir = os.getenv("TRANSFORMERS_CACHE", HUGGINGFACE_HUB_CACHE)
snapshot_download(..., cache_dir=cache_dir)
If neither is set, None would lead to the default path.
Actually, the first line isn't needed, as it's already the default: https://github.com/huggingface/huggingface_hub/blob/7aadc3243d9478a48e17f568d8d4950cafd11ca2/src/huggingface_hub/_snapshot_download.py#L103-L104
so just:
cache_dir = os.getenv("TRANSFORMERS_CACHE", None)
snapshot_download(..., cache_dir=cache_dir)
DeepSpeed-MII not working with new Microsoft weights: https://github.com/microsoft/DeepSpeed-MII/issues/50
@stas00 https://huggingface.co/bigscience/bloom/tree/main Some changes were pushed to bigscience/bloom just 3 hours ago. snapshot_download fails as a result of this change.
File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/ds_zero/model.py", line 17, in __init__
return snapshot_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
downloaded_model_path = get_downloaded_model_path(args.model_name)
File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/utils/model.py", line 105, in get_downloaded_model_path
return f(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
return snapshot_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
_ = hf_hub_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
return f(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
_ = hf_hub_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
_create_relative_symlink(blob_path, pointer_path)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
_create_relative_symlink(blob_path, pointer_path)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
os.symlink(relative_src, dst)
FileExistsError: [Errno 17] File exists: '../../../../blobs/c4072f2f02baee6f0d50739cf14439f61e0605913ed65606f9bcb11043838a2e' -> '/home/gpttest/nfs/mayank/HF_cache/models--bigscience--bloom/snapshots/874fa447d586d0a28ab66aa3ca49da43ae1ec0f6/tensorboard/main/events.out.tfevents.1649441015.jean-zay-iam50.156467.0'
Any idea how to get around this? Also, if this is fixed, would snapshot_download download all the checkpoints again? That would be highly impractical, right?
Please let me know. I am not sure of a good solution. Also, clearing the cache for every commit might not be practical.
it looks like only README.md was updated, so it's weird that it'd fail.
I just run:
python -c "from huggingfhub import snapshot_download; snapshot_download('bigscience/bloom')"
and it updated w/o any issues.
Probably move that file it complains about for now? Could be some bug in huggingface_hub?
it shouldn't redownload anything but the new or updated files.
@mayank31398, please also note the renames in this commit https://github.com/microsoft/DeepSpeed/pull/2217/files/c77f5e0e385db405475b2842a6f5725c0f551170..f3f4b1dda9f9ba09042b8b63bbee475c9364bc56 in particular s/mp_size/tp_size/
this is now merged so we should update our side - you can just do a global search and replace on all of the files under scripts - or if you'd prefer I do that please let me know - don't want to step on your toes if you're still on the old DS branch.
Let me try updating to the latest HF hub and deepspeed to see if that fixes things. Nvm, I am on the latest version. I agree this should work. @stas00 It's working fine for GPT-2. No idea what is going wrong for BLOOM though. For now, I'll just remove this file.
Nevermind, I have fixed this. It was a bug because all 8 processes in my DeepSpeed code were trying to download simultaneously.
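The usual fix for that race is to let a single rank download while the others wait at a barrier; a rough sketch (assumes torch.distributed is initialized, and is not necessarily how the PR fixes it):

import torch.distributed as dist
from huggingface_hub import snapshot_download

def download_once(model_name: str) -> str:
    if dist.get_rank() == 0:
        path = snapshot_download(model_name)
    dist.barrier()  # everyone waits until rank 0 has finished downloading
    if dist.get_rank() != 0:
        path = snapshot_download(model_name)  # now a cache hit, no concurrent writes
    return path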
Oddly enough, it doesn't look like mp_size was renamed to tp_size, so I only had to modify this line with the updated checkpoints:
checkpoints_json = os.path.join(repo_root, "ds_inference_config.json")
@mayank31398, would it be OK if we merged this PR - and then perhaps have another iteration of tweaks in subsequent PRs if needed? As we are renaming files I want the rename to be merged first before tweaking those.
You have the commit rights, so you can do the honours if you're in agreement.
Thank you!
Yes @stas00 , currently, MII is not working with DS-inference. But we can merge and I can continue to work on this.
Also, @stas00 there are some pieces of code that both the server folder and the scripts folder can share. I will work on factoring out the redundant code once all the server stuff is fixed :)
perfect, then you can start a new PR once you have some new edits. We can just do it in iterations.
I need to edit the original scripts to support the new repos. Will do in another PR.
Currently, the scripts are working correctly. However, there is some memory leak. In https://github.com/huggingface/accelerate/issues/614 @sgugger says that it's not in accelerate, which is probably true.
My hunch is that it's a related bug: https://github.com/apache/incubator-mxnet/issues/19159