This is still WIP @stas00
@stas00, I believe the code is in good shape. You can start reviewing. Also, I'll start updating the README for instructions tomorrow
Oh wow, you developed a whole package around this. Impressive work, @mayank31398
Let's just discuss the user-facing APIs before implementing to minimize wasting time.
Originally, these were just scripts to benchmark and demo how the various solutions can be used, but now I'm not so sure how demo-friendly these are with so many separate files.
Some design questions:
- should these scripts also take the nvme path as well and flags for cpu/nvme offload? I just had those hardcoded - but there are 3 ways to invoke this one: gpu / cpu-offload / nvme-offload
- cli.py isn't debug friendly - can we activate it only with an explicit flag, but by default just use pre-made inputs - or perhaps 2 separate scripts? I'm not even sure who would ever want to manually type text - perhaps having a flag to feed it a file with prompts instead? not sure.
- cli.py is very ambiguous since it tells me nothing of what it does. Please see my suggestions at https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328#pullrequestreview-1070168477, but I suppose that the way it's going it's no longer bloom-specific either, and can probably work with any other model.
I'm also no longer sure what to do with this package. Originally the plan was to stash the original demo scripts under transformers' research files, but now it's growing towards an independent package. Not a bad thing, I'm just thinking aloud.
Let me know your thoughts on this. Thanks for the review :)
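For context on the gpu / cpu-offload / nvme-offload modes mentioned in the design questions above: with DeepSpeed ZeRO-Inference they mostly come down to the offload_param section of the DeepSpeed config. A rough sketch of the relevant fragments, with placeholder paths, and not the actual scripts' config:

ds_config_gpu = {"zero_optimization": {"stage": 3}}

ds_config_cpu_offload = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}

ds_config_nvme_offload = {
    "zero_optimization": {
        "stage": 3,
        # nvme_path is a placeholder for a fast local SSD mount
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    }
}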
Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM. I can open up another PR to support other models. Adding so much functionality in a single PR might not be the best thing to do.
- I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I'm still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII also lets me avoid having to broadcast the input. Also, for code modification (if required down the line), this might be simpler.
so you are not benchmarking the same thing then.
you don't need to broadcast anything for non-server usage. It sounds like you built cli.py
as a client to the server, but that's too complicated for someone who just wants to process a few inputs.
The intention of the cli was the way it was in the original code - tokenize + generate + detokenize - done.
Perhaps we have miscommunicated here.
2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking though.
I answered earlier in the code comment.
3. I think adding an --input_file flag to the cli script might be useful. If not provided, we can launch in interactive mode. I will try to make this debug friendly. Currently, it crashes.
I meant the script should work out of the box and require no user input. I think we have a conflict of use cases here: I wanted an easy-to-understand-and-replicate demo, you want a standalone system.
Perhaps the right solution is to just have the original scripts as they are for the demo/benchmarking purpose and then your package that is for the production use. What do you think? There will be a bit of duplication, but as you're saying you don't care for solutions that are not fast enough, which is fair for your needs.
What crashes?
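On the --input_file idea in point 3 above, a minimal sketch of how such a flag could fall back to interactive mode (hypothetical flag name and structure, not the PR's cli.py):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", type=str, default=None,
                    help="file with one prompt per line; if omitted, run interactively")
args = parser.parse_args()

if args.input_file is not None:
    with open(args.input_file) as f:
        prompts = [line.strip() for line in f if line.strip()]
else:
    # interactive mode: ask the user for a single prompt
    prompts = [input("Input text: ")]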
4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via command line for now).
5. I was having trouble with the naming convention (because of hyphens in the name, I couldn't import packages). We can rename cli to interactive or something.
for end user scripts there should be no need to import them.
Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM. I can open up another PR to support other models. Adding so much functionality in a single PR might not be the best thing to do.
agreed, let's stick to bloom for now. And that's why I suggested that the naming of the scripts should indicate that it's bloom - your version doesn't.
you don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.
cli works this way. The user only needs to provide input_text. Also, I need to call the generate method on all 8 GPU processes (in non-server usage), and since the input text is only requested on 1 process, you need to broadcast it, as far as I understand. Yes, cli for ds-inference is built as a client to the server. I can change it though if we want. In the original ds-inference script @stas00, you didn't need to broadcast since each process was creating the list.
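For reference, the broadcast described above is the usual torch.distributed pattern for getting a prompt read on rank 0 to all ranks before calling generate. A minimal sketch (assumes the process group is already initialized; not the PR's actual code):

import torch.distributed as dist

def read_and_broadcast_prompt() -> str:
    # only rank 0 talks to the user, but every rank must reach the broadcast
    obj = [input("Input text: ")] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]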
I will incorporate your suggestions.
So, just to summarize for cli.py: You want the option for user to provide an input text file?
Also, @stas00, I am fine with both. If we want, we can keep this branch as a separate branch for users who want to leverage a standalone system, and maybe just add an instruction to the README in the main branch to use this branch for deployment purposes. What do you think?
A branch, no, as branches are hard to keep in sync - and especially since we will move it out of Meg-DS anyway once you're done. I propose we create 2 sub-folders:
bloom-inference-scripts
bloom-inference-server
I think that would allow meeting different use cases w/o getting in the way of each other.
I know I was the one who proposed to refactor to support your server work, but now I can see that keeping both versions is probably the simplest design choice.
So, just to summarize for cli.py: You want the option for user to provide an input text file?
If we keep the original scripts as they are, I think we can keep your version as it is for now w/o any cmd args changes. If someone asks for it, it can be added later.
Thanks for the comments Stas. Will incorporate the changes
I have added back the older scripts in scripts/inference folder (original path). Will update README.md
I have added back the older scripts in scripts/inference folder (original path).
and let's move the 2 solutions to their respective sub-dirs:
bloom-inference-scripts
bloom-inference-server
we don't want them to be together, as it'd be difficult to understand quickly.
thank you, @mayank31398
I assume these 2 folders should be located inside scripts/ ? or in the root folder?
I assume these 2 folders should be located inside scripts/
yes, please.
Done. Added README too @stas00
I want to remove the temporary JSON created while caching. However, the new commit on the DeepSpeed master branch is not working for me. https://github.com/microsoft/DeepSpeed/pull/2132#issuecomment-1221592051 Waiting for a fix @RezaYazdaniAminabadi
Looks great, @mayank31398 - thank you!
I need to think about how I could test the functionality. It might work from a local client on the same host (as the node I have access to has no outside access).
I see @stas00. I have a script that does that for you. Are you saying the node you have can only be accessed from a login node or something? If that is how your setup is, I have a similar setup. Our compute nodes have no outside access, only the login node. If I am misunderstanding the problem, ignore this :trollface: I use the following script to test from my local machine:
from sshtunnel import SSHTunnelForwarder
import requests
import paramiko

class BloomServer:
    def __init__(self, bloom_ip="<IP HERE>", bloom_port=<PORT HERE>, relative_path="/generate/",
                 private_key_path="<PATH TO ID_RSA>"):
        pkey = paramiko.RSAKey.from_private_key_file(private_key_path)

        # login node reachable from the outside; the compute node is only reachable from it
        self.remote_host = "<SSH IP LOGIN NODE>"
        self.remote_user = "<USERNAME>"
        self.remote_port = <PORT FOR LOGIN NODE>

        self.bloom_ip = bloom_ip
        self.bloom_port = bloom_port

        # forward a local port through the login node to the BLOOM server
        self.server = SSHTunnelForwarder(
            ssh_address_or_host=self.remote_host,
            ssh_username=self.remote_user,
            ssh_private_key=pkey,
            remote_bind_address=(self.bloom_ip, self.bloom_port),
        )
        self.server.skip_tunnel_checkup = False
        self.server.start()
        self.server.check_tunnels()
        print(self.server.tunnel_is_up, flush=True)

        server_local_url = "127.0.0.1:" + str(self.server.local_bind_address[1])
        self.local_url = "http://" + server_local_url + relative_path

    def get_response(self, prompt):
        min_length = 1
        top_k = 5
        top_p = 1
        temperature = 1
        max_new_tokens = 10

        request_body = {
            "text": prompt,
            "min_length": min_length,
            "top_p": top_p,
            "top_k": top_k,
            "temperature": temperature,
            "max_new_tokens": max_new_tokens
        }

        try:
            response = requests.post(self.local_url, json=request_body, verify=False)
            return response.json(), "OK"
        except Exception:
            return {}, "failed"

if __name__ == "__main__":
    server = BloomServer()
    prompt = "I need some help with the"
    output = server.get_response(prompt)
    print("Prompt:", prompt)
    print("Output", output)
Also, switching from Flask to Starlette + FastAPI fixed the memory leak.
so is there a memory leak still or not? You shared there still was but then removed the comment.
If there is, we should try to isolate it - by first identifying whether it's in accelerate or the transformers modeling code by simply repeating the generate call.
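One way to do that isolation, repeating generate on a fixed input and watching allocated GPU memory, could look roughly like this (a sketch using a small model for illustration, not a claim about where the leak is):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-560m"  # small model, just for the leak check
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
for i in range(10):
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100)
    # allocated memory should stay roughly flat across iterations if nothing leaks
    print(i, torch.cuda.memory_allocated() / 2**20, "MiB")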
So, I don't really think it's a memory leak @stas00: https://github.com/huggingface/accelerate/issues/614 I want to look more into this.
This is what I plan to do: https://github.com/huggingface/accelerate/issues/614#issuecomment-1224285433 It should give us an idea of whether things are good to go or not.
For now, I will investigate this further. Let's not merge this PR yet :)
It looks like the transformers internal APIs have changed and the old hack I used for get_checkpoint_files no longer works.
Here is the new drop-in replacement, and the good thing about it - it's no longer a hack and should continue working:
from pathlib import Path
from transformers.utils import is_offline_mode
from huggingface_hub import snapshot_download

def get_checkpoint_files(model_name_or_path, revision=None):
    # checks if online or not
    if is_offline_mode():
        print("Offline mode: forcing local_files_only=True")
        local_files_only = True
    else:
        local_files_only = False

    # loads files from hub
    cached_repo_dir = snapshot_download(model_name_or_path, allow_patterns=["*"], local_files_only=local_files_only, revision=revision)

    # creates a list of paths from all downloaded files in cache dir, matching the glob *.pt
    file_list = [str(entry) for entry in Path(cached_repo_dir).rglob('*.pt') if entry.is_file()]
    return file_list
you should be able to use it if you update your transformers clone to the latest.
credits: @philschmid
p.s. we should probably add a minimal version requirement for huggingface_hub>=0.9.0 when this is ready to merge. It's already released so you can just add it now as well.
from transformers.utils.versions import require_version
require_version("huggingface_hub>=0.9.0")
p.s. note to myself: it doesn't seem to currently respect the TRANSFORMERS_CACHE env var, as it's internal to transformers - need to figure it out.
when I run:
python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
the first input went ok, but when it returned it messed things up:
Loading model...
Model loaded
Input text: whoah!
change generate_kwargs? [y/n] y
Generate kwargs: {"min_length": 100, "max_new_tokens": 100, "do_sample": false}
Output text: whoah!"
"What's the matter?" asked the other.
"Matter?" repeated the first. "Why, I thought I heard a shot."
"So did I," said the other. "But it was only a gun."
"Only a gun!" exclaimed the first. "Then it must have been a gun."
"It was," said the other. "But it was only a gun."
"Only a gun!" repeated
Generated tokens: 100
Input text: change generate_kwargs? [y/n]
as you can see the last line it mixed 2 inputs together
I think it's because I was impatient and pressed Enter at some point.
but it shouldn't allow an empty prompt, should it?
Perhaps it should consume all inputs before generating a new prompt, so that only the response to the prompt is used?
Also it'd be nice if cli.py exited cleanly on Ctrl-C instead of continuing to run and requiring multiple breaks (see the sketch after the traceback below):
Input text: change generate_kwargs? [y/n] ^CTraceback (most recent call last):
File "scripts/bloom-inference-server/cli.py", line 80, in <module>
main()
File "scripts/bloom-inference-server/cli.py", line 61, in main
if (input("change generate_kwargs? [y/n] ") == "y"):
KeyboardInterrupt
^C
^C
^C
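A minimal pattern for the clean exit (and for rejecting empty prompts), illustrative only and not the PR's cli.py:

import sys

while True:
    try:
        text = input("Input text: ")
    except (KeyboardInterrupt, EOFError):
        print()  # move past the ^C before exiting
        sys.exit(0)
    if not text.strip():
        continue  # skip empty prompts instead of generating from them
    # ... tokenize, generate, print the output ...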
The DS-inference server crashes for me w/o the cached tp checkpoint
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[...]
Traceback (most recent call last):
File "scripts/bloom-inference-server/benchmark.py", line 188, in <module>
main()
File "scripts/bloom-inference-server/benchmark.py", line 179, in main
benchmark_end_to_end(args, DSInferenceModel)
File "scripts/bloom-inference-server/benchmark.py", line 75, in benchmark_end_to_end
model, initialization_time = run_and_log_time(
File "scripts/bloom-inference-server/benchmark.py", line 37, in run_and_log_time
results = execs()
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/utils/utils.py", line 30, in __call__
return self.func(**self.kwargs)
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/model.py", line 57, in __init__
self.model = deepspeed.init_inference(
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/DeepSpeed/deepspeed/__init__.py", line 293, in init_inference
engine = InferenceEngine(model,
File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/DeepSpeed/deepspeed/inference/engine.py", line 154, in __init__
self.module.to(device)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
The DS-inference server crashes for me w/o the cached tp checkpoint
Already on it: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328#issuecomment-1221609552 This is a bug in deepspeed: https://github.com/microsoft/DeepSpeed/pull/2132#issuecomment-1221592051
@RezaYazdaniAminabadi @jeffra
I feel like an empty prompt is fine. This is basically unconditional generation, right? I will try to fix the keyboard interrupt.
I feel like an empty prompt is fine. This is basically unconditional generation, right?
Good question - I have never tried it - I guess it should work.
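For what it's worth, an empty prompt with transformers usually amounts to conditioning on just the BOS token, roughly like this (a sketch with a small model, not the PR's code):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# "empty prompt": start generation from the BOS token only
input_ids = torch.tensor([[tokenizer.bos_token_id]])
output = model.generate(input_ids, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))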
Also, @stas00 I have been meaning to ask: is accelerate using DeepSpeed ZeRO as its backend? Because if that is the case, the generation time per batch for both of them differs greatly, right?
Is accelerate using DeepSpeed ZeRO as its backend?
It can but it's not using it in our scripts here. It uses a naive pipeline parallelism, assigning different layers to different devices and switching between those.
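Concretely, that accelerate path is the device_map flow of from_pretrained, roughly like this (a sketch, assuming enough combined GPU memory; not the exact script):

import torch
from transformers import AutoModelForCausalLM

# accelerate assigns different layers to different visible GPUs and moves the
# activations between them as generation proceeds - no ZeRO involved
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)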
@stas00, while working on this, I found a really interesting bug. Could be a severe one. I have provided a minimal working example.
from huggingface_hub import snapshot_download
- I am still using the old method (it still works with >=0.9.0) if the user inputs the model_name as bigscience/bloom in ds-inference. I am doing this because snapshot_download is not respecting TRANSFORMERS_CACHE and I don't want it to download duplicate checkpoints (for bigscience/bloom). However, this is slow (it reshards the 72 files).
- For the newer weights provided by Microsoft, I am using snapshot_download. Also, I am adding support for int8 in this PR. This path, however, is much faster.
I feel like this is the best approach for now. If we want to change this we can do that in a separate PR in the future.
@stas00
because snapshot_download is not respecting the TRANSFORMERS_CACHE
If you use HUGGINGFACE_HUB_CACHE instead of TRANSFORMERS_CACHE it would work for either.
Alternatively you can do:
HUGGINGFACE_HUB_CACHE = os.getenv("HUGGINGFACE_HUB_CACHE", None)
cache_dir = os.getenv("TRANSFORMERS_CACHE", HUGGINGFACE_HUB_CACHE)
snapshot_download(..., cache_dir=cache_dir)
If neither is set, None would lead to the default path.
Actually, the first line isn't needed, as it's already the default: https://github.com/huggingface/huggingface_hub/blob/7aadc3243d9478a48e17f568d8d4950cafd11ca2/src/huggingface_hub/_snapshot_download.py#L103-L104
so just:
cache_dir = os.getenv("TRANSFORMERS_CACHE", None)
snapshot_download(..., cache_dir=cache_dir)
DeepSpeed-MII not working with new Microsoft weights: https://github.com/microsoft/DeepSpeed-MII/issues/50
@stas00 https://huggingface.co/bigscience/bloom/tree/main Some changes were pushed to bigscience/bloom just 3 hours ago. snapshot_download fails as a result of this change.
File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/ds_zero/model.py", line 17, in __init__
return snapshot_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
downloaded_model_path = get_downloaded_model_path(args.model_name)
File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/utils/model.py", line 105, in get_downloaded_model_path
return f(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
return snapshot_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
_ = hf_hub_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
return f(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
_ = hf_hub_download(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
_create_relative_symlink(blob_path, pointer_path)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
_create_relative_symlink(blob_path, pointer_path)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
os.symlink(relative_src, dst)
FileExistsError: [Errno 17] File exists: '../../../../blobs/c4072f2f02baee6f0d50739cf14439f61e0605913ed65606f9bcb11043838a2e' -> '/home/gpttest/nfs/mayank/HF_cache/models--bigscience--bloom/snapshots/874fa447d586d0a28ab66aa3ca49da43ae1ec0f6/tensorboard/main/events.out.tfevents.1649441015.jean-zay-iam50.156467.0'
Any idea how to get around this? Also, if this is fixed, would snapshot_download download all the checkpoints again? That would be highly impractical, right?
Please let me know. I am not sure of a good solution. Also, clearing the cache for every commit might not be practical.
it looks like only README.md was updated, so it's weird that it'd fail.
I just run:
python -c "from huggingfhub import snapshot_download; snapshot_download('bigscience/bloom')"
and it updated w/o any issues.
Probably move that file it complains about for now? Could be some bug in huggingface_hub?
it shouldn't redownload anything but the new or updated files.
@mayank31398, please also note the renames in this commit https://github.com/microsoft/DeepSpeed/pull/2217/files/c77f5e0e385db405475b2842a6f5725c0f551170..f3f4b1dda9f9ba09042b8b63bbee475c9364bc56 in particular s/mp_size/tp_size/
this is now merged so we should update our side - you can just do a global search and replace on all of the files under scripts - or if you'd prefer I do that please let me know - don't want to step on your toes if you're still on the old DS branch.
Let me try updating to the latest HF hub and deepspeed to see if that fixes things. Nvm, I am on the latest version. I agree this should work. @stas00 It's working fine for GPT-2. No idea what is going wrong for BLOOM though. For now, I'll just remove this file.
Nevermind, I have fixed this. It was a bug because all 8 processes in my DeepSpeed code were trying to download simultaneously.
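The usual fix for that race is to let a single rank download while the others wait at a barrier; a rough sketch (assumes torch.distributed is initialized, and is not necessarily how the PR fixes it):

import torch.distributed as dist
from huggingface_hub import snapshot_download

def download_once(model_name: str) -> str:
    if dist.get_rank() == 0:
        path = snapshot_download(model_name)
    dist.barrier()  # everyone waits until rank 0 has finished downloading
    if dist.get_rank() != 0:
        path = snapshot_download(model_name)  # now a cache hit, no concurrent writes
    return path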
Oddly enough, it doesn't look like mp_size was renamed to tp_size, so I only had to modify this line with the updated checkpoints:
checkpoints_json = os.path.join(repo_root, "ds_inference_config.json")
@mayank31398, would it be OK if we merged this PR - and then perhaps have another iteration of tweaks in subsequent PRs if needed? As we are renaming files I want the rename to be merged first before tweaking those.
You have the commit rights, so you can do the honours if you're in agreement.
Thank you!
Yes @stas00 , currently, MII is not working with DS-inference. But we can merge and I can continue to work on this.
Also, @stas00 there are some pieces of code that both the server folder and the scripts folder can share. I will work on factoring out the redundant code once all the server stuff is fixed :)
perfect, then you can start a new PR once you have some new edits. We can just do it in iterations.
I need to edit the original scripts to support the new repos. Will do in another PR.
Currently, the scripts are working correctly. However, there is some memory leak. In https://github.com/huggingface/accelerate/issues/614 @sgugger says that it's not in accelerate, which is probably true.
My hunch is that it's a related bug: https://github.com/apache/incubator-mxnet/issues/19159