thies1006 opened this issue 2 years ago
The second error appears very sporadically (after thousands of cycles) even when only using `generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True)`, but only (I think) when changing the input between each cycle, e.g. using:
```python
def generate():
    """Returns a list of zipped inputs, outputs and number of new tokens."""
    random.shuffle(input_sentences)
    inputs = input_sentences[:args.batch_size]
```
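For context, a fuller sketch of what such a loop could look like end to end; `input_sentences`, `args`, `tokenizer`, `model`, and `generate_kwargs` are assumed to be set up elsewhere in the script, so this is only an illustration of how the inputs change between cycles, not the exact benchmark code:

```python
import random

def generate():
    """Returns a list of zipped inputs, outputs and number of new tokens."""
    # Pick a fresh, randomly ordered batch each cycle, so the prompts
    # (and their tokenized lengths) differ between iterations.
    random.shuffle(input_sentences)
    inputs = input_sentences[:args.batch_size]

    input_tokens = tokenizer(inputs, return_tensors="pt", padding=True).to("cuda")
    outputs = model.generate(**input_tokens, **generate_kwargs)

    num_new_tokens = [
        len(o) - len(i) for i, o in zip(input_tokens["input_ids"], outputs)
    ]
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return list(zip(inputs, decoded, num_new_tokens))
```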
I had the same issue. On my side this illegal memory access error only happens for batch sizes 2 and 4; with batch sizes 8 to 32 I can run the inference script. For batch size 1, I observed that the error is intermittent: it only happens when I have a longer input.
Thanks @pai4451. To follow up, my impression is that the script works fine for short texts, but for longer ones (>200 tokens) it is more likely to crash. For me it crashes for all batch sizes I tried (1, 2, 4, 8); the error message is not always the same, however. The problem with the sampling was on my side. I guess I ended up with the same problem as in #318.
@RezaYazdaniAminabadi Hi, just for your reference: I have tested https://github.com/microsoft/DeepSpeed/pull/2196, but it does not seem to resolve the "illegal memory access" issue on our side.
@pohunghuang-nctu can you confirm your cuda version? I was using 11.6 and getting the same issue. Using 11.3 resolved it for me. Please give it a try. Thanks
@mayank31398 Thanks for the suggestion. We're on PyTorch 1.11 + CUDA 11.5; what's your PyTorch version? By the way, are you running BLOOM on a single node (8x A100) or on multiple nodes?
@pohunghuang-nctu I have PyTorch installed using conda (with CUDA 11.3) and DeepSpeed and apex have been installed from master branch using CUDA 11.3
@mayank31398 Thanks for the information. Can I confirm that you no longer encounter the illegal memory access error, for both of the cases above, with CUDA 11.3 and the latest DeepSpeed built from the master branch, i.e. after installing DeepSpeed with microsoft/DeepSpeed#2196 and CUDA 11.3? I also tried the same setup (CUDA 11.3 and the latest DeepSpeed from the master branch) but am still facing illegal memory access in both cases. Also, I ran the code on two nodes, each consisting of 8x A6000 GPUs; maybe this is the difference?
I haven't played around that much with it. But batch size >1 is working for me.
I only have a single node with 8 GPUs, 80GB each. Are you using pipeline parallelism across nodes? Does DS-inference support that?
@pai4451 Currently, I limit the token length for each query to 128. I am going to increase this soon. But can you try with a smaller length and see if the issue is resolved? Thanks
Regarding the batch size, I have tried with batch sizes up to 128 and it was working fine on my side.
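For anyone wanting to try the shorter-input suggestion quickly, capping the prompt length at tokenization time is one option; this is only an illustrative sketch (the model name and the 128-token cap are placeholders, and `input_sentences` stands for the list of prompts), not the code Reza is referring to:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# Truncate every prompt to at most 128 tokens before calling generate(),
# to test whether shorter inputs avoid the illegal memory access.
input_tokens = tokenizer(
    input_sentences,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
)
```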
@mayank31398 Thanks. I just launched DeepSpeed with an additional `hostfile` argument to run on multiple nodes.
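For reference, the hostfile is just a list of hosts with GPU slot counts, and the launcher takes it via `--hostfile`. The hostnames below are placeholders, and the script arguments simply mirror the `bloom-ds-inference.py` invocation quoted later in this thread:

```bash
# hostfile: one line per node, "slots" = number of GPUs on that node
cat > hostfile << EOF
node1 slots=8
node2 slots=8
EOF

# launch the inference script across both nodes instead of using --num_gpus
deepspeed --hostfile hostfile bloom-ds-inference.py --name bigscience/bloom --benchmark
```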
@RezaYazdaniAminabadi Yeah, no error happens when I use 128 input tokens and batch size one. But I found that inputs longer than 600 tokens always lead to illegal memory access on my two-node server, with each node having 8x A6000. About different batch sizes: I can run the inference code with batch sizes 1, 8, 16, and 32, but it raises an illegal memory access error with batch sizes 2 and 4.
My environment with Python 3.8 and CUDA 11.3:

```
torch==1.12.1+cu113
deepspeed==0.7.1+28dfca8a
```
- DS-inference definitely supports multi-node.
- DeepSpeed MII does NOT support multi-node, because the way it calls deepspeed is hardcoded to use `--num_gpus`. We did some customization to allow it to accept a hostfile so it can spawn processes on multiple nodes.
@pohunghuang-nctu @pai4451 Thanks for letting me know about the multi-node deployment. I am guessing this uses pipeline parallelism? However, what are the advantages of using multi-node during inference? I am guessing it would be slower than a single node with 8x A100 80GB GPUs, right?
@mayank31398 I don't think there is much advantage in using multi-node for inference. We need multi-node inference just because we only have several 8x A6000 48GB servers.
@mayank31398 May I ask about your `transformers` and `deepspeed` versions? I just found that using the latest master branch of `transformers` makes the illegal memory access even worse.
I built deepspeed from source (master branch). Also, transformers (4.21.1) was installed using pip.
On my side I still get the error `RuntimeError: CUDA error: an illegal memory access was encountered` (with 128 input tokens, CUDA 11.3 and batch size 1). However, my impression is that it becomes rarer with fewer input tokens.
```
deepspeed info ................... 0.7.1+9b418c1e, 9b418c1e
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
```
I haven't tried adjusting the input tokens, @thies1006. But I can confirm: I ran with input text = "Hello" and generated tokens from 10, 50, 100, 300, 500, 1000, 2000, 5000, and it didn't crash for me in any scenario.
@mayank31398 From my impression, it is the number of input tokens that matters for the illegal memory access error, not the number of generated tokens. I can also generate two to three thousand tokens without issue when the input is short. But when I increase the input sequence beyond a certain length (in my case, above 600 tokens), these errors happen all the time.
Maybe if you have some time, could you try increasing the number of input tokens and check whether the issue really doesn't show up on your side?
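If it helps, a crude sweep over input lengths could look like the following; the repeated-word prompt is just a cheap way to control the input length, and `model` / `tokenizer` are assumed to be the already-initialized DS-inference objects, so treat this as a sketch rather than the actual benchmark script:

```python
import torch

# Sweep the prompt length with a fixed number of generated tokens, to see
# whether the illegal memory access correlates with the input length.
for n_input_tokens in (128, 300, 600, 900, 1200):
    prompt = " hello" * n_input_tokens  # roughly one token per repetition
    inputs = tokenizer(prompt, return_tensors="pt").to(torch.cuda.current_device())
    try:
        outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
        new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
        print(f"{n_input_tokens} input tokens: ok, generated {new_tokens} tokens")
    except RuntimeError as err:
        print(f"{n_input_tokens} input tokens: {err}")
        break
```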
I see @pai4451. I'll give it a shot.
Probably related: https://github.com/microsoft/DeepSpeed/issues/2062
I am having a similar error running the model on 4x A100 40GB cards with batch size 1. After about 18 examples, I get the CUDA illegal access error. This does not seem to be closely related to the sequence length though, because it happens at about the same point whether I use priming examples or not (which increases the prompt length significantly).
I am on CUDA 11.3, PyTorch 1.10 and DeepSpeed 0.7.0. Edit: I can confirm this is still happening with DeepSpeed 0.7.1.
@RezaYazdaniAminabadi any followup on this? I am facing similar CUDA issues with longer input sequence lengths.
Hi @mayank31398,
I am still working on this. Can I ask what an average maximum number of tokens for an input would be? Potentially, this can go to as many tokens as user requests, but unfortunately there is a limit on what can be cached. Thanks, Reza
@RezaYazdaniAminabadi I am also not sure, but BLOOM is trained using ALiBi, so ideally there should be no limit. I understand that this might not be possible. But GPT-3 allowed input + generated tokens = 4000 tokens. Is that a target you think might be possible?
@RezaYazdaniAminabadi I can share my findings. I use two 8x A6000 (48G) nodes for inference, and when the input has more than 600 tokens it always leads to the CUDA illegal memory access error (no matter what value I set for the number of output tokens).
@RezaYazdaniAminabadi I often run such models with ~1000 input tokens and generate ~500 tokens.
I'm running the inference script `bloom-ds-inference.py` by invoking:

```bash
deepspeed --num_gpus 1 ~/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom-1b3 --benchmark
```

but I change the generation arguments to `generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False, use_cache=False)` (adding the `use_cache` option). Error:
When using `generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True, use_cache=False)` I get a different error: