aws-neuron / transformers-neuronx


Failed to run GPT-Neox demo on an inf2.24xlarge instance #10

Closed dacorvo closed 1 year ago

dacorvo commented 1 year ago

I tried to run the gpt-neox demo CLI @0.4.60 on an inf2.24xlarge instance.

I first saved the model with float16 precision:

$ gptneox_demo --amp f16 save gpt-neox-20b

When I then tried to convert and run inference, I ran out of memory:

$ gptneox_demo --amp f16 run --batch_size 1 gpt-neox-20b
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 52.7kB/s]
Downloading (…)olve/main/vocab.json: 1.08MB [00:00, 18.6MB/s]
Downloading (…)olve/main/merges.txt: 457kB [00:00, 12.7MB/s]
Downloading (…)/main/tokenizer.json: 2.11MB [00:00, 17.8MB/s]
Downloading (…)cial_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90.0/90.0 [00:00<00:00, 35.0kB/s]
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu_fast" ignored in favor of hidden_act="gelu_new"
  warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
.........Selecting 57055 allocations
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Selecting 3704 allocations
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
.Analyzing dependencies of Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
...Analyzing dependencies of Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**********************************.*****************
Dependency reduction of sg0000
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
.
Compiler status PASS
2023-Jun-23 09:11:18.0735  6737:6737  ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 150994944
2023-Jun-23 09:11:18.0738  6737:6737  ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_649561b6.csv
2023-Jun-23 09:11:18.0738  6737:6737  ERROR  TDRV:log_dev_mem                             Failed to allocate 144.000MB (usage: tensors) on ND 0:NC 0, current utilization:
    * total: 15.951GB
    * tensors: 15.951GB
    * runtime: 1.062KB
    * dma rings: 32.000KB

2023-Jun-23 09:11:18.0738  6737:6737  ERROR  TDRV:tensor_allocate                         Failed to allocate 150994944 bytes on DEVICE for tensor UNKNOWN.
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/gptneox_demo", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/demo.py", line 28, in main
    demo('EleutherAI/gpt-neox-20b', GPTNeoXForSampling, amp_callback)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 61, in demo
    run(args, model_name, model_cls)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 105, in run
    model.to_neuron()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py", line 71, in to_neuron
    block.to_neuron(n_positions_list)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py", line 285, in to_neuron
    self.mlp_out_weight = shard_along(mlp.dense_4h_to_h.weight.detach().T, dim=0)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/parallel.py", line 109, in shard_along
    return ops.parallel_to_nc(self.shard_along_on_cpu(tensor, dim))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/ops.py", line 49, in parallel_to_nc
    return torch.ops.neuron._parallel_to_neuron(tensors)
  File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 442, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: nrt_tensor_allocate status=4

This was somewhat expected: by default the CLI uses only two Neuron cores, and the model is quite large (20B parameters).
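
For reference, a rough back-of-envelope (my own approximate numbers, not from the tool) of why two cores cannot hold the float16 weights:

# Rough estimate only: parameter count and per-core memory are approximations.
n_params = 20e9            # GPT-NeoX-20B parameters (approx.)
bytes_per_param = 2        # float16, i.e. --amp f16
weights_gb = n_params * bytes_per_param / 1e9   # ~40 GB of weights

hbm_per_core_gb = 16       # matches the ~15.951 GB "total" reported in the error above
for tp_degree in (2, 4):
    per_core_gb = weights_gb / tp_degree
    print(f"tp_degree={tp_degree}: ~{per_core_gb:.0f} GB per core "
          f"({'OOM' if per_core_gb > hbm_per_core_gb else 'fits'})")
# tp_degree=2: ~20 GB per core (OOM)
# tp_degree=4: ~10 GB per core (fits)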

By increasing the number of Neuron cores, I was able to run the model, but the output was garbage:

$ gptneox_demo --amp f16 run --batch_size 1 --tp_degree 4 gpt-neox-20b
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu_fast" ignored in favor of hidden_act="gelu_new"
  warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
....Selecting 31504 allocations
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
..Analyzing dependencies of Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Dependency reduction of sg0000
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

Compiler status PASS
2023-Jun-23 09:16:14.0528 7163:7163 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-23 09:16:14.0528 7163:7163 [0] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
running model.sample
generated_sequence= tensor([[12092,    13,   309,  1353,   247,  3448,  1566,    13, 29589, 22702,
          8822, 22702, 42010, 22702, 22702,  8834, 42010, 42010, 42010, 42010,
         42010, 22702, 42010, 22702, 22702, 42010, 29589, 42010, 22702, 42010,
         42010, 42010, 42010, 22702,  8822, 22702, 22702, 42010, 42010, 22702,
         42010, 42010, 42010, 42010, 42010, 22702, 22702,  8834, 42010, 42010,
         42010, 42010, 22702, 42010, 22702, 42010, 29589, 42010, 22702, 42010,
         42010, 22702, 42010, 42010, 42010, 42010, 42010, 22702, 22702, 42010,
         42010, 42010, 42010, 42010, 22702, 42010, 42010, 42010, 42010, 22702,
         42010, 42010, 42010, 22702, 42010, 22702, 42010, 42010, 22702, 42010,
         42010, 42010, 22702, 42010, 42010, 42010, 42010, 22702, 42010, 42010,
         22702, 42010, 22702, 42010, 42010, 42010, 42010,  8822,  8828, 22702,
         42010, 29589, 42010, 22702, 42010, 42010, 42010, 42010, 22702, 42010,
         42010,  8828, 42010, 22702, 29589, 29589,  8828, 22702]])
["Hello, I'm a language model,blockList errnoErramssymb errnoErr BytePtrFromString errnoErr errnoErramsfonts BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErramssymb errnoErr errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr errnoErramsfonts BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromStringamssymbmathrsfs errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromStringmathrsfs BytePtrFromString errnoErrblockListblockListmathrsfs errnoErr"]
aws-donkrets commented 1 year ago

@dacorvo Thanks for posting this ticket. We are investigating the issue and believe we have identified a fix. We are testing it and will update this ticket with more info.

dacorvo commented 1 year ago

@aws-donkrets I had time to come back to this issue, and I suspect it is related to the fact that the current transformers-neuronx optimized graphs only support the gelu_new activation function used by GPT2, whereas the GPT-NeoX base models from EleutherAI use gelu_fast. Can you confirm?

If I am correct, I can open a more detailed issue listing the GeLU variants that need to be supported to run the most popular models from the Hugging Face Hub.
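
For context, here is my understanding of the two variants (both tanh-based GELU approximations; the formulas below are my reading of transformers.activations and may not match what the Neuron graphs actually implement):

import math
import torch

def gelu_new(x):
    # "gelu_new": the approximation supported by the GPT2-style graphs
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

def gelu_fast(x):
    # "gelu_fast": the activation configured for EleutherAI/gpt-neox-20b
    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))

x = torch.randn(8)
print(gelu_new(x))
print(gelu_fast(x))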

jeffhataws commented 1 year ago

This is a duplicate of https://github.com/aws-neuron/transformers-neuronx/issues/12. We will have the fix in an upcoming release.

dacorvo commented 1 year ago

Can you confirm this is fixed with the latest release?

jeffhataws commented 1 year ago

Hi @dacorvo ,

I confirmed that the GPT-Neox demo is working with release 2.12:

gptneox_demo --amp f16 save gpt-neox-20b; gptneox_demo --amp f16 run --batch_size 1 --tp_degree 4 gpt-neox-20b
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [05:12<00:00,  6.80s/it]
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu_fast" ignored in favor of hidden_act="gelu_new"
  warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
.......
Compiler status PASS
running model.sample
generated_sequence= tensor([[12092,    13,   309,  1353,   247,  3448,  1566,    13,   513,   368,
          2564,   604,   309,  1361,   368,   247,  1652,  2372,   865,   187,
           187,  2773,   434,   835,   776,   747,  3210,  1705,   275,    15,
          1583,  1472,   253, 26101,  4302,   432,  8217,   285,  5559,   326,
          1581,   441,   281,   513,   326,    15,   831,  2934,   273,  5145,
          4715,   285,   849,   352,   588,  1361, 12823,  1805,  2096,   441,
           310,   271, 12302,   581,    15,   733,   434,   271,  2170,   326,
           434,   644,   275,  2440,   323,  1142,  1107,    15,   733,   434,
           760,  4102,   326, 12823,   452,  4925,   247,  1127,   835,   597,
           476,   513,  1633,  4217,   342,   253,   941,   597,   452,    15,
           844,  1849,   760,   644,  2104,   281,   513,  5145, 10234,   342,
           247,  1943,  4382,    13,   247,  2221, 32948,    13,   323,   247,
          1643,  8007,    15,   733,  2335,  3240, 36521,  1078]])

['Hello, I\'m a language model, do you mind if I help you a little bit?"\n\nThat\'s where our new models come in. They\'re the newest technology from Apple and Google that allow us to do that. This idea of machine learning and how it will help computers better understand us is an exciting one. It\'s an area that\'s been in development for many years. It\'s only recently that computers have reached a point where they can do something useful with the data they have. We\'ve only been able to do machine translation with a big computer, a supercomputer, for a few decades. It took quite awhile before']
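
For anyone who wants to reproduce this without the demo CLI, here is a hypothetical Python equivalent, pieced together from the traceback and log lines above (the exact keyword arguments of from_pretrained and sample are assumptions on my part):

from transformers import AutoTokenizer
from transformers_neuronx.gptneox.model import GPTNeoXForSampling

# 'gpt-neox-20b' is the directory written by `gptneox_demo --amp f16 save gpt-neox-20b`
model = GPTNeoXForSampling.from_pretrained('gpt-neox-20b', batch_size=1, tp_degree=4, amp='f16')
model.to_neuron()  # compile and shard the model across 4 NeuronCores

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
input_ids = tokenizer("Hello, I'm a language model,", return_tensors='pt').input_ids
generated = model.sample(input_ids, sequence_length=128)  # sequence length assumed
print(tokenizer.batch_decode(generated))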

Packages:

(aws_neuron_venv_pytorch) ubuntu@ip-10-0-10-149:~$ pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla                  0.5.391
neuronx-cc                    2.8.0.25+a3ad0f342
neuronx-distributed           0.1.0
neuronx-hwm                   2.8.0.3+2b7c6da39
torch-neuronx                 1.13.1.1.9.0
torch-xla                     1.13.1+torchneuron8
transformers-neuronx          0.5.58