aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Is it possible to compile and create a model graph for an Inf2 instance on a CPU-only instance using the neuronx-cc compiler? #942

Open samarth1612 opened 1 month ago

samarth1612 commented 1 month ago

I want to compile a model, but not on an Inf2 or Trn1 instance; rather, I want to compile it on a CPU-only instance, say a c5 instance on AWS. Is this possible, and if so, what needs to be done to achieve it? I ask because model compilation takes a lot of time on the instance, and in the end the compilation runs on the CPU anyway, so wouldn't it be better to compile on a CPU-only instance and then deploy the resulting .neff files to the Inf2 instance?

I want to run a command similar to the one below:

neuronx-cc compile /D/Llama-2-7b-hf/ --framework XLA --target inf2 --model-type transformer --auto-cast-type bf16 --output llama2-7b.neff

mrnikwaws commented 1 month ago

Hi @samarth1612,

Yes, this is possible. If you install neuronx-cc in a Python environment, this should work fine. Setting up the environment to use torch_neuronx.trace is a little more complex, but should also be doable.
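For reference, a minimal setup sketch along these lines (the extra index URL is AWS's public Neuron pip repository; consult the official Neuron setup guide for the authoritative steps for your OS):

```shell
# Create an isolated Python environment on the CPU-only instance
python3 -m venv aws_neuron_venv
source aws_neuron_venv/bin/activate

# Install the Neuron compiler from AWS's Neuron pip repository
pip install neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
```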

samarth1612 commented 1 month ago

Thanks @mrnikwaws for letting me know this.

System specs:

Architecture:           x86_64
CPU op-mode(s):       32-bit, 64-bit
Vendor ID:              GenuineIntel
Model name:           Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz

OS Specs:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

I tried creating a venv on my system; the packages below and their dependencies are installed.

aws-neuronx-runtime-discovery==2.9
libneuronxla==2.0.2335
neuronx-cc==2.14.227.0+2d4f85be
torch-neuronx==2.1.2.2.2.0
transformers-neuronx==0.11.351

I also tried running the command below, where /D/Llama-2-7b-hf/ contains the Llama 2 7B model from Hugging Face with .bin and .safetensors model weights:

neuronx-cc compile /D/Llama-2-7b-hf/ --framework XLA --target inf2 --model-type transformer --auto-cast-type bf16 --output llama2-7b.neff

I am getting the following error after executing this:

Process Process-1:
Traceback (most recent call last):
  File "neuronxcc/driver/CommandDriver.py", line 343, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
  File "neuronxcc/driver/commands/CompileCommand.py", line 1277, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
  File "neuronxcc/driver/commands/CompileCommand.py", line 1228, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/commands/CompileCommand.py", line 1248, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/commands/CompileCommand.py", line 1251, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/Job.py", line 346, in neuronxcc.driver.Job.SingleInputJob.run
  File "neuronxcc/driver/Job.py", line 372, in neuronxcc.driver.Job.SingleInputJob.runOnState
  File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
  File "neuronxcc/driver/Job.py", line 346, in neuronxcc.driver.Job.SingleInputJob.run
  File "neuronxcc/driver/Job.py", line 372, in neuronxcc.driver.Job.SingleInputJob.runOnState
  File "neuronxcc/driver/jobs/Frontend.py", line 431, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
  File "neuronxcc/driver/jobs/Frontend.py", line 210, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
  File "neuronxcc/driver/jobs/Frontend.py", line 186, in neuronxcc.driver.jobs.Frontend.Frontend.runHlo2Tensorizer
neuronxcc.driver.Exceptions.CompilerInvalidInputException: ERROR: Failed command  /home/samarth/aws_neuron_venv_pytorch/lib/python3.10/site-packages/neuronxcc/starfish/bin/hlo2penguin --input /D/Llama-2-7b-hf/ --out-dir ./ --output penguin.py --remat --max-costly-ops=2 --max-live-in-size=5 --max-remat-chain-size=10 --max-mem-multiple=1.8 --min-def-use-distance=500 --remat-policy=transformer --allow-same-pass-remat=true --split-abc --layers-per-module=1 --coalesce-all-gathers=false --coalesce-reduce-scatters=false --coalesce-all-reduces=false --emit-tensor-level-dropout-ops --emit-tensor-level-rng-ops --native-to-custom-softmax --partitioner-opts='--transformer'
------------
Reported stdout:
terminate called after throwing an instance of 'nlohmann::json_abi_v3_11_3::detail::parse_error'
  what():  [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON

------------
Reported stderr:
None
------------
Import of the HLO graph into the Neuron Compiler has failed.
This may be caused by unsupported operators or an internal compiler error.
More details can be found in the error message(s) above.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "neuronxcc/driver/CommandDriver.py", line 350, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand_in_process
  File "neuronxcc/driver/CommandDriver.py", line 345, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
  File "neuronxcc/driver/CommandDriver.py", line 111, in neuronxcc.driver.CommandDriver.handleError
  File "neuronxcc/driver/GlobalState.py", line 102, in neuronxcc.driver.GlobalState.FinalizeGlobalState
  File "neuronxcc/driver/GlobalState.py", line 82, in neuronxcc.driver.GlobalState._GlobalStateImpl.shutdown
  File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/samarth/neuronxcc-q8xqry4_'

I am not sure what exactly the issue is; could you please guide and help me with this?

mrnikwaws commented 1 month ago

Hi @samarth1612,

It looks like you have misunderstood what the Neuron compiler compiles. For Inf2 and Trn1, neuronx-cc takes an HLO protobuf file as input. Here you are just providing a directory of (it looks like) transformers weight files, which won't work. In general, using the compiler from a CPU instance is an advanced usage that I wouldn't recommend unless you understand all components of the Neuron system well.
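To illustrate the difference, an invocation of the shape the compiler expects might look like the sketch below. The filename model.hlo.pb is a hypothetical placeholder for an HLO protobuf produced by an XLA frontend; it is not a file that ships with the Hugging Face model, and the flags are taken from the command quoted earlier in this thread:

```shell
# Hypothetical sketch: the positional input must be an HLO protobuf,
# not a directory of transformers weight files
neuronx-cc compile model.hlo.pb --framework XLA --target inf2 --output model.neff
```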

If you want to compile with transformers_neuronx, you need to follow the directions here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#compile-time-configurations. This will not work on a CPU instance in the way you are hoping (this is due to the way transformers_neuronx is designed).

You can explore working with the persistent cache: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html (in particular S3 caching), which may do what you want: avoiding compilation at deployment time and using S3 as the deployment medium for compiled NEFFs. Also check out https://huggingface.co/docs/optimum-neuron/en/guides/cache_system, which may meet your needs.
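As a sketch of how the S3-backed persistent cache is typically pointed at a bucket (the environment variable is described in the Neuron caching docs linked above; the bucket name here is a placeholder):

```shell
# Point the Neuron persistent compilation cache at an S3 bucket,
# so deployed instances can fetch precompiled NEFFs instead of recompiling
# (bucket and prefix are placeholders)
export NEURON_COMPILE_CACHE_URL="s3://my-neuron-cache/compiler-cache"
```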

mrnikwaws commented 1 month ago

I'll leave this ticket open for a day or two for you to respond, but if we don't hear back I plan to close this ticket later this week.

samarth1612 commented 1 month ago

Thanks @mrnikwaws,

I have already tried compiling it on Inf2 instances using the AWS documentation and was able to run inference successfully. From the links you provided, I understand that for neuronx-cc to compile the model, Neuron cores do need to be present on the hardware, right? As this is the way transformers_neuronx works.

Hence, what I actually wanted to do is not possible, as at compile time the compiler requires Neuron cores to be present on the hardware, so it is not possible to compile the model using just a CPU-only instance.

If this is correct then I got my answer.

Thanks a lot @mrnikwaws for the help.