aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Model Compilation Issue in AWS Neuron Environment #76

Closed ShivamB25 closed 2 weeks ago

ShivamB25 commented 1 month ago

Description

After running the script through the compilation step, the compiled model is not written to disk. The compilation logs indicate that the process completes without errors, but the expected model file model.pt is missing from the directory sd2_compile_dir_768/unet/.

Steps to Reproduce

  1. Activate the pre-built PyTorch-2.1 environment for Inf2, Trn*:
    source /opt/aws_neuronx_venv_pytorch_2_1/bin/activate
  2. Run the provided template code from the repository:
    python3 test3.py
  3. Observe the logs and check for the existence of the model file in the specified directory.

Expected Behavior

The model file model.pt should be present in the directory sd2_compile_dir_768/unet/ after the compilation process completes.

Actual Behavior

The model file model.pt is missing from the directory sd2_compile_dir_768/unet/.

Compilation Logs

2024-05-30T09:32:51Z Running birverifier
2024-05-30T09:32:52Z birverifier finished after 1.166 seconds
2024-05-30T09:32:52Z Running codegen
2024-05-30T09:32:57Z isa_gen finished after 4.293 seconds
2024-05-30T09:32:58Z dma_desc_gen finished after 1.495 seconds
2024-05-30T09:33:01Z debug_info_gen finished after 2.790 seconds
2024-05-30T09:33:02Z codegen finished after 9.213 seconds
2024-05-30T09:33:02Z Running neff_packager
2024-05-30T09:33:29Z neff_packager finished after 27.627 seconds
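
To confirm which compiler stages actually reported completion, the log lines above can be parsed mechanically. The following is a minimal sketch, assuming every completed stage is logged as "<timestamp> <stage> finished after <seconds> seconds" as in the excerpt; the function name finished_stages is hypothetical and is not part of the Neuron SDK.

```python
import re

# Matches lines such as:
#   2024-05-30T09:33:29Z neff_packager finished after 27.627 seconds
# The line format is an assumption taken from the log excerpt in this issue.
STAGE_RE = re.compile(r"^\S+\s+(\w+) finished after ([\d.]+) seconds$")


def finished_stages(log_text):
    """Return {stage_name: seconds} for every stage that logged completion."""
    stages = {}
    for line in log_text.splitlines():
        match = STAGE_RE.match(line.strip())
        if match:
            stages[match.group(1)] = float(match.group(2))
    return stages
```

A stage that appears in a "Running ..." line but not in the returned dictionary did not log completion. Note that these lines cover only the compiler stages; they do not show whether the script went on to save model.pt afterwards.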

Error Message

Traceback (most recent call last):
  File "/home/ubuntu/test3.py", line 124, in <module>
    pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/jit/_serialization.py", line 152, in load
    raise ValueError(f"The provided filename {f} does not exist")  # type: ignore[str-bytes-safe]
ValueError: The provided filename sd2_compile_dir_768/unet/model.pt does not exist
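
Because the failure only surfaces at load time, it can help to verify that the compiled artifacts exist before calling torch.jit.load. The following is a minimal sketch using only the standard library; the helper name missing_artifacts and its parameters are hypothetical, and only the unet/model.pt path is taken from this issue (any other component names passed in would be assumptions about the script's layout).

```python
from pathlib import Path


def missing_artifacts(compile_dir, components=("unet",), filename="model.pt"):
    """Return expected artifact paths that are absent or zero bytes.

    compile_dir: root output directory, e.g. "sd2_compile_dir_768".
    components:  subdirectories expected to contain `filename`
                 (only "unet" is confirmed by this issue).
    """
    root = Path(compile_dir)
    missing = []
    for name in components:
        path = root / name / filename
        # A zero-byte file would suggest the save step was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path)
    return missing
```

Calling missing_artifacts("sd2_compile_dir_768") right after the compilation script exits, and again before the inference script loads the model, would show whether the file was never written or was removed in between.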

Environment Details

Additional Information

Repository: aws-neuron-samples
Template used: hf_pretrained_sd2_768_inference.ipynb
Scripts: test.py (for compilation), test3.py (for inference)


chafik-c commented 3 weeks ago

> The model file model.pt is missing from the directory sd2_compile_dir_768/unet/.

Hi, I routed the issue to the appropriate team within the org. We will track it and get back to you.

ShivamB25 commented 3 weeks ago

@chafik-c thanks

ShivamB25 commented 2 weeks ago

@chafik-c It may have run out of RAM. I tried this on an 8xlarge instance and it worked fine.
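
For anyone hitting the same symptom, one way to test the out-of-memory theory is to look at available memory on the instance before starting compilation. The following is a minimal sketch, assuming a Linux host that exposes /proc/meminfo with a MemAvailable field; the function names are hypothetical, and how much memory the compilation actually needs is not stated in this thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages counts)
    are plain numbers without a unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(meminfo_path="/proc/meminfo"):
    """Return the kernel's MemAvailable estimate in GiB (Linux only)."""
    with open(meminfo_path) as handle:
        info = parse_meminfo(handle.read())
    return info["MemAvailable"] / (1024 ** 2)
```

Printing available_gib() before compilation on both the smaller instance and the 8xlarge would make the difference in headroom visible.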