dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

How to use 'nsys' in mlc container #531

Closed · Louym closed this 1 month ago

Louym commented 1 month ago

I'm using a pre-built MLC container on my ARM64 (aarch64) platform with an Orin GPU. I've successfully benchmarked LLM inference speed as per the provided documentation. However, when I try to analyze a run with nsys, I can only trace CPU activities. Since GPU information is crucial for our analysis, I need to resolve this. How can I configure nsys within the container to include GPU tracing?

[Screenshot: nsys report showing only CPU activity]
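For reference, a typical invocation that explicitly requests CUDA tracing looks roughly like this (the benchmark script name is a placeholder), yet the report still only shows CPU rows:

```bash
# Explicitly request CUDA (GPU) tracing; benchmark.py is a placeholder.
nsys profile --trace=cuda,nvtx,osrt --output=report python3 benchmark.py
```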

dusty-nv commented 1 month ago

Hi @Louym, firstly I'm glad that you were able to get MLC running and benchmarked. I haven't used nsys inside a container before and am not sure whether special setup or extra tools need to be installed inside the container. You might want to start by profiling a simple CUDA app in a simple CUDA container, if you haven't already. Otherwise, you can find the wheels here (which you can install outside the container): http://jetson.webredirect.org/jp5/cu114
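Something like this would be a reasonable smoke test (just a sketch; the image tag and sample binary are placeholders, and it assumes nsys is available inside the image):

```bash
# Minimal smoke test: confirm nsys can see CUDA activity from inside a
# container at all. The image tag and ./vectorAdd are placeholders.
docker run --rm -it --runtime nvidia nvcr.io/nvidia/l4t-base:r35.3.1 bash

# ...then, inside the container, profile any small CUDA binary:
nsys profile --trace=cuda,osrt --output=/tmp/smoke ./vectorAdd
```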

Louym commented 1 month ago

> Otherwise, you can find the wheels here (which you can install outside the container): http://jetson.webredirect.org/jp5/cu114

Thank you very much for your prompt response. I will try this website later, and I will get back to you if I have any issues.

Louym commented 1 month ago

Hello @dusty-nv! I've installed TVM on my server and successfully run the ResNet-50 examples from the TVM tutorial. However, I'm encountering an issue when trying to run `from tvm.runtime import disco`, as required by MLC. The following check failure occurs:

[Screenshot: check failure traceback]

Even with this line commented out, another error occurs at `from . import base`:

[Screenshot: import error traceback]

Even if I comment all these lines out, I run into more issues when using MLC's ChatModule.

I'm seeking guidance on the correct installation process for mlc_chat or mlc_llm, since I installed the wheels directly with pip. Could you please advise?

Thank you!

dusty-nv commented 1 month ago

Hmm, I haven't done this outside of a container, but it would seem that the MLC wheel version does not correspond to the TVM wheel version. In the containers I lock the package versions to make sure the matching ones get installed; you can see those versions in its config.py:

https://github.com/dusty-nv/jetson-containers/blob/5ba535905ff275e02ff779112d351a754aece94f/packages/llm/mlc/config.py#L31

In general, I would try to replicate the install exactly how I have done it in the container. It looks like you are using conda; I don't think that should impact it, but I'm not sure.
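A quick way to sanity-check what is actually installed (a sketch; `mlc_chat` is the module name I'd expect the MLC wheel to expose, adjust if yours differs):

```bash
# Confirm the TVM and MLC wheels on the path actually pair up.
python3 -c "import tvm; print('tvm', tvm.__version__, tvm.__file__)"
python3 -c "import mlc_chat; print('mlc_chat', mlc_chat.__file__)"
pip3 list | grep -iE 'tvm|mlc'
```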

Louym commented 1 month ago

I tried again and found that I can use `nsys profile --stats=true` to see a summary of GPU kernels, but the timeline is still not visible. I also run into some issues when building tvm-unity (or relax) from source, even though I have set:

```cmake
set(USE_CUDA ON)
set(USE_FLASHINFER ON)
set(FLASHINFER_CUDA_ARCHITECTURES 87)
set(CMAKE_CUDA_ARCHITECTURES 87)
```

[Screenshot: build error]

It would be great if you could help me build TVM outside the container.
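For reference, I'm following roughly the standard tvm-unity source-build flow (a sketch; the repo URL and the exact config edits are my assumptions):

```bash
# Standard tvm-unity build sketch, with the flags shown above.
git clone --recursive https://github.com/mlc-ai/relax tvm-unity
cd tvm-unity && mkdir build && cd build
cp ../cmake/config.cmake .
# Append the CUDA/FlashInfer settings for Orin (sm_87) to config.cmake:
echo 'set(USE_CUDA ON)' >> config.cmake
echo 'set(USE_FLASHINFER ON)' >> config.cmake
echo 'set(FLASHINFER_CUDA_ARCHITECTURES 87)' >> config.cmake
echo 'set(CMAKE_CUDA_ARCHITECTURES 87)' >> config.cmake
cmake .. && make -j"$(nproc)"
```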

Louym commented 1 month ago

I managed to see the nsys timeline results by compiling the models inside the container and then benchmarking them outside the container, in an environment built from the wheels at http://jetson.webredirect.org/jp6/cu122 (JetPack 6, CUDA 12.2).
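For anyone else hitting this, the profiling step outside the container looks roughly like this (the benchmark script and its arguments are placeholders):

```bash
# Capture a timeline with CUDA tracing; open the resulting .nsys-rep
# in the Nsight Systems GUI to view the GPU rows.
nsys profile --trace=cuda,nvtx,osrt --output=mlc_bench python3 benchmark.py

# Text summary of GPU kernels (similar to --stats=true):
nsys stats mlc_bench.nsys-rep
```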

dusty-nv commented 1 month ago

Awesome @Louym, glad you managed to get it working 👍