accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Does Accel-Sim support analysing Python programs? #238

Closed ys-2020 closed 1 year ago

ys-2020 commented 1 year ago

Dear Authors,

Thanks for developing the Accel-Sim framework. I am wondering whether it is possible to use Accel-Sim to analyse my own Python programs (e.g., a matmul with the cuDNN backend, or a linear layer in PyTorch)? Thank you very much!

JRPan commented 1 year ago

Yes. You can use the tracer to collect the traces yourself. You need a real GPU to do that.

Thanks.

ys-2020 commented 1 year ago

Hi @JRPan, thanks for the prompt response! Do you have any examples for that? It looks like all the examples provided in the codebase start from a *.cu file.

I tried to build my own binaries from Python and used ./util/tracer_nvbit/run_hw_trace.py to trace them. However, it seems that the tracer cannot locate the kernel and reports the error: Unable to open file: /home/ys/accel-sim-dev/hw_run/traces/device-1/11.8/torch_mm/NO_ARGS/traces/kernelslist.

Thanks very much!

JRPan commented 1 year ago

Can you provide all the changes you made, and also the full output of run_hw_trace.py? You also need to update https://github.com/accel-sim/accel-sim-framework/blob/release/util/job_launching/apps/define-all-apps.yml to include the run command. exec_dir should be the folder containing your Python script, and execs can just be the script name, e.g. run.py.

JRPan commented 1 year ago

Or you can trace one of the examples we provide; within the trace folder, there should be a run.sh file.

You can use it as a reference for how the environment variables are set, and launch your Python script manually. The tracer injects into the CUDA runtime by setting CUDA_INJECTION64_PATH and LD_PRELOAD. With these two variables set, you can just run the script as usual, and traces should be generated in the current working directory. Then you need to call post-traces-processing to process the traces.

I don't recommend this, but it is a workaround.
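The manual workflow above can be sketched roughly as follows. This is an illustrative sketch, not an exact copy of any run.sh: the ACCELSIM_ROOT location and the exact paths to tracer_tool.so and post-traces-processing are assumptions about a default accel-sim-framework checkout and may differ in your build, so compare against the run.sh in a generated trace folder first.

```shell
# Assumption: framework checkout location -- adjust to your tree.
export ACCELSIM_ROOT=$HOME/accel-sim-framework

# Inject the NVBit tracer into the CUDA runtime (the two variables
# the run.sh files set; paths assume the tracer tool is already built).
export CUDA_INJECTION64_PATH=$ACCELSIM_ROOT/util/tracer_nvbit/tracer_tool/tracer_tool.so
export LD_PRELOAD=$CUDA_INJECTION64_PATH

# Run the workload as usual; raw traces should appear in ./traces
# under the current working directory.
python3 torch_mm.py

# Post-process the raw traces into the format the simulator reads.
$ACCELSIM_ROOT/util/tracer_nvbit/tracer_tool/traces-processing/post-traces-processing ./traces/kernelslist
```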

Thanks

ys-2020 commented 1 year ago

I changed define-all-apps.yml and added the following code to it:

# Test torch mm
torch-mm:
    exec_dir: "$GPUAPPS_ROOT/bin/$CUDA_VERSION/release/"
    data_dirs: "$GPUAPPS_ROOT/data_dirs/"
    execs:
        - torch_mm:
            - args: 
              accel-sim-mem: 4G

In exec_dir, I have torch_mm.py:

#!/usr/bin/env python3

import torch

a = torch.randn(128,128).cuda()

b = torch.mm(a,a)
print(b)

The output of run_hw_trace.py is as follows:

(accelsim) ys@x4:~/accel-sim-dev$ ./util/tracer_nvbit/run_hw_trace.py -B torch-mm -D 1
Running torch_mm
------------- NVBit (NVidia Binary Instrumentation Tool v1.5.5) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
    EXCLUDE_PRED_OFF = 1 - Exclude predicated off instruction from count
      TRACE_LINEINFO = 0 - Include source code line info at the start of each traced line. The target binary must be compiled with -lineinfo or --generate-line-info
DYNAMIC_KERNEL_LIMIT_END = 0 - Limit of the number kernel to be printed, 0 means no limit
DYNAMIC_KERNEL_LIMIT_START = 0 - start to report kernel from this kernel id, 0 means starts from the beginning, i.e. first kernel
   ACTIVE_FROM_START = 1 - Start instruction tracing from start or wait for cuProfilerStart and cuProfilerStop. If set to 0, DYNAMIC_KERNEL_LIMIT options have no effect
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
       TOOL_COMPRESS = 1 - Enable traces compression
     TOOL_TRACE_CORE = 0 - write the core id in the traces
TERMINATE_UPON_LIMIT = 0 - Stop the process once the current kernel > DYNAMIC_KERNEL_LIMIT_END
USER_DEFINED_FOLDERS = 0 - Uses the user defined folder TRACES_FOLDER path environment
----------------------------------------------------------------------------------------------------
tensor([[ -2.5471,   6.4450,  27.0549,  ...,  -3.4513,   7.3517,  15.4464],
        [  1.0302,   7.5373,  -7.5413,  ...,   9.7875,   2.3590,   2.3195],
        [  1.0601,  -9.4976,  -3.4694,  ...,   1.3495,  -8.1080,   0.1080],
        ...,
        [ 10.1463, -10.0735,  -2.1262,  ..., -11.3776, -22.1502,   6.3534],
        [-35.4395, -12.3930,  -4.2529,  ...,  -3.9852,  16.3150,  -9.2881],
        [-35.4552,  18.4648,  -1.6587,  ...,  -0.2043, -14.5533,  11.3531]],
       device='cuda:0')
Unable to open file: /home/ys/accel-sim-dev/hw_run/traces/device-1/11.8/torch_mm/NO_ARGS/traces/kernelslist
JRPan commented 1 year ago

Okay, it seems the tracer is invoked correctly, but the traces are not generated. I'll ask someone with experience with PyTorch to take a look. Meanwhile, you can try the manual method I explained.

Thanks

ys-2020 commented 1 year ago

Thanks so much for your help! Looking forward to your reply. And I will also try the manual method.

cesar-avalos3 commented 1 year ago

Hello, yes it does; we usually run it with a shell script invoking Python. I PR'ed a fix that allows running Python workloads like this as well. The issue was that the tool was not injected properly: PyTorch likes to rewrite LD_PRELOAD, so we also have to set CUDA_INJECTION64_PATH.
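Since PyTorch can rewrite LD_PRELOAD, one quick hedge when tracing fails silently is to check, from inside the workload itself, whether the injection variables actually reached the process. This is just a diagnostic sketch (not part of the framework); add it at the top of a script like torch_mm.py:

```python
import os

# Print the two variables the tracer relies on. If LD_PRELOAD shows up
# empty here while you exported it, something (e.g. PyTorch) scrubbed it;
# CUDA_INJECTION64_PATH is read by NVBit directly and should still be set.
for var in ("LD_PRELOAD", "CUDA_INJECTION64_PATH"):
    print(f"{var} = {os.environ.get(var)}")
```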

ys-2020 commented 1 year ago

Thank you for your help!

Wen-Tian-Pineapple commented 10 months ago

> Thank you for your help!

Hello, were you able to run the simulation in SASS mode in the end? I'm trying to run the exact same Python file, but I keep getting a seg fault when processing kernel-17.traceg. Have you ever had the same problem? I've been having this issue for a while. Thanks!
