NVIDIA / nvtx-plugins

Python bindings for NVTX
https://docs.nvidia.com/deeplearning/frameworks/nvtx-plugins/user-guide/docs/en/stable/
Apache License 2.0

what is option2 #19

Open ethem-kinginthenorth opened 4 years ago

ethem-kinginthenorth commented 4 years ago

In this example, I am seeing "Option 1". What is Option 2?

Is there a clear example that shows how to use nvtx-plugins? 1) Is it op trace, start, and end? 2) Is it the NVTX hook? 3) Both?

I am trying to get the nvtx plugin working, but I keep getting "The application terminated before the collection started. No report was generated". I am definitely doing something wrong, but where?

DEKHTIARJonathan commented 4 years ago

Hi,

FYI: @ahmadki

Right on, we probably haven't commented our examples perfectly ...


I am trying to get the nvtx plugin working, but I keep getting "The application terminated before the collection started. No report was generated". I am definitely doing something wrong, but where?

This message means that the delay before profiling starts is longer than the time it takes for your program to finish, so nothing is ever collected.

If you look here: https://github.com/NVIDIA/nvtx-plugins/blob/master/examples/run_tf_session.sh

nsys profile \
  -d 60 \
  -w true \
  --force-overwrite=true \
  --sample=cpu \
  -t 'nvtx,cuda' \
  --stop-on-exit=true \
  --kill=sigkill \
  -o examples/tf_session_example \
  python examples/tf_session_example.py

Adjust these settings and you'll be fine ;)

$ nsys profile --help

    -y, --delay=
       Collection start delay in seconds.
       Default is 0.

    -d, --duration=
       Collection duration in seconds.
       Default is 0 seconds.

    ...
ethem-kinginthenorth commented 4 years ago

I am looking more at this one:

1)

# Option 1: use decorators
@nvtx_tf.ops.trace(message='Dense Block', grad_message='Dense Block grad',
                   domain_name='Forward', grad_domain_name='Gradient',
                   enabled=ENABLE_NVTX, trainable=True)

2) then I am seeing

x = inputs
x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1',
    grad_message='Dense 1 grad', domain_name='Forward',
    grad_domain_name='Gradient', trainable=True, enabled=ENABLE_NVTX)
x = tf.compat.v1.layers.dense(x, 1024, activation=tf.nn.relu, name='dense_1')
x = nvtx_tf.ops.end(x, nvtx_context)

3) then I also see:

nvtx_callback = NVTXHook(skip_n_steps=1, name='Train')
with tf.compat.v1.train.MonitoredSession(hooks=[nvtx_callback]) as sess:

I tried 1) and 2) individually, as well as in combination, but I still cannot get it to run.

The command I used to invoke the profiler is below:

nsys profile -w true -t "cudnn,cuda,osrt,nvtx" -c cudaProfilerApi --stop-on-range-end true --stop-on-exit=true \
--kill=sigkill \
--export=sqlite -o ./test python main.py --arch resnet50 \
--mode train --data_dir /raid/ethem/tfr_small \
--export_dir /raid/ethem/results \
--batch_size 128 --num_iter 1 \
--iter_unit epoch --results_dir /raid/ethem/results \
--display_every 10 --lr_init 0.01 --seed 12345

I am interested in profiling certain places in the code rather than a certain period of time. Thanks.

DEKHTIARJonathan commented 4 years ago

Try the following:

nsys profile \
  -d 60 \
  -w true \
  --force-overwrite=true \
  --sample=cpu \
  -t 'nvtx,cuda' \
  --stop-on-exit=true \
  --kill=sigkill \
  -o examples/tf_session_example \
  python main.py \
    --arch resnet50 \
    --mode train \
    --data_dir /raid/ethem/tfr_small \
    --export_dir /raid/ethem/results \
    --batch_size 128 \
    --num_iter 1 \
    --iter_unit epoch \
    --results_dir /raid/ethem/results \
    --display_every 10 \
    --lr_init 0.01 \
    --seed 12345

You don't need to combine Options 1, 2, and 3. They are completely independent.


I am interested in profiling certain places rather than a certain period of time.

You don't want to profile the whole training run; it doesn't make sense and it will hurt your performance. Profiling is meant to be done on a short script that is representative of the normal one. You can use some delay to account for warmup and library loading, but a profiling run doesn't need more than ~50 good iterations to be useful.

ethem-kinginthenorth commented 4 years ago

@DEKHTIARJonathan thanks for the suggestion. I will try.

You touched on a great point. I use pyprof with PyTorch, where I can control how many iterations to profile. I usually do 1 or 2 iterations (e.g., the 10th iteration), which gives me what I want. That is the reason I tried the nvtx plugin's start and end ops: to do the same thing with TensorFlow. Is there a way to control this, or is it only via the time-related parameters?

ethem-kinginthenorth commented 4 years ago

Here is the verdict: when I add -c cudaProfilerApi, I get the message "The application terminated before the collection started. No report was generated." When it is removed, independent of the -y and -d parameters, I do get something. The documentation says "profiling will start only when the cudaProfilerStart API is invoked", so I am a bit puzzled here. I checked nvidia-smi to make sure the GPU is being used, and I can see the usage. I appreciate any guidance here. Thanks.

ethem-kinginthenorth commented 4 years ago

So it seems like -c cudaProfilerApi is not working. As far as I understand, using start and end to limit the part of the code being profiled depends on this parameter (along with --stop-on-range-end true). Therefore, it also does NOT work. Please correct me if I am wrong.

rrforte commented 4 years ago

Using -c cudaProfilerApi limits capturing to the scope between cudaProfilerStart()/cudaProfilerStop() calls. Unless you call these functions in your own code (maybe using ctypes?), nothing will be captured, since nvtx-plugins does not call them and, AFAIK, TensorFlow doesn't either (though I am not certain).
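If you do go the cudaProfilerApi route, the calls can be made from Python via ctypes. A minimal sketch (the library name and error handling here are assumptions; adjust the shared-library name for your platform and CUDA install):

```python
import ctypes

def _cudart():
    # Assumption: on Linux the CUDA runtime is typically libcudart.so;
    # on Windows it would be cudart64_<version>.dll.
    return ctypes.CDLL("libcudart.so")

def cuda_profiler_start():
    # Opens the capture range that `nsys profile -c cudaProfilerApi` waits for.
    ret = _cudart().cudaProfilerStart()
    if ret != 0:
        raise RuntimeError("cudaProfilerStart failed with error %d" % ret)

def cuda_profiler_stop():
    # Closes the range; with --stop-on-range-end=true, nsys stops collecting here.
    ret = _cudart().cudaProfilerStop()
    if ret != 0:
        raise RuntimeError("cudaProfilerStop failed with error %d" % ret)
```

Calling cuda_profiler_start() right before the iteration you care about and cuda_profiler_stop() right after it gives you the pyprof-style control you described.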

Instead, you can try using -c nvtx -e NSYS_NVTX_PROFILER_REGISTER_ONLY=0 and then specify your outer range message and domain using -p message@domain (or just -p message if you are using the default domain).

Of course, you can disable the capture range limit entirely by removing -c. You already mentioned that you do get some results when doing so. Are the results OK in this case? Is there a particular reason for using -c cudaProfilerApi?
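Putting the -c nvtx suggestion together as a concrete command-line fragment (the range message and domain below are illustrative, taken from the 'Dense Block'/'Forward' names in the Option 1 snippet; they must match whatever your code actually emits):

```shell
# Start collection only when the NVTX range "Dense Block" in domain "Forward"
# opens, and stop when it closes. NSYS_NVTX_PROFILER_REGISTER_ONLY=0 allows
# matching ranges whose strings were not registered.
nsys profile \
  -c nvtx \
  -e NSYS_NVTX_PROFILER_REGISTER_ONLY=0 \
  -p 'Dense Block@Forward' \
  --stop-on-range-end=true \
  -o nvtx_range_capture \
  python main.py
```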

ethem-kinginthenorth commented 4 years ago

@rrforte I am trying to profile only 1 iteration. In PyTorch, I use pyprof and start and stop profiling at a certain iteration; with start and stop I use -c cudaProfilerApi --stop-on-range-end true. I am trying to do the same on the TensorFlow side. Whatever I do, I keep getting profiling for more than 1 iteration. I appreciate any help with that.
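The pattern being described, capturing exactly one chosen iteration, can be sketched in plain Python independent of the framework. Names here are illustrative; start_fn/stop_fn would be bound to cudaProfilerStart/cudaProfilerStop (e.g. via ctypes) so that `nsys profile -c cudaProfilerApi --stop-on-range-end true` captures just that window:

```python
def run_with_profiled_iteration(step_fn, num_iters, profile_iter, start_fn, stop_fn):
    """Run step_fn for num_iters steps, opening the profiler capture
    range only around iteration profile_iter (0-based)."""
    for i in range(num_iters):
        if i == profile_iter:
            start_fn()   # e.g. cudaProfilerStart: nsys begins collecting here
        step_fn(i)
        if i == profile_iter:
            stop_fn()    # e.g. cudaProfilerStop: with --stop-on-range-end,
                         # nsys stops and writes the report here
```

The iterations before profile_iter then double as warmup, analogous to pyprof's iteration-based control in PyTorch.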