Closed: Gabriel4256 closed this issue 2 years ago
@Gabriel4256, responding to your profiler queries above:
For profiling operators mapped to the CPU on TF 1.x you would need to run an inference under a tensorflow Session and then run model_analyzer.profile. Please refer to this example from our open source repo https://github.com/aws/aws-neuron-tensorflow/blob/1.16.0/python/saved_model.py#L367. More details about model_analyzer.profile may be found in our OpenPose tutorial at https://aws.amazon.com/blogs/machine-learning/deploying-tensorflow-openpose-on-aws-inferentia-based-inf1-instances-for-significant-price-performance-improvements/.
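To make that pattern concrete, here is a minimal sketch of running one traced inference under a Session and then calling the profiler. It uses a toy graph rather than a Neuron-compiled SavedModel, and `tf.compat.v1` so it also runs under a TF 2.x install; see the linked saved_model.py for the real Neuron version.

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

# Toy graph standing in for a loaded (Neuron-compiled) SavedModel.
g = tf1.Graph()
with g.as_default():
    x = tf1.placeholder(tf.float32, [1, 4], name="x")
    w = tf1.get_variable("w", [4, 2])
    y = tf1.matmul(x, w, name="y")

# Run one inference with full tracing so per-op timings are recorded.
run_meta = tf1.RunMetadata()
with tf1.Session(graph=g) as sess:
    sess.run(tf1.global_variables_initializer())
    sess.run(
        y,
        feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]},
        options=tf1.RunOptions(trace_level=tf1.RunOptions.FULL_TRACE),
        run_metadata=run_meta,
    )

# Per-operator time/memory profile, as in the linked saved_model.py example.
opts = tf1.profiler.ProfileOptionBuilder.time_and_memory()
prof = tf1.profiler.profile(graph=g, run_meta=run_meta, cmd="op", options=opts)
```

With a Neuron-compiled model, the NeuronOp shows up in this profile alongside any operators left on CPU.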
• CPU operators don't appear on the visualized graph in TensorBoard. How can I get a graph containing both CPU and Inferentia operators? CPU operations are not visible today in the TensorBoard/neuron-profile visualized graph. We will look into this as a possible extension in the future.
• TF2 models appear not to be fully profiled with the Neuron Plugin for TensorBoard: their execution times on the NeuronDevice and CPU are not calculated properly. Is there any other way to profile TF2 models?
Can you elaborate on the issues that you see with Neuron execution times for CPU and NeuronDevice? For CPU operators, TF 2.x does not support the model_analyzer.profile API. In theory tensorflow-neuron can work with the new "tracer view" interface (https://www.tensorflow.org/guide/profiler#sections_and_tracks), but we haven't tried it internally so far
• Is it possible to get the execution time of each operator in wall-clock time, rather than a number of cycles? The execution time per operator is an estimate based on the notification timestamps of the corresponding instruction(s). For Inf1, you can do the estimation using the conversion 1 cycle = 1 ns.
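Applying that conversion is simple arithmetic; a small hypothetical helper (the function name is mine):

```python
def cycles_to_wall_clock_ms(cycles, ns_per_cycle=1.0):
    """Estimate wall-clock time in milliseconds from a profiled cycle count.

    On Inf1 the suggested conversion is 1 cycle = 1 ns, so a cycle count
    maps directly to nanoseconds before scaling down to milliseconds.
    """
    return cycles * ns_per_cycle / 1_000_000

# e.g. an operator reported at 74,320,000 cycles:
print(cycles_to_wall_clock_ms(74_320_000))  # → 74.32
```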
@aws-joshim
> For profiling operators mapped to the CPU on TF 1.x you would need to run an inference under a tensorflow Session and then run model_analyzer.profile. Please refer to this example from our open source repo https://github.com/aws/aws-neuron-tensorflow/blob/1.16.0/python/saved_model.py#L367. More details about model_analyzer.profile may be found in our OpenPose tutorial at https://aws.amazon.com/blogs/machine-learning/deploying-tensorflow-openpose-on-aws-inferentia-based-inf1-instances-for-significant-price-performance-improvements/.
I succeeded in profiling CPU operators following your instructions. Thank you. I have some additional questions about profiling.
I got the timeline trace using the timeline_json option of model_analyzer.profile, but it doesn't contain information about memory copy time. How can I get memory copy time information (both from host memory to device memory and the opposite way)?
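For reference, the timeline_json output is in the Chrome trace event format, so wall-clock durations per op can be totaled directly from it. A sketch with a made-up two-event trace (the op names are hypothetical):

```python
import json

# Hypothetical two-event trace in the Chrome trace format that
# model_analyzer.profile's timeline_json option emits (names made up).
trace = json.loads("""
{"traceEvents": [
  {"name": "darknet/neuron_op_1", "ph": "X", "ts": 0,     "dur": 74320},
  {"name": "Conv2D",              "ph": "X", "ts": 74320, "dur": 1200}
]}
""")

# Sum duration (microseconds) per op name; "ph": "X" marks complete events.
per_op_us = {}
for ev in trace["traceEvents"]:
    if ev.get("ph") == "X":
        per_op_us[ev["name"]] = per_op_us.get(ev["name"], 0) + ev["dur"]

print(per_op_us)  # {'darknet/neuron_op_1': 74320, 'Conv2D': 1200}
```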
Also, is it possible to do the same thing on other frameworks such as TF2 and PyTorch?
> Can you elaborate on the issues that you see with Neuron execution times for CPU and NeuronDevice? For CPU operators, TF 2.x does not support the model_analyzer.profile API. In theory tensorflow-neuron can work with the new "tracer view" interface (https://www.tensorflow.org/guide/profiler#sections_and_tracks), but we haven't tried it internally so far
When I followed the BERT tutorial on TF2, I got the following result in TensorBoard: the Neuron execution time is not displayed properly.
@aws-joshim Are you still working on this issue? I just want to know when I can get the response.
Hi @Gabriel4256,
> I got the timeline trace using timeline_json option of model_analyzer.profile. But it doesn't contain information about the memory copy time. How can I get memory time information (both from host memory to device memory and the opposite way)?
The memory copy time to and from device is included in the NeuronOp execution time.
> Also, is it possible to do the same thing on other frameworks such as TF 2 and Pytorch?
For TF2, the model_analyzer.profile API is deprecated, but there is a similar one whose output can be viewed in TensorBoard (https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras#debug_performance_bottlenecks). For PyTorch, it is recommended to use torch.autograd.profiler (https://pytorch.org/docs/stable/_modules/torch/autograd/profiler.html), which is also compatible with Neuron. Similarly to TF, memory copy time is included in the neuron::forward_v2 operator. Regarding the Neuron Execution Time table, the compute time collection for TensorBoard will be fixed in an upcoming release.
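For the PyTorch side, a minimal torch.autograd.profiler sketch on a toy module (with torch-neuron, the compiled graph would instead appear as a single neuron::forward_v2 row in this table, per the answer above):

```python
import torch
from torch.autograd import profiler

# Toy module standing in for a Neuron-compiled model.
model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

# Record per-operator timings for one inference.
with profiler.profile() as prof:
    with torch.no_grad():
        model(x)

# Per-operator summary table, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```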
@aws-owinop Thanks for your response. I have an additional question.
> The memory copy time to and from device is included in the NeuronOp execution time.
Is there a way to get just the memory copy time, to and from the device respectively?
Currently this breakdown is not supported.
Then could I know the transfer speed between the host and Inferentia, so that I can approximate the memory copy time from it?
Hi Gabriel, please use TensorBoard to profile end-to-end performance as explained in detail in the earlier post by aws-owinop. The transfer speed from host to Inferentia is just one of the factors that impact performance. If you can describe in more detail the performance issue you are looking to resolve, I might be able to provide more specific advice.
@aws-zejdaj Hi, I am currently trying to solve the problems described below:
I have one more question. I am currently using model_analyzer.profile and TensorBoard to profile operator execution on CPU and Inferentia respectively, but there is a discrepancy in the Neuron execution time between them. For example, in the YOLOv3 model, model_analyzer.profile says it takes 74.32 ms to execute the Neuron op, but the TensorBoard profiler says 19.74964 ms, as shown below.
model_analyzer.profile:
darknet/neuron_op_40079fd99a167dfc (14.48MB/14.48MB, 74.32ms/74.32ms, 0us/0us, 74.32ms/74.32ms)
TensorBoard:
Based on the earlier post, my understanding is that the time calculated by model_analyzer.profile is the sum of the memory copy time (host <-> device) and the actual execution time on Inferentia, which is the "NeuronCore Time" in the TensorBoard profiler. But then, what is the meaning of "On CPU Time" in the TensorBoard profiler? Is my understanding correct?
Based on the partitioning, the compiled Neuron model can still contain operators that execute on CPU; these will not perform as well as having the whole model running on NeuronCores. The NeuronCore Time and On CPU Time displayed in TensorBoard both contribute to the compiled NeuronOp darknet/neuron_op_40079fd99a167dfc for a single inference.
In terms of the time you see in model_analyzer.profile, the YOLOv3 example has dynamic batch size enabled, with a compile-time batch size of 2 and an evaluation batch size of 8. As a result, model_analyzer.profile runs 4 inferences at a time.
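That means the 74.32 ms figure aggregates several device executions. A quick sanity check of the arithmetic (the even per-inference split is my approximation, not an exact Neuron accounting):

```python
import math

compile_batch, eval_batch = 2, 8

# Dynamic batching splits one eval-batch run into compile-batch-sized inferences.
inferences_per_run = math.ceil(eval_batch / compile_batch)
print(inferences_per_run)  # → 4

# So the profiled 74.32 ms covers 4 inferences, i.e. roughly
# 74.32 / 4 ≈ 18.58 ms each — the same order of magnitude as
# TensorBoard's per-inference 19.74964 ms figure.
per_inference_ms = 74.32 / inferences_per_run
print(round(per_inference_ms, 2))  # → 18.58
```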
Hi @Gabriel4256, please let us know if you still have problems with understanding the Neuron profile. Thanks!
But even when the model is compiled to run with batch size 1, there is a difference in the displayed execution time of conv5_block3_3_bn between TensorBoard and model_analyzer.profile. I used the ResNet50 tutorial here to compile and run the model. What else contributes to this difference?
The time shown by model_analyzer.profile also includes some overhead to setup the inputs and outputs of a NeuronOp, whereas the time in TensorBoard shows how much time is spent executing on the devices.
Does that answer your question?
It's been a few days now. I am going to assume we've addressed the question; closing.
Hi, team.
I have several questions about Neuron Plugin for TensorBoard.
I tried the YOLOv3 model in this tutorial, and was able to see only the operators running on Inferentia, even though the model actually contains many unsupported operators (e.g., TensorArrayV3, Enter, Merge, Switch, ...). CPU operators appear neither on the visualized graph nor in the execution time table; instead, all I can see is the total CPU execution time. Here is my execution environment:
result of pip list:

Thanks in advance.