Closed: Gabriel4256 closed this issue 2 years ago
@Gabriel4256, responding to your profiler queries above:
For profiling operators mapped to the CPU on TF 1.x you would need to run an inference under a tensorflow Session and then run model_analyzer.profile. Please refer to this example from our open source repo https://github.com/aws/aws-neuron-tensorflow/blob/1.16.0/python/saved_model.py#L367. More details about model_analyzer.profile may be found in our OpenPose tutorial at https://aws.amazon.com/blogs/machine-learning/deploying-tensorflow-openpose-on-aws-inferentia-based-inf1-instances-for-significant-price-performance-improvements/.
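To make that pattern concrete, here is a minimal sketch of running one traced inference under a Session and then calling the profiler. It uses a toy graph rather than a Neuron-compiled SavedModel, and `tf.compat.v1` so it also runs under a TF 2.x install; see the linked saved_model.py for the real Neuron version.

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

# Toy graph standing in for a loaded (Neuron-compiled) SavedModel.
g = tf1.Graph()
with g.as_default():
    x = tf1.placeholder(tf.float32, [1, 4], name="x")
    w = tf1.get_variable("w", [4, 2])
    y = tf1.matmul(x, w, name="y")

# Run one inference with full tracing so per-op timings are recorded.
run_meta = tf1.RunMetadata()
with tf1.Session(graph=g) as sess:
    sess.run(tf1.global_variables_initializer())
    sess.run(
        y,
        feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]},
        options=tf1.RunOptions(trace_level=tf1.RunOptions.FULL_TRACE),
        run_metadata=run_meta,
    )

# Per-operator time/memory profile, as in the linked saved_model.py example.
opts = tf1.profiler.ProfileOptionBuilder.time_and_memory()
prof = tf1.profiler.profile(graph=g, run_meta=run_meta, cmd="op", options=opts)
```

With a Neuron-compiled model, the NeuronOp shows up in this profile alongside any operators left on CPU.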
• CPU operators don't appear on the visualized graph in TensorBoard. How can I get a graph containing both CPU and Inferentia operators? CPU operations are not visible today in the TensorBoard/neuron-profile visualized graph. We will look into this as a possible extension in the future.
• TF2 models appear not to be fully profiled with the Neuron Plugin for TensorBoard: their execution times on the NeuronDevice and CPU are not calculated properly. Is there any other way to profile TF2 models?
Can you elaborate on the issues that you see with Neuron execution times for CPU and NeuronDevice? For CPU operators, TF 2.x does not support the model_analyzer.profile API. In theory tensorflow-neuron can work with the new "tracer view" interface (https://www.tensorflow.org/guide/profiler#sections_and_tracks), but we haven't tried it internally so far
• Is it possible to get the execution time of each operator in wall-clock time, rather than a number of cycles? The execution time per operator is an estimate based on the notification timestamps of the corresponding instruction(s). For Inf1, you can do the estimation using the conversion 1 cycle = 1 ns.
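Applying that conversion is simple arithmetic; a small hypothetical helper (the function name is mine):

```python
def cycles_to_wall_clock_ms(cycles, ns_per_cycle=1.0):
    """Estimate wall-clock time in milliseconds from a profiled cycle count.

    On Inf1 the suggested conversion is 1 cycle = 1 ns, so a cycle count
    maps directly to nanoseconds before scaling down to milliseconds.
    """
    return cycles * ns_per_cycle / 1_000_000

# e.g. an operator reported at 74,320,000 cycles:
print(cycles_to_wall_clock_ms(74_320_000))  # → 74.32
```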
@aws-joshim
> For profiling operators mapped to the CPU on TF 1.x you would need to run an inference under a tensorflow Session and then run model_analyzer.profile. Please refer to this example from our open source repo https://github.com/aws/aws-neuron-tensorflow/blob/1.16.0/python/saved_model.py#L367. More details about model_analyzer.profile may be found in our OpenPose tutorial at https://aws.amazon.com/blogs/machine-learning/deploying-tensorflow-openpose-on-aws-inferentia-based-inf1-instances-for-significant-price-performance-improvements/.
I succeeded in profiling CPU operators following your instructions. Thank you. I have some additional questions about profiling.
I got the timeline trace using the timeline_json option of model_analyzer.profile, but it doesn't contain information about memory copy time. How can I get memory copy time information (both from host memory to device memory and the opposite way)?
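For reference, the timeline_json output is in the Chrome trace event format, so wall-clock durations per op can be totaled directly from it. A sketch with a made-up two-event trace (the op names are hypothetical):

```python
import json

# Hypothetical two-event trace in the Chrome trace format that
# model_analyzer.profile's timeline_json option emits (names made up).
trace = json.loads("""
{"traceEvents": [
  {"name": "darknet/neuron_op_1", "ph": "X", "ts": 0,     "dur": 74320},
  {"name": "Conv2D",              "ph": "X", "ts": 74320, "dur": 1200}
]}
""")

# Sum duration (microseconds) per op name; "ph": "X" marks complete events.
per_op_us = {}
for ev in trace["traceEvents"]:
    if ev.get("ph") == "X":
        per_op_us[ev["name"]] = per_op_us.get(ev["name"], 0) + ev["dur"]

print(per_op_us)  # {'darknet/neuron_op_1': 74320, 'Conv2D': 1200}
```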
Also, is it possible to do the same thing on other frameworks such as TF2 and PyTorch?
> Can you elaborate on the issues that you see with Neuron execution times for CPU and NeuronDevice? For CPU operators, TF 2.x does not support the model_analyzer.profile API. In theory tensorflow-neuron can work with the new "tracer view" interface (https://www.tensorflow.org/guide/profiler#sections_and_tracks), but we haven't tried it internally so far
When I followed the BERT tutorial on TF2, I got the following result in TensorBoard: the Neuron execution time is not displayed properly.
@aws-joshim Are you still working on this issue? I just want to know when I can get the response.
Hi @Gabriel4256,
> I got the timeline trace using timeline_json option of model_analyzer.profile. But it doesn't contain information about the memory copy time. How can I get memory time information (both from host memory to device memory and the opposite way)?
The memory copy time to and from device is included in the NeuronOp execution time.
> Also, is it possible to do the same thing on other frameworks such as TF 2 and Pytorch?
For TF2, the model_analyzer.profile API is deprecated, but there is a similar one whose output can be viewed in TensorBoard (https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras#debug_performance_bottlenecks). For PyTorch, it is recommended to use torch.autograd.profiler (https://pytorch.org/docs/stable/_modules/torch/autograd/profiler.html), which is also compatible with Neuron. Similarly to TF, memory copy time is included in the neuron::forward_v2 operator. Regarding the Neuron Execution Time table, the compute time collection for TensorBoard will be fixed in an upcoming release.
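For the PyTorch side, a minimal torch.autograd.profiler sketch on a toy module (with torch-neuron, the compiled graph would instead appear as a single neuron::forward_v2 row in this table, per the answer above):

```python
import torch
from torch.autograd import profiler

# Toy module standing in for a Neuron-compiled model.
model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

# Record per-operator timings for one inference.
with profiler.profile() as prof:
    with torch.no_grad():
        model(x)

# Per-operator summary table, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```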
@aws-owinop Thanks for your response. I have an additional question.
> The memory copy time to and from device is included in the NeuronOp execution time.
Is there a way to get just the memory copy time, to and from the device respectively?
Currently this breakdown is not supported.
Then could I know the transfer speed between the host and Inferentia, so that I can approximate the memory copy time from it?
Hi Gabriel, please use TensorBoard to profile end-to-end performance as explained in detail in the earlier post by aws-owinop. The transfer speed from host to Inferentia is just one of the factors that impact performance. If you can describe in more detail the performance issue you are looking to resolve, I might be able to provide more specific advice.
@aws-zejdaj Hi, I am currently trying to solve the problems described below:
I have one more question. I am currently using model_analyzer.profile and TensorBoard to profile operator execution on CPU and Inferentia respectively, but there is a discrepancy in the Neuron execution time between them. For example, in the YOLOv3 model, model_analyzer.profile says it takes 74.32 ms to execute the Neuron op, but the TensorBoard profiler says 19.74964 ms, as shown below.
model_analyzer.profile:
darknet/neuron_op_40079fd99a167dfc (14.48MB/14.48MB, 74.32ms/74.32ms, 0us/0us, 74.32ms/74.32ms)
TensorBoard:
Based on the earlier post, my understanding is that the time calculated by model_analyzer.profile is the sum of the memory copy time (host <-> device) and the actual execution time on Inferentia, which is the "NeuronCore Time" in the TensorBoard profiler. But then, what is the meaning of "On CPU Time" in the TensorBoard profiler? Is my understanding correct?
Based on the partitioning, the compiled Neuron model can still contain operators that execute on CPU; these will not perform as well as having the whole model running on NeuronCores. The NeuronCore Time and On CPU Time displayed in TensorBoard both contribute to the compiled NeuronOp darknet/neuron_op_40079fd99a167dfc for a single inference.
In terms of the time you see in model_analyzer.profile, the YOLOv3 example has dynamic batch size enabled, with a compile-time batch size of 2 and an evaluation batch size of 8. As a result, model_analyzer.profile runs 4 inferences at a time.
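That means the 74.32 ms figure aggregates several device executions. A quick sanity check of the arithmetic (the even per-inference split is my approximation, not an exact Neuron accounting):

```python
import math

compile_batch, eval_batch = 2, 8

# Dynamic batching splits one eval-batch run into compile-batch-sized inferences.
inferences_per_run = math.ceil(eval_batch / compile_batch)
print(inferences_per_run)  # → 4

# So the profiled 74.32 ms covers 4 inferences, i.e. roughly
# 74.32 / 4 ≈ 18.58 ms each — the same order of magnitude as
# TensorBoard's per-inference 19.74964 ms figure.
per_inference_ms = 74.32 / inferences_per_run
print(round(per_inference_ms, 2))  # → 18.58
```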
Hi @Gabriel4256, please let us know if you still have problems with understanding the Neuron profile. Thanks!
But even when the model is compiled to run with batch size 1, there is a difference in the displayed execution time of conv5_block3_3_bn between TensorBoard and model_analyzer.profile. I used the ResNet50 tutorial here to compile and run the model. What else contributes to this difference?
The time shown by model_analyzer.profile also includes some overhead to setup the inputs and outputs of a NeuronOp, whereas the time in TensorBoard shows how much time is spent executing on the devices.
Does that answer your question?
It's been a few days now. I am going to assume we've addressed the question; closing.
Hi, team.
I have several questions about Neuron Plugin for TensorBoard.
I tried the YOLOv3 model in this tutorial, and was able to see only the operators running on Inferentia, even though the model actually contains many unsupported operators (e.g., TensorArrayV3, Enter, Merge, Switch, ...). CPU operators appear neither on the visualized graph nor in the execution time table; instead, all I can see is the total CPU execution time. Here is my execution environment:
result of pip list:

Thanks in advance.