ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware
https://rocm.docs.amd.com/projects/omniperf/en/latest/
MIT License

Unable to profile DLM: KeyError: 'BeginNs' #32

Closed clphuong closed 1 year ago

clphuong commented 1 year ago

Description: Some workloads fail during timestamp generation. The ShibuyaStream and DLM Accuracy tests both fail.

OS/distro: Ubuntu 5.15.0-52-generic #58~20.04.1-Ubuntu
ROCm Version: 5.2.0
Omniperf Version: 1.0.4dev
Logs of crash output:


[433 rows x 17 columns]
File 'dml_profile_DEEPSPEED_ROBERTA_data/dml_profile_DEEPSPEED_ROBERTA/mi200/timestamps.csv' is generating
Traceback (most recent call last):
  File "/home/svt/clement/omni/python-libs/pandas/core/indexes/base.py", line 3803, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'BeginNs'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 630, in <module>
    main()
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 525, in main
    omniperf_profile(args,VER)
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 376, in omniperf_profile
    replace_timestamps(workload_dir)
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 113, in replace_timestamps
    df_pmc_perf["BeginNs"] = df_stamps["BeginNs"]
  File "/home/svt/clement/omni/python-libs/pandas/core/frame.py", line 3804, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/svt/clement/omni/python-libs/pandas/core/indexes/base.py", line 3805, in get_loc
    raise KeyError(key) from err
KeyError: 'BeginNs'
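
For reference, the traceback shows `replace_timestamps` indexing `df_stamps["BeginNs"]` without first checking that the column exists. Below is a minimal sketch of a pre-check, assuming the timestamps file is a plain CSV with a header row. The helper name `missing_timestamp_columns` is hypothetical and is not part of omniperf; only the `BeginNs` column name comes from the traceback.

```python
import csv
import io


def missing_timestamp_columns(csv_text, required=("BeginNs",)):
    """Return the required column names that are absent from the CSV header.

    Hypothetical helper for illustration; not part of omniperf.
    """
    reader = csv.reader(io.StringIO(csv_text))
    # An empty file has no header row, so every required column is missing.
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Usage: fail with a readable message instead of a bare KeyError.
missing = missing_timestamp_columns("")
if missing:
    print("timestamps.csv is missing columns: " + ", ".join(missing))
```

A check like this would not fix the empty file, but it would point at the missing timestamp data rather than at a pandas lookup.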

Steps to reproduce:

  1. Install ROCm and omniperf
  2. Set export variables as per installation
  3. Clone the repository and edit tags.json:

    git clone https://github.com/ROCmSoftwarePlatform/DeepLearningModels
    cd DeepLearningModels
    # modify tags.json to the following:
    {
        "tags": [
            "pyt_train_huggingface_distilbert"
        ]
    }

    Then run:

    omniperf profile --name dml_profile --path dml_profile_data echo val | sudo -S ./tools/run_models.py --timeout 0

  4. Observe failure after 23 loops

Expected: timestamps.csv is generated and profiling succeeds.
Actual: KeyError; timestamps.csv is EMPTY and profiling fails.

jrmadsen commented 1 year ago

I'm not surprised timestamps.csv is empty.

omniperf profile --name dml_profile --path dml_profile_data echo val | sudo -S ./tools/run_models.py --timeout 0

You appear to be running omniperf on the echo command before piping your password (val) to sudo via stdin. You'll get an empty file if you run rocprof echo too.
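
To make the parsing concrete, here is a rough sketch that splits the original command line at the pipe, the way a shell separates pipeline stages. It uses Python's `shlex` as an approximation of shell tokenization, not a full bash parser; the function name `split_pipeline` is made up for this illustration.

```python
import shlex


def split_pipeline(command_line):
    """Split a command line into pipeline stages at each '|' token.

    Approximation only: shlex handles quoting and punctuation, but it
    does not implement the full shell grammar.
    """
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    stages, current = [], []
    for token in lexer:
        if token == "|":
            stages.append(current)
            current = []
        else:
            current.append(token)
    stages.append(current)
    return stages


cmd = ("omniperf profile --name dml_profile --path dml_profile_data "
       "echo val | sudo -S ./tools/run_models.py --timeout 0")
for stage in split_pipeline(cmd):
    print(stage)
```

The first stage holds everything up to the pipe, so `echo val` ends up as the trailing arguments of the omniperf invocation, and the second stage is `sudo -S ./tools/run_models.py --timeout 0` with no profiler in front of it.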

jrmadsen commented 1 year ago

I think you want:

echo val | sudo -S omniperf profile --name dml_profile --path dml_profile_data -- ./tools/run_models.py --timeout 0

jrmadsen commented 1 year ago

Although running a Python script with sudo is an absolutely horrendous idea.

clphuong commented 1 year ago

I think you want:

echo val | sudo -S omniperf profile --name dml_profile --path dml_profile_data -- ./tools/run_models.py --timeout 0

I did indeed use echo val, but it runs into an error saying "command val not found" (right before sudo -S, so that is absolutely my mistake).

Nonetheless, I definitely copy pasta'd the older command to get this ticket in. Note that I've also tried it with miperf, using -c "./tools/run_models.py --timeout 0".

Other commands I've tried are here:

omniperf profile --name dml_profile --path dml_profile_data -- ./tools/run_models.py --timeout 0

Running ShibuyaStream also yields similar results, so perhaps we can look into that as well.

coleramos425 commented 1 year ago

Archiving issue. Please reopen if issue persists.