Breakend / experiment-impact-tracker

Getting error "Problem with output in nvidia-smi pmon -c 10" #67

Open · guillaumeramey opened this issue 3 years ago

guillaumeramey commented 3 years ago

Hi, we're getting this error in the log file:

experiment_impact_tracker.compute_tracker.ImpactTracker - ERROR - Encountered exception within power monitor thread!
experiment_impact_tracker.compute_tracker.ImpactTracker - ERROR -   File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/compute_tracker.py", line 105, in launch_power_monitor
    _sample_and_log_power(log_dir, initial_info, logger=logger)
  File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/compute_tracker.py", line 69, in _sample_and_log_power
    results = header["routing"]["function"](process_ids, logger=logger, region=initial_info['region']['id'], log_dir=log_dir)
  File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/gpu/nvidia.py", line 117, in get_nvidia_gpu_power
    raise ValueError('Problem with output in nvidia-smi pmon -c 10')

Is it an issue with our Nvidia GPU? We are using a Tesla T4.

Breakend commented 3 years ago

Could you let us know what output you get if you run this from the command line on the machine you're using? This will help narrow down the source of the error.

$ nvidia-smi pmon -c 10
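(If there is no shell prompt available, e.g. on a hosted notebook, a minimal Python equivalent of the check above is sketched below. Using subprocess is just one way to do it; the point is to capture both stdout and stderr so any driver or permission error is visible alongside the normal pmon table.)

import subprocess

# Run the same command the tracker invokes and keep stdout and stderr
# separate, so an error message is not mistaken for an empty table.
result = subprocess.run(
    ["nvidia-smi", "pmon", "-c", "10"],
    capture_output=True,
    text=True,
)
print("return code:", result.returncode)
print(result.stdout)
print(result.stderr)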

guillaumeramey commented 3 years ago

I am using Google Colab, so it's not always the same GPU. I ran subprocess.getoutput('nvidia-smi pmon -c 10'), but it returned no process data:

# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              

With subprocess.getoutput('nvidia-smi') I obtained this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
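(A note on the two outputs above: every pmon data row shows "-" for the pid, sm and mem columns, and nvidia-smi reports "No running processes found", i.e. no per-process samples were available at that moment, which is presumably what the parser in experiment_impact_tracker/gpu/nvidia.py rejects with this ValueError. Keep in mind that idle rows like these are also expected on a fully supported machine whenever nothing is using the GPU. Below is a rough diagnostic sketch, not the library's actual parsing code, for checking whether pmon returns any usable process rows.)

import subprocess

def pmon_has_process_rows(sample_count=10):
    """Return True if nvidia-smi pmon reports at least one real process row.

    Rough diagnostic only: idle rows show "-" in the pid column, while a
    tracked process shows a numeric PID in the second column.
    """
    out = subprocess.getoutput("nvidia-smi pmon -c %d" % sample_count)
    for line in out.splitlines():
        if line.startswith("#"):  # skip the two header lines
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            return True
    return False

print(pmon_has_process_rows())

If this returns False while a GPU job is actually running, per-process accounting is likely not available in that environment.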

Breakend commented 3 years ago

Hi, unfortunately Colab isn't fully supported right now because it doesn't always expose the hardware endpoints required to calculate energy use. We are working on solutions and will follow up if we have something that works.
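(For anyone who wants to check up front whether a given Colab VM exposes anything usable, a small sketch follows. It assumes, and this is my assumption rather than something stated in this thread, that CPU-side energy readings come from the Linux powercap/RAPL sysfs interface and GPU-side readings from nvidia-smi.)

import os
import subprocess

# Is the Intel RAPL powercap interface visible at all? On many virtualized
# or containerized hosts this directory simply is not exposed.
rapl_path = "/sys/class/powercap/intel-rapl"
print("RAPL sysfs present:", os.path.isdir(rapl_path))

# nvidia-smi can still report instantaneous board power draw even when
# pmon shows no per-process rows.
print(subprocess.getoutput("nvidia-smi --query-gpu=power.draw --format=csv"))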