fangq / mcx

Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator
http://mcx.space

Maxwell GPU may get locked in P2 state when running mcx #18

Closed fangq closed 8 years ago

fangq commented 8 years ago

I noticed that my Maxwell (980Ti) GPU occasionally drops to a low-clock state after running MCX for some time. I posted my question at the CUDA forum:

https://devtalk.nvidia.com/default/topic/917213/cuda-programming-and-performance/maxwell-suddernly-becomes-10x-slower/

Someone there pointed out the P2 state. More searching turned up a few similar reports:

http://www.overclock.net/t/1553214/lower-memory-clocks-and-locked-p2-power-state-on-the-gtx-970
https://www.reddit.com/r/nvidia/comments/3au46o/gtx_970_p2_memory_clock_drops/
http://www.prepar3d.com/forum/viewtopic.php?t=109003

I am not entirely sure whether this is caused by MCX on Maxwell or is a bug in Maxwell itself.

fangq commented 8 years ago

When this happens, nvidia-smi prints the following log. It clearly shows that the 980Ti is locked in the P2 state. Rebooting does not fix the card.

fangq@wazu:~/space/git/Project$ nvidia-smi 
Tue Feb 16 22:56:13 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 590     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   65C    P0    N/A /  N/A |    165MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 590     Off  | 0000:04:00.0     N/A |                  N/A |
| 45%   44C   P12    N/A /  N/A |      5MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti  Off  | 0000:05:00.0     Off |                  N/A |
|  0%   48C    P2   174W / 250W |    144MiB /  6143MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2     15471    C   ../../bin/mcx                                  122MiB |
+-----------------------------------------------------------------------------+
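
A state like the one in the table above can also be checked from a script. The sketch below is only an illustration, not part of MCX: it assumes the CSV text comes from `nvidia-smi --query-gpu=index,name,pstate,clocks.sm,clocks.mem --format=csv,noheader,nounits`, and the helper names are hypothetical. The sample row is assembled by hand from the values reported in this thread, not captured output.

```python
import csv
import io

# Column order of the assumed query:
#   nvidia-smi --query-gpu=index,name,pstate,clocks.sm,clocks.mem \
#              --format=csv,noheader,nounits
FIELDS = ["index", "name", "pstate", "clocks_sm", "clocks_mem"]


def parse_gpu_rows(csv_text):
    """Turn the CSV text into one dict per GPU, keyed by FIELDS."""
    rows = []
    for raw in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not raw:  # skip blank lines
            continue
        rows.append(dict(zip(FIELDS, (cell.strip() for cell in raw))))
    return rows


def gpus_below_p0(rows):
    """Return the indices of GPUs whose reported state is not P0."""
    return [int(r["index"]) for r in rows if r["pstate"] != "P0"]


# Hand-built sample row using the 980Ti values from this thread (assumption:
# this is what the query would print for that card).
sample = "2, GeForce GTX 980 Ti, P2, 1303, 3304\n"
print(gpus_below_p0(parse_gpu_rows(sample)))
```

To use it on a live machine, the CSV text could be taken from `subprocess.run([...], capture_output=True, text=True).stdout`. Older cards such as the GTX 590 may report unsupported fields, so real input would likely need extra filtering.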

fangq commented 8 years ago

the full nvidia-smi log is also attached below:

fangq@wazu:~/space/git/Project$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Tue Feb 16 22:54:06 2016
Driver Version                      : 352.63

Attached GPUs                       : 3
GPU 0000:03:00.0
    Product Name                    : GeForce GTX 590
    Product Brand                   : GeForce
 .....

GPU 0000:04:00.0
    Product Name                    : GeForce GTX 590
    Product Brand                   : GeForce
 .....

GPU 0000:05:00.0
    Product Name                    : GeForce GTX 980 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-e7884ccd-ea7b-3fa6-bf17-3e6bac064ef2
    Minor Number                    : 2
    VBIOS Version                   : 84.00.32.00.94
    MultiGPU Board                  : No
    Board ID                        : 0x500
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x05
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x17C810DE
        Bus Id                      : 0000:05:00.0
        Sub System Id               : 0x49933842
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 2
            Link Width
                Max                 : 16x
                Current             : 8x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 0 %
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Active
    FB Memory Usage
        Total                       : 6143 MiB
        Used                        : 144 MiB
        Free                        : 5999 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 4 MiB
        Free                        : 252 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 100 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 47 C
        GPU Shutdown Temp           : 97 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 173.24 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 275.00 W
    Clocks
        Graphics                    : 1303 MHz
        SM                          : 1303 MHz
        Memory                      : 3304 MHz
    Applications Clocks
        Graphics                    : 1101 MHz
        Memory                      : 3505 MHz
    Default Applications Clocks
        Graphics                    : 1101 MHz
        Memory                      : 3505 MHz
    Max Clocks
        Graphics                    : 1493 MHz
        SM                          : 1493 MHz
        Memory                      : 3505 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 14839
            Type                    : C
            Name                    : ../../bin/mcx
            Used GPU Memory         : 122 MiB
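
The relevant fields in a long `nvidia-smi -a` dump are easy to miss by eye. The following is a minimal sketch, assuming the indented "key : value" layout shown above; `parse_sections` and the other helper names are hypothetical. It flattens the log into "Section/Key" paths, then lists the active throttle reasons and the gap between the current and maximum memory clock. The excerpt is copied from the log above.

```python
LOG_EXCERPT = """\
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Active
    Clocks
        Graphics                    : 1303 MHz
        SM                          : 1303 MHz
        Memory                      : 3304 MHz
    Max Clocks
        Graphics                    : 1493 MHz
        SM                          : 1493 MHz
        Memory                      : 3505 MHz
"""


def parse_sections(log_text):
    """Map 'Section/Key' paths to values for an indented 'key : value' log."""
    result = {}
    stack = []  # (indent, heading) pairs enclosing the current line
    for line in log_text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        # Drop headings that are not parents of this line
        while stack and stack[-1][0] >= indent:
            stack.pop()
        key, sep, value = line.strip().partition(" : ")
        if sep:
            path = "/".join([h for _, h in stack] + [key.strip()])
            result[path] = value.strip()
        else:
            stack.append((indent, line.strip()))
    return result


def active_throttle_reasons(info):
    """Names of the throttle reasons whose value is exactly 'Active'."""
    prefix = "Clocks Throttle Reasons/"
    return [k[len(prefix):] for k, v in info.items()
            if k.startswith(prefix) and v == "Active"]


def mhz(value):
    """Convert a string such as '3304 MHz' to an integer."""
    return int(value.split()[0])


info = parse_sections(LOG_EXCERPT)
print(info["Performance State"], active_throttle_reasons(info))
print(mhz(info["Max Clocks/Memory"]) - mhz(info["Clocks/Memory"]), "MHz below max memory clock")
```

For this excerpt the only active throttle reason is "Unknown", and the memory clock sits 201 MHz below its maximum.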

fangq commented 8 years ago

using the following command, I was able to change the card from the P2 state to P0

sudo nvidia-smi -i 2 -ac 3505,1493

However, MCX still did not run at the maximum possible speed.

fangq@wazu:~/space/git/Project$ nvidia-smi 
Tue Feb 16 23:07:46 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 590     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   65C    P0    N/A /  N/A |    165MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 590     Off  | 0000:04:00.0     N/A |                  N/A |
| 45%   44C   P12    N/A /  N/A |      5MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti  Off  | 0000:05:00.0     Off |                  N/A |
|  0%   51C    P0   173W / 250W |    144MiB /  6143MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2     18925    C   ../../bin/mcx                                  122MiB |
+-----------------------------------------------------------------------------+

and nvidia-smi -a outputs the following:

GPU 0000:05:00.0
    Product Name                    : GeForce GTX 980 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-e7884ccd-ea7b-3fa6-bf17-3e6bac064ef2
    Minor Number                    : 2
    VBIOS Version                   : 84.00.32.00.94
    MultiGPU Board                  : No
    Board ID                        : 0x500
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x05
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x17C810DE
        Bus Id                      : 0000:05:00.0
        Sub System Id               : 0x49933842
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 2
            Link Width
                Max                 : 16x
                Current             : 8x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 0 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Active
    FB Memory Usage
        Total                       : 6143 MiB
        Used                        : 144 MiB
        Free                        : 5999 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 4 MiB
        Free                        : 252 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 100 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 50 C
        GPU Shutdown Temp           : 97 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 177.12 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 275.00 W
    Clocks
        Graphics                    : 1303 MHz
        SM                          : 1303 MHz
        Memory                      : 3505 MHz
    Applications Clocks
        Graphics                    : 1493 MHz
        Memory                      : 3505 MHz
    Default Applications Clocks
        Graphics                    : 1101 MHz
        Memory                      : 3505 MHz
    Max Clocks
        Graphics                    : 1493 MHz
        SM                          : 1493 MHz
        Memory                      : 3505 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 20368
            Type                    : C
            Name                    : ../../bin/mcx
            Used GPU Memory         : 122 MiB
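
The `-ac` option used earlier in this comment takes the memory clock first and the graphics clock second. A small sketch of building that command, and of comparing the requested graphics clock with what the log above reports, might look like this. The function names are hypothetical; the numbers are taken from this comment.

```python
def application_clocks_command(gpu_index, memory_mhz, graphics_mhz):
    """Build the argv for `nvidia-smi -i <gpu> -ac <memory>,<graphics>`."""
    return ["nvidia-smi", "-i", str(gpu_index), "-ac",
            f"{memory_mhz},{graphics_mhz}"]


def clock_gap(requested_mhz, reported_mhz):
    """How far the reported clock is below the requested one, in MHz."""
    return requested_mhz - reported_mhz


# Values from this comment: GPU 2, memory 3505 MHz, graphics 1493 MHz
cmd = application_clocks_command(2, 3505, 1493)
print(" ".join(cmd))

# The log above lists Applications Clocks / Graphics at 1493 MHz,
# while Clocks / Graphics still reads 1303 MHz.
print(clock_gap(1493, 1303), "MHz below the requested graphics clock")
```

Running the command for real needs root, as in the `sudo` call above, for example via `subprocess.run(["sudo", *cmd], check=True)`.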

fangq commented 8 years ago

Upgraded CUDA to 7.5.18; no change.

Exited X and ran "nvidia-smi -i 2 -r" to reset the 980Ti GPU. The command succeeded, but it had no impact on speed.

Also tried the CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 and CUDA_VISIBLE_DEVICES environment variables. The settings took effect, but again there was no impact on speed.

fangq commented 8 years ago

It looks like this is not a hardware issue. It is likely related to either the CUDA toolkit version or MCX.

Here are some new findings:

  1. I found another brand-new 980Ti from my collaborator, and I got the same slow speed (1200 photon/ms) with the latest git code (29ea42)
  2. for an older mcx release (v0.9.7-2), if I run the pre-compiled binary (linked with cuda-6.0), I got 14000 photon/ms on the same 980Ti
  3. for v0.9.7-2, if I recompile the source code using cuda 6, I was able to get the good speed (14000) from the new binary
  4. for v0.9.7-2, if I recompile the source using cuda 7.5, I only got 1500 photon/ms
  5. if I recompile the latest git code (29ea42) using cuda 6.5, I got 1200 photon/ms
  6. recompiling the latest git code using cuda 6.5 on another machine with a 980 (not 980Ti), I got 14400 photon/ms
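
To put the numbers in the list above side by side, here is a rough sketch that computes each run's slowdown relative to the fastest one. The labels are shorthand paraphrases of the six items and the helper name is hypothetical; the speeds are the ones reported in the list.

```python
# (short label for the list item, reported speed in photon/ms)
findings = [
    ("1: git 29ea42, second 980Ti", 1200),
    ("2: v0.9.7-2 prebuilt, cuda 6.0, 980Ti", 14000),
    ("3: v0.9.7-2 rebuilt, cuda 6", 14000),
    ("4: v0.9.7-2 rebuilt, cuda 7.5", 1500),
    ("5: git 29ea42, cuda 6.5", 1200),
    ("6: latest git, cuda 6.5, 980", 14400),
]


def slowdown_vs_fastest(results):
    """Ratio of the fastest reported speed to each run's speed."""
    fastest = max(speed for _, speed in results)
    return {label: round(fastest / speed, 1) for label, speed in results}


for label, ratio in slowdown_vs_fastest(findings).items():
    print(f"{ratio:5.1f}x slower than fastest  {label}")
```

The slow runs come out roughly 10x to 12x behind the fast ones, while the fast runs are within a few percent of each other.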

fninaparavecino commented 8 years ago

Indeed, it is a toolkit issue. Using the CUDA 6.5 toolkit with the latest version of MCX (SHA 8656942c68e7c50e9083e2082d12d847d619476a) on wazu with the 980 Ti, I got an MCX simulation speed of 15384.62 photon/ms. During compilation I specified the compute capability and SM code for compute_52 (i.e. -gencode arch=compute_52,code=sm_52).
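
For reference, the -gencode option pairs a virtual architecture (compute_XY) with a real one (sm_XY). The sketch below only shows how such a flag could be assembled for a given compute capability. It is not the MCX build system, and the file names `kernel.cu` and `kernel.o` are placeholders.

```python
def gencode_flag(major, minor):
    """Return the nvcc arguments targeting one compute capability."""
    arch = f"{major}{minor}"
    return ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]


# Compute capability 5.2, as named in the comment above.
# kernel.cu / kernel.o are hypothetical file names for illustration.
nvcc_cmd = ["nvcc", *gencode_flag(5, 2), "-c", "kernel.cu", "-o", "kernel.o"]
print(" ".join(nvcc_cmd))
```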

fninaparavecino commented 8 years ago

We solved this issue with help from NVIDIA developers here:

https://devtalk.nvidia.com/default/topic/925630/cuda-programming-and-performance/cuda-7-5-on-maxwell-980ti-drops-performance-by-10x-versus-cuda-7-0-and-6-5/1

Also, it seems that a fix will come with the new driver release.

fangq commented 8 years ago

This turns out to be a CUDA bug. Internal bug report 1747451 was filed by NVIDIA developer txbob. The problem has been identified, and a fix will ship with a new CUDA driver.

Issue now closed.