When this happens, nvidia-smi prints the following log. It clearly shows that the 980 Ti is locked in the P2 state. Rebooting does not fix the card.
fangq@wazu:~/space/git/Project$ nvidia-smi
Tue Feb 16 22:56:13 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 590     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   65C    P0    N/A /  N/A |    165MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 590     Off  | 0000:04:00.0     N/A |                  N/A |
| 45%   44C   P12    N/A /  N/A |      5MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti  Off  | 0000:05:00.0     Off |                  N/A |
|  0%   48C    P2   174W / 250W |    144MiB /  6143MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2     15471    C   ../../bin/mcx                                  122MiB |
+-----------------------------------------------------------------------------+
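For anyone who wants to check the performance state from a script instead of reading the table, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and that the installed driver supports the `--query-gpu` fields `index`, `name`, `pstate`, `clocks.sm` and `clocks.mem`. The helper names (`parse_gpu_states`, `query_gpu_states`, `gpus_not_in_p0`) are made up for illustration and are not part of MCX or of any NVIDIA tool.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi (assumed to be supported by the driver).
QUERY_FIELDS = "index,name,pstate,clocks.sm,clocks.mem"


def parse_gpu_states(csv_text):
    """Parse the text printed by
    `nvidia-smi --query-gpu=<QUERY_FIELDS> --format=csv,noheader,nounits`."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != 5:
            continue  # skip blank or malformed lines
        index, name, pstate, sm_clock, mem_clock = (f.strip() for f in rec)
        rows.append({
            "index": int(index),
            "name": name,
            "pstate": pstate,
            # Clocks are kept as strings: older cards may report a
            # placeholder instead of a number.
            "sm_mhz": sm_clock,
            "mem_mhz": mem_clock,
        })
    return rows


def query_gpu_states():
    """Run nvidia-smi and return one dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_states(out)


def gpus_not_in_p0(rows):
    """Return the GPUs whose performance state is anything other than P0."""
    return [r for r in rows if r["pstate"] != "P0"]
```

Calling `gpus_not_in_p0(query_gpu_states())` while a job is running lists every card that is not in P0. Note that this also includes idle cards (such as GPU 1 at P12 above), so the result is only meaningful for the GPU that is actually running the simulation.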
The full nvidia-smi log is also attached below:
fangq@wazu:~/space/git/Project$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Tue Feb 16 22:54:06 2016
Driver Version : 352.63
Attached GPUs : 3
GPU 0000:03:00.0
Product Name : GeForce GTX 590
Product Brand : GeForce
.....
GPU 0000:04:00.0
Product Name : GeForce GTX 590
Product Brand : GeForce
.....
GPU 0000:05:00.0
Product Name : GeForce GTX 980 Ti
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-e7884ccd-ea7b-3fa6-bf17-3e6bac064ef2
Minor Number : 2
VBIOS Version : 84.00.32.00.94
MultiGPU Board : No
Board ID : 0x500
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x05
Device : 0x00
Domain : 0x0000
Device Id : 0x17C810DE
Bus Id : 0000:05:00.0
Sub System Id : 0x49933842
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P2
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Active
FB Memory Usage
Total : 6143 MiB
Used : 144 MiB
Free : 5999 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 4 MiB
Free : 252 MiB
Compute Mode : Default
Utilization
Gpu : 100 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 47 C
GPU Shutdown Temp : 97 C
GPU Slowdown Temp : 92 C
Power Readings
Power Management : Supported
Power Draw : 173.24 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 150.00 W
Max Power Limit : 275.00 W
Clocks
Graphics : 1303 MHz
SM : 1303 MHz
Memory : 3304 MHz
Applications Clocks
Graphics : 1101 MHz
Memory : 3505 MHz
Default Applications Clocks
Graphics : 1101 MHz
Memory : 3505 MHz
Max Clocks
Graphics : 1493 MHz
SM : 1493 MHz
Memory : 3505 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 14839
Type : C
Name : ../../bin/mcx
Used GPU Memory : 122 MiB
Using the following command, I was able to switch the card from the P2 state to P0:
sudo nvidia-smi -i 2 -ac 3505,1493
However, mcx still did not run at the maximum possible speed.
fangq@wazu:~/space/git/Project$ nvidia-smi
Tue Feb 16 23:07:46 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 590     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   65C    P0    N/A /  N/A |    165MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 590     Off  | 0000:04:00.0     N/A |                  N/A |
| 45%   44C   P12    N/A /  N/A |      5MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti  Off  | 0000:05:00.0     Off |                  N/A |
|  0%   51C    P0   173W / 250W |    144MiB /  6143MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2     18925    C   ../../bin/mcx                                  122MiB |
+-----------------------------------------------------------------------------+
and nvidia-smi -a outputs the following:
GPU 0000:05:00.0
Product Name : GeForce GTX 980 Ti
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-e7884ccd-ea7b-3fa6-bf17-3e6bac064ef2
Minor Number : 2
VBIOS Version : 84.00.32.00.94
MultiGPU Board : No
Board ID : 0x500
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x05
Device : 0x00
Domain : 0x0000
Device Id : 0x17C810DE
Bus Id : 0000:05:00.0
Sub System Id : 0x49933842
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Active
FB Memory Usage
Total : 6143 MiB
Used : 144 MiB
Free : 5999 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 4 MiB
Free : 252 MiB
Compute Mode : Default
Utilization
Gpu : 100 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 50 C
GPU Shutdown Temp : 97 C
GPU Slowdown Temp : 92 C
Power Readings
Power Management : Supported
Power Draw : 177.12 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 150.00 W
Max Power Limit : 275.00 W
Clocks
Graphics : 1303 MHz
SM : 1303 MHz
Memory : 3505 MHz
Applications Clocks
Graphics : 1493 MHz
Memory : 3505 MHz
Default Applications Clocks
Graphics : 1101 MHz
Memory : 3505 MHz
Max Clocks
Graphics : 1493 MHz
SM : 1493 MHz
Memory : 3505 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 20368
Type : C
Name : ../../bin/mcx
Used GPU Memory : 122 MiB
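The log above reports applications clocks of 1493 MHz for graphics, yet the actual SM clock is still 1303 MHz, which is about 87% of the 1493 MHz maximum. Below is a rough Python sketch for pulling these numbers out of one GPU's section of the `nvidia-smi -a` text so they can be compared automatically. The function names are hypothetical, and the parser only assumes the "section header, then `key : value` lines" layout seen in the logs in this thread.

```python
import re

# Section headers we care about, as they appear in the logs above.
CLOCK_SECTIONS = ("Clocks", "Applications Clocks", "Max Clocks")


def parse_clock_sections(log_text):
    """Collect MHz values from the clock sections of ONE GPU's
    `nvidia-smi -a` text. Works on indented or flattened text."""
    sections = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" not in line:
            # A line without "key : value" starts a new section.
            current = line if line in CLOCK_SECTIONS else None
            continue
        if current is None:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        match = re.match(r"(\d+)\s*MHz$", value)
        if match:
            sections.setdefault(current, {})[key] = int(match.group(1))
    return sections


def sm_clock_fraction(sections):
    """Current SM clock as a fraction of the maximum SM clock."""
    return sections["Clocks"]["SM"] / sections["Max Clocks"]["SM"]


def max_application_clock_args(gpu_index, sections):
    """Build the `nvidia-smi -i N -ac MEM,GRAPHICS` argument list
    from the Max Clocks section (run it with root privileges)."""
    max_clocks = sections["Max Clocks"]
    pair = f"{max_clocks['Memory']},{max_clocks['Graphics']}"
    return ["nvidia-smi", "-i", str(gpu_index), "-ac", pair]
```

Fed the clock lines from the log above, `max_application_clock_args(2, ...)` reproduces the arguments of the `-ac 3505,1493` command used earlier. The sketch only reports the gap between requested and actual clocks and does not explain it.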
Upgraded CUDA to 7.5.18; no change.
Exited X and ran "nvidia-smi -i 2 -r" to reset the 980 Ti GPU. The command succeeded, but it had no impact on speed.
Also tried the CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 and CUDA_VISIBLE_DEVICES environment variables. The settings took effect, but again there was no impact on speed.
It looks like this is not a hardware issue. It is likely related to either the CUDA toolkit version or MCX.
Here are some new findings:
In fact, it is true: this is a toolkit issue. Using the CUDA 6.5 toolkit with the latest version of MCX (SHA 8656942c68e7c50e9083e2082d12d847d619476a) on wazu with the 980 Ti, I got an MCX simulation speed of 15384.62 photon/ms. During compilation, I specified the compute capability and SM code for compute_52 (i.e. -gencode arch=compute_52,code=sm_52).
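The `-gencode` option quoted above follows a fixed pattern: compute capability 5.2 becomes `arch=compute_52,code=sm_52`. As a small illustration, this Python sketch builds such flags for a list of compute capabilities. The helper name `gencode_flags` is hypothetical and is not part of the MCX build system, which may set these flags differently.

```python
def gencode_flags(compute_capabilities):
    """Turn compute capabilities such as "5.2" into nvcc -gencode
    arguments of the form arch=compute_XY,code=sm_XY."""
    flags = []
    for cc in compute_capabilities:
        major, minor = cc.split(".")
        arch = major + minor  # "5.2" -> "52"
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags
```

For example, `gencode_flags(["5.2"])` yields the same pair of arguments used for the 980 Ti build above.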
We solved this issue with help from NVIDIA developers (https://devtalk.nvidia.com/default/topic/925630/cuda-programming-and-performance/cuda-7-5-on-maxwell-980ti-drops-performance-by-10x-versus-cuda-7-0-and-6-5/1). It also seems that a fix is coming in a new driver release.
This turns out to be a CUDA bug. NVIDIA developer txbob filed internal bug report 1747451. The problem has been identified, and a fix will ship with a new CUDA driver.
Issue now closed.
I noticed that my Maxwell (980 Ti) GPU occasionally drops to a low-clock state after running MCX for some time. I posted my question on the CUDA forum:
https://devtalk.nvidia.com/default/topic/917213/cuda-programming-and-performance/maxwell-suddernly-becomes-10x-slower/
Someone pointed out the P2 state. More googling revealed a few similar incidents:
http://www.overclock.net/t/1553214/lower-memory-clocks-and-locked-p2-power-state-on-the-gtx-970
https://www.reddit.com/r/nvidia/comments/3au46o/gtx_970_p2_memory_clock_drops/
http://www.prepar3d.com/forum/viewtopic.php?t=109003
I am not entirely sure whether this is caused by MCX running on Maxwell or is a bug in Maxwell itself.