calab-ntu / gpu-cluster

Eureka and Spock GPU clusters

Monitor error messages in log files of all nodes #12

Open xuanweishan opened 3 years ago

xuanweishan commented 3 years ago

Example error message:

NVRM: Xid (PCI:0000:41:00): 31, pid=120493, Ch 00000008, intr 00000000. 
MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x7f27_e42aa000. 
Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
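
One possible starting point, sketched in Python: a small scanner that picks `NVRM: Xid` lines out of a stream of log lines and extracts the PCI address and Xid code. The names `XidEvent`, `XID_PATTERN`, and `find_xid_events` are hypothetical, and the regular expression is only an assumption based on the example message above; other driver versions may format the line differently.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class XidEvent(NamedTuple):
    """One parsed NVRM Xid message (hypothetical helper type)."""
    pci_address: str
    xid_code: int
    detail: str


# Assumed format, taken from the example message in this issue:
#   NVRM: Xid (PCI:0000:41:00): 31, pid=120493, Ch 00000008, intr 00000000.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+),\s*(.*)")


def find_xid_events(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield an XidEvent for every line that contains an NVRM Xid message."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield XidEvent(
                pci_address=match.group(1),
                xid_code=int(match.group(2)),
                detail=match.group(3).strip(),
            )


if __name__ == "__main__":
    sample = [
        "NVRM: Xid (PCI:0000:41:00): 31, pid=120493, Ch 00000008, intr 00000000. ",
        "some unrelated log line",
    ]
    for event in find_xid_events(sample):
        print(event.pci_address, event.xid_code, event.detail)
```

Because the function accepts any iterable of strings, it could be fed an open log file or the output of a command such as `dmesg` on each node. Which log files to read, and how to collect results across nodes, is left open here.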