Open pszi1ard opened 6 years ago
Not sure if related, but in the meantime I see the following messages in dmesg:
[209359.980927] perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[258359.848121] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[258359.848125] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[258359.848154] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000
[258359.848174] pcieport 0000:00:02.0: [14] Completion Timeout (First)
[258359.848190] pcieport 0000:00:02.0: broadcast error_detected message
[258359.848193] pcieport 0000:00:02.0: AER: Device recovery failed
[258359.865616] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[258359.865619] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[258359.865665] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000
[258359.865698] pcieport 0000:00:02.0: [14] Completion Timeout (First)
[258359.865726] pcieport 0000:00:02.0: broadcast error_detected message
[258359.865728] pcieport 0000:00:02.0: AER: Device recovery failed
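For anyone reading the log above: the `error status/mask=00004000/00000000` value is a bit field, and bit 14 being set is what the kernel prints as `[14] Completion Timeout`. Below is a minimal Python sketch that decodes such a line. The bit table is a partial subset of the PCIe AER Uncorrectable Error Status bits, and the function name `decode_aer_status` is made up for illustration; it is not part of any kernel or ROCm tooling.

```python
import re

# Partial map of PCIe AER Uncorrectable Error Status bits.
# Assumption: only a subset is listed here; unlisted bits decode as "unknown".
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" part of a pcieport dmesg line.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_aer_status(line):
    """Return [(bit, name), ...] for every unmasked status bit set in the line."""
    match = STATUS_RE.search(line)
    if match is None:
        return []
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    unmasked = status & ~mask  # ignore bits that are set in the mask
    return [
        (bit, AER_UNCORRECTABLE_BITS.get(bit, "unknown"))
        for bit in range(32)
        if unmasked & (1 << bit)
    ]


line = ("[258359.848154] pcieport 0000:00:02.0: device [8086:2f04] "
        "error status/mask=00004000/00000000")
print(decode_aer_status(line))
```

Run on the status line from the log, this reports only bit 14, which matches the `Completion Timeout (First)` message the kernel printed right after it.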
I have the team looking into this.
GROMACS runs that completed fine before now stall and fail to complete since the last ROCm update. Symptoms: with small inputs that take on the order of hundreds of microseconds per iteration (one clFinish per iteration), the run stalls after a few thousand to tens of thousands of iterations. Backtrace:
Reproduce with the following command: gmx mdrun -ntmpi 1 -nb gpu -pme cpu -notunepme -nsteps -1 -s water-0000.96.tpr -v (runs an infinite number of iterations; terminate with SIGTERM if needed). The input file can be obtained from here: https://www.dropbox.com/s/627cyb2nzmwrqi0/water-0000.96.tpr
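Since the stall can take tens of thousands of iterations to show up, it may help to run the reproducer unattended under a watchdog. The following Python sketch starts a command and reports a stall if the process prints nothing for a given number of seconds. It assumes the command emits progress output regularly while healthy (as `-v` is expected to for mdrun). The function name `run_with_stall_watchdog` and the timeout values are hypothetical, not part of GROMACS or ROCm.

```python
import subprocess
import sys
import threading
import time


def run_with_stall_watchdog(cmd, stall_seconds, poll_interval=0.1):
    """Run cmd; return "stalled" if it is silent for stall_seconds, else "exited"."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # progress output may go to stderr
        text=True,                 # universal newlines: "\r" also ends a line
    )
    last_output = [time.monotonic()]

    def pump():
        # Record the time of the most recent output line from the child.
        for _ in proc.stdout:
            last_output[0] = time.monotonic()

    threading.Thread(target=pump, daemon=True).start()

    while proc.poll() is None:
        if time.monotonic() - last_output[0] > stall_seconds:
            print("no output for %.1f s, terminating" % stall_seconds,
                  file=sys.stderr)
            proc.terminate()  # SIGTERM first, as suggested for the reproducer
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()   # assumption: fall back to SIGKILL if it hangs
                proc.wait()
            return "stalled"
        time.sleep(poll_interval)
    return "exited"
```

A possible invocation with the reproducer above would be `run_with_stall_watchdog(["gmx", "mdrun", "-ntmpi", "1", "-nb", "gpu", "-pme", "cpu", "-notunepme", "-nsteps", "-1", "-s", "water-0000.96.tpr", "-v"], stall_seconds=60)`. The 60-second threshold is a guess and would need tuning. A process blocked inside the driver may not react to SIGTERM, which is why the sketch falls back to `kill()`.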