MooreThreads / torch_musa

torch_musa is an open source repository based on PyTorch, which can make full use of the super computing power of MooreThreads graphics cards.

GPU fault (1: Guilty Lockup) detected when processing kick #22

Closed: dixyes closed this issue 6 months ago

dixyes commented 8 months ago

It crashes once the combined VRAM usage of the two graphics cards exceeds 16 GB.

[234620.907129] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da779de00 flags=0x0000]
[234620.907155] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7794200 flags=0x0000]
[234620.907172] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7792200 flags=0x0000]
[234620.907187] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7799e00 flags=0x0000]
[234620.907203] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7796200 flags=0x0000]
[234620.907219] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da778e200 flags=0x0000]
[234620.907234] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7783500 flags=0x0000]
[234620.907249] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7798200 flags=0x0000]
[234620.907266] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7788d00 flags=0x0000]
[234620.907281] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da7784d00 flags=0x0000]
[234620.907297] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da7790200 flags=0x0000]
[234620.907312] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da7787500 flags=0x0000]
[234620.907327] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da778a200 flags=0x0000]
[234620.907341] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da7795500 flags=0x0000]
[234620.907356] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da778c200 flags=0x0000]
[234620.907370] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da77a9600 flags=0x0000]
[234620.907385] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da779f600 flags=0x0000]
[234620.907400] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da779d600 flags=0x0000]
[234620.907414] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da7797500 flags=0x0000]
[234620.907429] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da779b600 flags=0x0000]
[234620.918852] MTGPU:  550: ------------[ MTGPU DBG: START (High) ]------------
[234620.918856] MTGPU:  550: OS kernel info: Linux 5.4.0-169-generic #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023 x86_64
[234620.918858] MTGPU:  550: DDK info: MUSA_Driver_Linux_WS musadriver 1.0@0 (release) mtgpu_linux
[234620.918859] MTGPU:  550: Time now: 234620966898us
[234620.918860] MTGPU:  550: Services State: OK
[234620.918862] MTGPU:  550: Server Errors: 0
[234620.918866] MTGPU:  550: Connections Device ID:1(129) P20911-V20911-T20911-python, P21057-V21057-T21057-python
[234620.918868] MTGPU:  550: ------[ Driver Info ]------
[234620.918869] MTGPU:  550: Comparison of UM/KM components: MATCHING
[234620.918870] MTGPU:  550: KM Arch: 64 Bit
[234620.918871] MTGPU:  550: UM Connected Clients: 64 Bit
[234620.918873] MTGPU:  550: UM info: 1.0 @        0 (release) build options: 0x80000850
[234620.918875] MTGPU:  550: KM info: 1.0 @        0 (release) build options: 0x00000850
[234620.918876] MTGPU:  550: Window system: xorg
[234620.918878] MTGPU:  550: ------[ Server Thread Summary ]------
[234620.918879] MTGPU:  550:   mtgpu cleanup : Running
[234620.918881] MTGPU:  550:     Number of deferred cleanup items Queued : 0
[234620.918882] MTGPU:  550:     Number of deferred cleanup items dropped after retry limit reached : 0
[234620.918883] MTGPU:  550:   mtgpu watchdog : Running
[234620.918887] MTGPU:  550: ------[ MUSA Device ID:1 Start ]------
[234620.918888] MTGPU:  550: ------[ MUSA Info ]------
[234620.918891] MTGPU:  550: Device Node (Info): 00000000f8918306 (00000000570e9e81)
[234620.918892] MTGPU:  550: MUSA Version: 1.0.0.0 (volcanic)
[234620.918894] MTGPU:  550: MUSA Device State: Active
[234620.918895] MTGPU:  550: MUSA Power State: ON
[234620.918901] MTGPU:  550: FW info: 1.0 @        0 (release) build options: 0x80000850
[234620.918902] MTGPU:  550: TRP: HW support - No
[234620.918903] MTGPU:  550: WGP: HW support - No
[234620.918910] MTGPU:  550: MMU (Core) - OK
[234620.918913] MTGPU:  550: MMU (Meta) - OK
[234620.918919] MTGPU:  550: MUSA FW State: OK (HWRState 0x00000001: HWR OK;)
[234620.918923] MTGPU:  550: MUSA FW Power State: RGXFWIF_POW_ON (APM disabled: 0 ok, 0 denied, 0 non-idle, 0 retry, 0 other, 0 total. Latency: 1000 ms)
[234620.918951] MTGPU:  550: MUSA DVFS: 0 frequency changes. Current frequency: 1799.983 MHz (sampled at 234632055234583 ns). FW frequency: 1800.000 MHz.
[234620.918958] MTGPU:  550: MUSA FW OS 0 - State: active; Freelists: Ok; Priority: 0; MTS on;
[234620.919003] MTGPU:  550: Number of HWR: GP(0/0+0), TDM(0/0+0), GEOM(0/0+0), 3D(0/0+0), CDM(1/1+0), FALSE(0,0,0,0,0)
[234620.919005] MTGPU:  550: DM 0 (GP)
[234620.919036] MTGPU:  550: DM 1 (HWRflags 0x00000000: working;)
[234620.919058] MTGPU:  550: DM 2 (HWRflags 0x00000000: working;)
[234620.919080] MTGPU:  550: DM 3 (HWRflags 0x00000000: working;)
[234620.919102] MTGPU:  550: DM 4 (HWRflags 0x00000000: working;)
[234620.919116] MTGPU:  550:   Recovery 1: Core = 0, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000200, Guilty Lockup
[234620.919122] MTGPU:  550:               CRTimer = 0x01802EC5F10F, OSTimer = 234634.998294975, CyclesElapsed = 177064448
[234620.919128] MTGPU:  550:               PreResetTimeInCycles = 86528, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 426496
[234620.919133] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919139] MTGPU:  550:       * MMU status (0x03150083E9D03409 | 0x00000025): PC = 21, Reading from 0x83E9D03400, MCU L1(IP1 Global), Fault (Page Table).
[234620.919555] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919557] MTGPU:  550:     PDE for index 334 = 0x000000adbad00000 and is not valid
[234620.919558] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919572] MTGPU:  550:   Recovery 1: Core = 1, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.919577] MTGPU:  550:               CRTimer = 0x01802EC5F12C, OSTimer = 234634.998299100, CyclesElapsed = 177071872
[234620.919583] MTGPU:  550:               PreResetTimeInCycles = 79104, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 419072
[234620.919588] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919592] MTGPU:  550:       * MMU status (0x03150083E9B52089 | 0x000050C3): PC = 21, Reading from 0x83E9B52080, MCU L1(IP1 Global), Fault (Page Table).
[234620.919601] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919602] MTGPU:  550:     PDE for index 333 = 0x000000adbad00000 and is not valid
[234620.919603] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919616] MTGPU:  550:   Recovery 1: Core = 2, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.919621] MTGPU:  550:               CRTimer = 0x01802EC5F147, OSTimer = 234634.998302940, CyclesElapsed = 177078784
[234620.919626] MTGPU:  550:               PreResetTimeInCycles = 72192, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 412160
[234620.919630] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919633] MTGPU:  550:       * MMU status (0x01150083E98AC509 | 0x00001035): PC = 21, Reading from 0x83E98AC500, MCU L1(IP0 Global), Fault (Page Table).
[234620.919641] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919642] MTGPU:  550:     PDE for index 332 = 0x000000adbad00000 and is not valid
[234620.919642] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919655] MTGPU:  550:   Recovery 1: Core = 3, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.919659] MTGPU:  550:               CRTimer = 0x01802EC5F168, OSTimer = 234634.998307633, CyclesElapsed = 177087232
[234620.919664] MTGPU:  550:               PreResetTimeInCycles = 63744, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 403712
[234620.919668] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919671] MTGPU:  550:       * MMU status (0x03150083E9BA2589 | 0x000053DA): PC = 21, Reading from 0x83E9BA2580, MCU L1(IP1 Global), Fault (Page Table).
[234620.919679] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919680] MTGPU:  550:     PDE for index 333 = 0x000000adbad00000 and is not valid
[234620.919680] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919693] MTGPU:  550:   Recovery 1: Core = 4, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.919697] MTGPU:  550:               CRTimer = 0x01802EC5F189, OSTimer = 234634.998312327, CyclesElapsed = 177095680
[234620.919702] MTGPU:  550:               PreResetTimeInCycles = 55296, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 395264
[234620.919706] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919709] MTGPU:  550:       * MMU status (0x01150083E9841609 | 0x000013E0): PC = 21, Reading from 0x83E9841600, MCU L1(IP0 Global), Fault (Page Table).
[234620.919717] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919718] MTGPU:  550:     PDE for index 332 = 0x000000adbad00000 and is not valid
[234620.919718] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919730] MTGPU:  550:   Recovery 1: Core = 5, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.919735] MTGPU:  550:               CRTimer = 0x01802EC5F1A4, OSTimer = 234634.998316167, CyclesElapsed = 177102592
[234620.919740] MTGPU:  550:               PreResetTimeInCycles = 48384, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 388352
[234620.919744] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919747] MTGPU:  550:       * MMU status (0x01150083E9C60E89 | 0x0000126D): PC = 21, Reading from 0x83E9C60E80, MCU L1(IP0 Global), Fault (Page Table).
[234620.919755] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919756] MTGPU:  550:     PDE for index 334 = 0x000000adbad00000 and is not valid
[234620.919756] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919768] MTGPU:  550:   Recovery 1: Core = 6, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.919773] MTGPU:  550:               CRTimer = 0x01802EC5F1C1, OSTimer = 234634.998320291, CyclesElapsed = 177110016
[234620.919778] MTGPU:  550:               PreResetTimeInCycles = 40960, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 380928
[234620.919782] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919785] MTGPU:  550:       * MMU status (0x03150083E9CC3B09 | 0x00005217): PC = 21, Reading from 0x83E9CC3B00, MCU L1(IP1 Global), Fault (Page Table).
[234620.919793] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919794] MTGPU:  550:     PDE for index 334 = 0x000000adbad00000 and is not valid
[234620.919794] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919806] MTGPU:  550:   Recovery 1: Core = 7, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.919811] MTGPU:  550:               CRTimer = 0x01802EC5F1E2, OSTimer = 234634.998324985, CyclesElapsed = 177118464
[234620.919816] MTGPU:  550:               PreResetTimeInCycles = 32512, HWResetTimeInCycles = 339968, TotalResetTimeInCycles = 372480
[234620.919820] MTGPU:  550:     MMU (Core) - FAULT:
[234620.919824] MTGPU:  550:       * MMU status (0x01150083E9CE2389 | 0x00001043): PC = 21, Reading from 0x83E9CE2380, MCU L1(IP0 Global), Fault (Page Table).
[234620.919831] MTGPU:  550:     PCE for index 527 = 0x03518051 and is valid
[234620.919832] MTGPU:  550:     PDE for index 334 = 0x000000adbad00000 and is not valid
[234620.919833] MTGPU:  550:     PT index (0) out of bounds (0)
[234620.919846] MTGPU:  550: MUSA Kernel CCB WO:0x30 RO:0x30
[234620.919848] MTGPU:  550: MUSA Firmware CCB WO:0x2 RO:0x2
[234620.919850] MTGPU:  550: MUSA Kernel CCB commands executed = 13360
[234620.919852] MTGPU:  550: MUSA SLR: Forced UFO updates requested = 0
[234620.919854] MTGPU:  550: MUSA Errors: WGP:0, TRP:0
[234620.919856] MTGPU:  550: MUSA FW thread 0: FW IRQ count = 1665, Last sampled IRQ count in LISR = 1665
[234620.919857] MTGPU:  550: MUSA FW thread 1: FW IRQ count = 0, Last sampled IRQ count in LISR = 0
[234620.919861] MTGPU:  550: FW System config flags = 0x20020000 (Ctx switch options: Medium CSW profile; ISP v1 scheduling;)
[234620.919864] MTGPU:  550: FW OS config flags = 0x0000001B (Ctx switch: TDM; GEOM; CDM; RDM;)
[234620.919865] MTGPU:  550: ------[ MUSA registers ]------
[234620.919865] MTGPU:  550: MUSA Register Base Address (Linear):   0x000000008c394f38
[234620.919866] MTGPU:  550: MUSA Register Base Address (Physical): 0xF6C00000
[234620.919871] MTGPU:  550: MULTICORE                     : 0x00000007F8000000
[234620.919873] MTGPU:  550: MULTICORE_SYSTEM              : 0x00000008
[234620.919874] MTGPU:  550: MULTICORE_DOMAIN              : 0x00000008
[234620.919876] MTGPU:  550: EVENT_STATUS                  : 0x00000000
[234620.919879] MTGPU:  550: TIMER                         : 0x000001802EC6176A
[234620.919882] MTGPU:  550: CLK_CTRL0                     : 0xAA8A02000A222202
[234620.919885] MTGPU:  550: CLK_STATUS0                   : 0x0000000000000000
[234620.919888] MTGPU:  550: CLK_CTRL1                     : 0xAAA02A2AAA8AA2AA
[234620.919891] MTGPU:  550: CLK_STATUS1                   : 0x0000000000000000
[234620.919894] MTGPU:  550: MMU_FAULT_STATUS1             : 0x0000000000000000
[234620.919897] MTGPU:  550: MMU_FAULT_STATUS2             : 0x0000000000000000
[234620.919899] MTGPU:  550: MMU_FAULT_STATUS_PM           : 0x0000000000000000
[234620.919902] MTGPU:  550: MMU_FAULT_STATUS_META         : 0x0000000000000000
[234620.919905] MTGPU:  550: SLC_STATUS1                   : 0x0000000000000000
[234620.919908] MTGPU:  550: SLC_STATUS2                   : 0x0200000000000000
[234620.919911] MTGPU:  550: SLC_STATUS_DEBUG              : 0x0000000000000000
[234620.919914] MTGPU:  550: MMU_STATUS                    : 0x0000000000000000
[234620.919915] MTGPU:  550: BIF_PFS                       : 0x000000FF
[234620.919917] MTGPU:  550: BIF_TEXAS0_PFS                : 0x0000000F
[234620.919919] MTGPU:  550: BIF_TEXAS1_PFS                : 0x0000000F
[234620.919920] MTGPU:  550: BIF_OUTSTANDING_READ          : 0x00000000
[234620.919922] MTGPU:  550: BIF_TEXAS0_OUTSTANDING_READ   : 0x00000000
[234620.919924] MTGPU:  550: BIF_TEXAS1_OUTSTANDING_READ   : 0x00000000
[234620.919925] MTGPU:  550: FBCDC_IDLE                    : 0x00003FFF
[234620.919927] MTGPU:  550: FBCDC_STATUS                  : 0x00000000
[234620.919929] MTGPU:  550: SPU_ENABLE                    : 0x00000000
[234620.919932] MTGPU:  550: CONTEXT_MAPPING0              : 0x0000000000000000
[234620.919934] MTGPU:  550: CONTEXT_MAPPING2              : 0x0000000000000000
[234620.919937] MTGPU:  550: CONTEXT_MAPPING3              : 0x0000000000000000
[234620.919940] MTGPU:  550: CONTEXT_MAPPING4              : 0x0000000000000000
[234620.919943] MTGPU:  550: MMU_OSID_CTXT_MAPPING0        : 0x0000000000000000
[234620.919946] MTGPU:  550: MMU_OSID_CTXT_MAPPING1        : 0x0000000000000000
[234620.919947] MTGPU:  550: MULTICORE_AXI                 : 0x00000000
[234620.919949] MTGPU:  550: MULTICORE_AXI_ERROR           : 0x00000000
[234620.919951] MTGPU:  550: MULTICORE_TDM_CTRL_COMMON     : 0x000001FF
[234620.919953] MTGPU:  550: MULTICORE_FRAGMENT_CTRL_COMMON: 0x000010FF
[234620.919954] MTGPU:  550: MULTICORE_COMPUTE_CTRL_COMMON : 0x000001FF
[234620.919956] MTGPU:  550: PERF_PHASE_2D                 : 0x00000000
[234620.919958] MTGPU:  550: PERF_CYCLE_2D_TOTAL           : 0x00000000
[234620.919959] MTGPU:  550: PERF_PHASE_GEOM               : 0x00000000
[234620.919961] MTGPU:  550: PERF_CYCLE_GEOM_TOTAL         : 0x00000000
[234620.919963] MTGPU:  550: PERF_PHASE_FRAG               : 0x00000000
[234620.919964] MTGPU:  550: PERF_CYCLE_FRAG_TOTAL         : 0x00000000
[234620.919966] MTGPU:  550: PERF_CYCLE_GEOM_OR_FRAG_TOTAL : 0x00000000
[234620.919968] MTGPU:  550: PERF_CYCLE_GEOM_AND_FRAG_TOTAL: 0x00000000
[234620.919969] MTGPU:  550: PERF_PHASE_COMP               : 0x00000000
[234620.919971] MTGPU:  550: PERF_CYCLE_COMP_TOTAL         : 0x00000000
[234620.919973] MTGPU:  550: PM_PARTIAL_RENDER_ENABLE      : 0x00000000
[234620.919974] MTGPU:  550: ISP_RENDER                    : 0x00077500
[234620.919976] MTGPU:  550: ISP_CTL                       : 0x00780000
[234620.919978] MTGPU:  550: MTS_INTCTX                    : 0x00000000
[234620.919979] MTGPU:  550: MTS_BGCTX                     : 0x00000000
[234620.919981] MTGPU:  550: MTS_BGCTX_COUNTED_SCHEDULE    : 0x00000000
[234620.919983] MTGPU:  550: MTS_SCHEDULE                  : 0x00000000
[234620.919984] MTGPU:  550: MTS_GPU_INT_STATUS            : 0x00000128
[234620.919986] MTGPU:  550: CDM_CONTEXT_STORE_STATUS      : 0x00000000
[234620.919989] MTGPU:  550: CDM_CONTEXT_PDS0              : 0x0000000000000000
[234620.919992] MTGPU:  550: CDM_CONTEXT_PDS1              : 0x0000000040000000
[234620.919994] MTGPU:  550: CDM_TERMINATE_PDS             : 0x0000000000000000
[234620.919997] MTGPU:  550: CDM_TERMINATE_PDS1            : 0x0000000040000000
[234620.920000] MTGPU:  550: CDM_CONTEXT_LOAD_PDS0         : 0x0000000000000000
[234620.920003] MTGPU:  550: CDM_CONTEXT_LOAD_PDS1         : 0x0000000040000000
[234620.920005] MTGPU:  550: JONES_IDLE                    : 0x0000FC7F
[234620.920006] MTGPU:  550: SLC_IDLE                      : 0x000FFFFF
[234620.920008] MTGPU:  550: SLC_FAULT_STOP_STATUS         : 0x00000000
[234620.920011] MTGPU:  550: SCRATCH0                      : 0x0000000000000000
[234620.920014] MTGPU:  550: SCRATCH1                      : 0x0000000000000000
[234620.920017] MTGPU:  550: SCRATCH2                      : 0x0000000000000000
[234620.920019] MTGPU:  550: SCRATCH3                      : 0x0000000000000000
[234620.920022] MTGPU:  550: SCRATCH4                      : 0x0000000000000000
[234620.920025] MTGPU:  550: SCRATCH5                      : 0x0000000000000000
[234620.920028] MTGPU:  550: SCRATCH6                      : 0x0000000000000000
[234620.920031] MTGPU:  550: SCRATCH7                      : 0x0000000000000000
[234620.920033] MTGPU:  550: SCRATCH8                      : 0x0000000000000000
[234620.920036] MTGPU:  550: SCRATCH9                      : 0x0000000000000000
[234620.920039] MTGPU:  550: SCRATCH10                     : 0x0000000000000000
[234620.920042] MTGPU:  550: SCRATCH11                     : 0x0000000000000000
[234620.920045] MTGPU:  550: SCRATCH12                     : 0x0000000000000000
[234620.920047] MTGPU:  550: SCRATCH13                     : 0x0000000000000000
[234620.920050] MTGPU:  550: SCRATCH14                     : 0x0000000000000000
[234620.920053] MTGPU:  550: SCRATCH15                     : 0x0000000000000000
[234620.920055] MTGPU:  550: IRQ_OS0_EVENT_STATUS          : 0x00000000
[234620.920057] MTGPU:  550: META_SP_MSLVIRQSTATUS         : 0x00000000
[234620.920063] MTGPU:  550: T0 TXENABLE                   : 0x0201C031
[234620.920068] MTGPU:  550: T0 TXSTATUS                   : 0x00020008
[234620.920074] MTGPU:  550: T0 TXDEFR                     : 0x00000000
[234620.920091] MTGPU:  550: T0 PC                         : 0x8000AE10
[234620.920109] MTGPU:  550: T0 PCX                        : 0x40002534
[234620.920126] MTGPU:  550: T0 SP                         : 0x82000050
[234620.920132] MTGPU:  550: T1 TXENABLE                   : 0x0201C131
[234620.920137] MTGPU:  550: T1 TXSTATUS                   : 0x00020000
[234620.920143] MTGPU:  550: T1 TXDEFR                     : 0x00000000
[234620.920160] MTGPU:  550: T1 PC                         : 0x8000AC88
[234620.920177] MTGPU:  550: T1 PCX                        : 0x00000000
[234620.920208] MTGPU:  550: T1 SP                         : 0x82001020
[234620.920209] MTGPU:  550: ------[ MUSA FW Trace Info ]------
[234620.920211] MTGPU:  550: Debug log type: none
[234620.920213] MTGPU:  550: MUSA FW thread 0: Trace buffer not yet allocated
[234620.920214] MTGPU:  550: Debug log type: none
[234620.920216] MTGPU:  550: MUSA FW thread 1: Trace buffer not yet allocated
[234620.920216] MTGPU:  550: ------[ Full CCB Status ]------
[234620.920221] MTGPU:  550: FWCtx 0x10042380 (CDM-P20911-T20911-python)
[234620.920222] MTGPU:  550: Page Table Root 0x00000003`00140000
[234620.920222] MTGPU:  550:   `--<Empty>
[234620.920226] MTGPU:  550: FWCtx 0x10042580 (CDM-P20911-T20911-python)
[234620.920226] MTGPU:  550: Page Table Root 0x00000003`00140000
[234620.920227] MTGPU:  550:   `--<Empty>
[234620.920230] MTGPU:  550: FWCtx 0x10042780 (CDM-P20911-T20911-python)
[234620.920231] MTGPU:  550: Page Table Root 0x00000003`00140000
[234620.920231] MTGPU:  550:   `--<Empty>
[234620.920236] MTGPU:  550: FWCtx 0x10042D00 (CDM-P21057-T21057-python)
[234620.920241] MTGPU:  550: Page Table Root 0x00000003`003E0000
[234620.920244] MTGPU:  550:   `--<Empty>
[234620.920251] MTGPU:  550: FWCtx 0x10093000 (CDM-P21057-T21057-python)
[234620.920254] MTGPU:  550: Page Table Root 0x00000003`003E0000
[234620.920259] MTGPU:  550:   `--<Empty>
[234620.920268] MTGPU:  550:   |--Retired CDM @ 1536 Int=1552 Ext=67109119
[234620.920277] MTGPU:  550:   |--Retired UPDATE @ 1728 Int=1552 Ext=67109119
[234620.920283] MTGPU:  550:   |  |--Addr:0xf0094008 Val=0x00001388
[234620.920289] MTGPU:  550:   |  |--Addr:0xf0094004 Val=0x000038e8
[234620.920294] MTGPU:  550:   |  `--Addr:0xf0095001 Val=0x00000519
[234620.920302] MTGPU:  550:   |--Retired CDM @ 1792 Int=1555 Ext=67109120
[234620.920311] MTGPU:  550:   |--Retired UPDATE @ 1984 Int=1555 Ext=67109120
[234620.920318] MTGPU:  550:   |  |--Addr:0xf0094008 Val=0x000013ec
[234620.920326] MTGPU:  550:   |  |--Addr:0xf0094004 Val=0x000039e8
[234620.920331] MTGPU:  550:   |  `--Addr:0xf00950c9 Val=0x00000519
[234620.920338] MTGPU:  550:   |--Retired CDM @ 2048 Int=1563 Ext=67109121
[234620.920345] MTGPU:  550:   |--Retired UPDATE @ 2240 Int=1563 Ext=67109121
[234620.920349] MTGPU:  550:   |  |--Addr:0xf0094008 Val=0x00001450
[234620.920352] MTGPU:  550:   |  |--Addr:0xf0094004 Val=0x00003ae8
[234620.920356] MTGPU:  550:   |  `--Addr:0xf0095031 Val=0x00000519
[234620.920362] MTGPU:  550:   |--Retired CDM @ 2304 Int=1598 Ext=67109122
[234620.920368] MTGPU:  550:   `--Retired UPDATE @ 2496 Int=1598 Ext=67109122
[234620.920372] MTGPU:  550:      |--Addr:0xf0094008 Val=0x000014b4
[234620.920375] MTGPU:  550:      |--Addr:0xf0094004 Val=0x00003be8
[234620.920378] MTGPU:  550:      `--Addr:0xf0095049 Val=0x00000519
[234620.920381] MTGPU:  550: FWCtx 0x10093180 (CDM-P21057-T21057-python)
[234620.920383] MTGPU:  550: Page Table Root 0x00000003`003E0000
[234620.920384] MTGPU:  550:   `--<Empty>
[234620.920388] MTGPU:  550: FWCtx 0x10042080 (TQ_TDM-P20911-T20911-python)
[234620.920389] MTGPU:  550: Page Table Root 0x00000003`00140000
[234620.920389] MTGPU:  550:   `--<Empty>
[234620.920393] MTGPU:  550: FWCtx 0x10042180 (TQ_TDM-P20911-T20911-python)
[234620.920394] MTGPU:  550: Page Table Root 0x00000003`00140000
[234620.920394] MTGPU:  550:   `--<Empty>
[234620.920398] MTGPU:  550: FWCtx 0x10042280 (TQ_TDM-P20911-T20911-python)
[234620.920398] MTGPU:  550: Page Table Root 0x00000003`00140000
[234620.920398] MTGPU:  550:   `--<Empty>
[234620.920402] MTGPU:  550: FWCtx 0x10042A00 (TQ_TDM-P21057-T21057-python)
[234620.920402] MTGPU:  550: Page Table Root 0x00000003`003E0000
[234620.920403] MTGPU:  550:   `--<Empty>
[234620.920406] MTGPU:  550: FWCtx 0x10042B00 (TQ_TDM-P21057-T21057-python)
[234620.920407] MTGPU:  550: Page Table Root 0x00000003`003E0000
[234620.920408] MTGPU:  550:   `--<Empty>
[234620.920414] MTGPU:  550:   |--Retired TQ_TDM @ 11840 Int=1594 Ext=0
[234620.920420] MTGPU:  550:   |--Retired UPDATE @ 11936 Int=1594 Ext=0
[234620.920424] MTGPU:  550:   |  |--Addr:0xf0091000 Val=0x0000051e
[234620.920427] MTGPU:  550:   |  `--Addr:0xf0095111 Val=0x00000519
[234620.920433] MTGPU:  550:   |--Retired TQ_TDM @ 11992 Int=1595 Ext=0
[234620.920439] MTGPU:  550:   |--Retired UPDATE @ 12088 Int=1595 Ext=0
[234620.920444] MTGPU:  550:   |  |--Addr:0xf0091000 Val=0x0000051f
[234620.920447] MTGPU:  550:   |  `--Addr:0xf0095019 Val=0x00000519
[234620.920452] MTGPU:  550:   |--Retired TQ_TDM @ 12144 Int=1596 Ext=0
[234620.920458] MTGPU:  550:   |--Retired UPDATE @ 12240 Int=1596 Ext=0
[234620.920462] MTGPU:  550:   |  |--Addr:0xf0091000 Val=0x00000520
[234620.920465] MTGPU:  550:   |  `--Addr:0xf0095089 Val=0x00000519
[234620.920471] MTGPU:  550:   |--Retired TQ_TDM @ 12296 Int=1597 Ext=0
[234620.920477] MTGPU:  550:   `--Retired UPDATE @ 12392 Int=1597 Ext=0
[234620.920482] MTGPU:  550:      |--Addr:0xf0091000 Val=0x00000521
[234620.920488] MTGPU:  550:      `--Addr:0xf0095021 Val=0x00000519
[234620.920493] MTGPU:  550: FWCtx 0x10042C00 (TQ_TDM-P21057-T21057-python)
[234620.920496] MTGPU:  550: Page Table Root 0x00000003`003E0000
[234620.920500] MTGPU:  550:   `--<Empty>
[234620.920503] MTGPU:  550: ------[ MUSA Device ID:1 End ]------
[234620.920509] MTGPU:  550: ------[ Device ID: 129 - Phys Heaps ]------
[234620.920515] MTGPU:  550: 0x000000007e129f9e -> Name: LMA, Type: LMA, CPU PA Base: 0x000001fc05220000, GPU PA Base: 0x05220000, Usage Flags: 0x00000002, Refs: 11, Free Size: 15716392960, Total Size: 17089388544
[234620.920518] MTGPU:  550: ------[ System Summary Device ID:1 ]------
[234620.920522] MTGPU:  550: Device System Power State: ON
[234620.920525] MTGPU:  550: MaxHWTOut: 20000000us, WtTryCt: 10000, WDGTOut(on,off): (10000ms,3600000ms)
[234620.920529] MTGPU:  550: ------[ AppHint Settings ]------
[234620.920533] MTGPU:  550:   Build Vars
[234620.920536] MTGPU:  550:     EnableTrustedDeviceAceConfig: N
[234620.920539] MTGPU:  550:     CleanupThreadPriority: 0x00000005
[234620.920543] MTGPU:  550:     WatchdogThreadPriority: 0x00000000
[234620.920546] MTGPU:  550:     HWPerfClientBufferSize: 0x000c0000
[234620.920549] MTGPU:  550:   Module Params
[234620.920556] MTGPU:  550:     none
[234620.920559] MTGPU:  550:   Debug Info Params
[234620.920564] MTGPU:  550:     CacheOpConfig: 0x0000000c
[234620.920568] MTGPU:  550:     CacheOpUMKMThresholdSize: 0xffffffff
[234620.920571] MTGPU:  550:   Debug Info Params Device ID: 1
[234620.920581] MTGPU:  550:     none
[234620.920584] MTGPU:  550: ------[ HTB Log state: Off ]------
[234620.920587] MTGPU:  550: ------[ Active Sync Checkpoints ]------
[234620.920612] sw: dmat-python-20911 @0 cur=0
[234620.920629] sw: dmat-python-20911 @0 cur=0
[234620.920637] sw: dmat-python-20911 @0 cur=0
[234620.920645] sw: dmat-python-20911 @0 cur=0
[234620.920653] sw: dmat-python-21057 @0 cur=0
[234620.920659] sw: dmat-python-21057 @0 cur=0
[234620.920665] sw: dmat-python-21057 @0 cur=0
[234620.920672] sw: dmat-python-21057 @0 cur=0
[234620.920678] ------[ Native Fence Sync: timelines ]------
[234620.920686] foreign_sync: @0 ctx=2 refs=1
[234620.920693] rogue-tdm: @0 ctx=103 refs=1
[234620.920699] rogue-tdm: @0 ctx=104 refs=1
[234620.920706] rogue-tdm: @0 ctx=105 refs=1
[234620.920712] rogue-cdm: @0 ctx=106 refs=1
[234620.920718] rogue-cdm: @0 ctx=107 refs=1
[234620.920724] rogue-cdm: @0 ctx=108 refs=1
[234620.920731] rogue-tdm: @0 ctx=120 refs=1
[234620.920737] rogue-tdm: @0 ctx=121 refs=1
[234620.920743] rogue-tdm: @0 ctx=122 refs=1
[234620.920749] rogue-cdm: @0 ctx=123 refs=1
[234620.920755] rogue-cdm: @0 ctx=124 refs=1
[234620.920762] rogue-cdm: @0 ctx=125 refs=1
[234620.920768] TS-python-21057: @1314 ctx=127 refs=1
[234620.920776] CC-python-21057: @258 ctx=128 refs=19
[234620.920910]  @240: (++) refs=1 fwaddr=0xf0095069 enqueue=1 status=Signalled 240-MUSA CDM Kick No.241
[234620.921061]  @241: (++) refs=1 fwaddr=0xf0095091 enqueue=2 status=Signalled 241-MUSA CDM Kick No.242
[234620.921197]  @242: (++) refs=1 fwaddr=0xf00950e1 enqueue=2 status=Signalled 242-MUSA CDM Kick No.243
[234620.921327]  @243: (++) refs=1 fwaddr=0xf0095149 enqueue=1 status=Signalled 243-MUSA CDM Kick No.244
[234620.921444]  @244: (++) refs=1 fwaddr=0xf0095129 enqueue=2 status=Signalled 244-MUSA CDM Kick No.245
[234620.921551]  @245: (++) refs=1 fwaddr=0xf0095079 enqueue=2 status=Signalled 245-MUSA CDM Kick No.246
[234620.921658]  @246: (++) refs=1 fwaddr=0xf0095141 enqueue=2 status=Signalled 246-MUSA CDM Kick No.247
[234620.921766]  @247: (++) refs=1 fwaddr=0xf00950a1 enqueue=1 status=Signalled 247-MUSA CDM Kick No.248
[234620.921874]  @248: (++) refs=1 fwaddr=0xf0095081 enqueue=1 status=Signalled 248-MUSA CDM Kick No.249
[234620.921983]  @249: (++) refs=1 fwaddr=0xf0095011 enqueue=2 status=Signalled 249-MUSA CDM Kick No.250
[234620.922095]  @250: (++) refs=1 fwaddr=0xf0095121 enqueue=2 status=Signalled 250-MUSA CDM Kick No.251
[234620.922207]  @251: (++) refs=1 fwaddr=0xf00950b9 enqueue=2 status=Signalled 251-MUSA CDM Kick No.252
[234620.922321]  @252: (++) refs=1 fwaddr=0xf00950f9 enqueue=2 status=Signalled 252-MUSA CDM Kick No.253
[234620.922433]  @253: (++) refs=1 fwaddr=0xf0095109 enqueue=2 status=Signalled 253-MUSA CDM Kick No.254
[234620.922538]  @254: (++) refs=1 fwaddr=0xf0095001 enqueue=2 status=Signalled 254-MUSA CDM Kick No.255
[234620.922649]  @255: (++) refs=1 fwaddr=0xf00950c9 enqueue=1 status=Signalled 255-MUSA CDM Kick No.256
[234620.922757]  @256: (++) refs=1 fwaddr=0xf0095031 enqueue=1 status=Signalled 256-MUSA CDM Kick No.257
[234620.922864]  @257: (++) refs=1 fwaddr=0xf0095049 enqueue=1 status=Signalled 257-MUSA CDM Kick No.258
[234620.922971] MTGPU:  550: ------------[ MTGPU DBG: END ]------------
[234620.922999] ------------[ cut here ]------------
[234620.923027] WARNING: CPU: 25 PID: 550 at /var/lib/dkms/mtgpu/1.0.0/build/src/pvr/osfunc.c:1191 OSWarnOn+0xf/0x20 [mtgpu]
[234620.923028] Modules linked in: 8021q garp mrp stp llc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif amd64_edac_mod edac_mce_amd kvm_amd kvm input_leds joydev binfmt_misc cdc_ether usbnet mii ipmi_si ccp ipmi_devintf k10temp ipmi_msghandler mac_hid sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon efi_pstore ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor uas usb_storage raid6_pq libcrc32c raid1 raid0 multipath linear mtgpu(OE) ast drm_vram_helper crct10dif_pclmul hid_generic ttm crc32_pclmul ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea sysfillrect igb usbhid sysimgblt crypto_simd fb_sys_fops hid cryptd snd_pcm glue_helper drm snd_timer ahci dca i2c_algo_bit snd libahci soundcore nvme i2c_piix4 nvme_core
[234620.923056] CPU: 25 PID: 550 Comm: mtgpu watchdog Tainted: G           OE     5.4.0-169-generic #187-Ubuntu
[234620.923058] Hardware name: whatever
[234620.923078] RIP: 0010:OSWarnOn+0xf/0x20 [mtgpu]
[234620.923079] Code: 00 00 00 00 49 c7 44 24 38 00 00 00 00 eb c8 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 89 e5 85 ff 75 02 5d c3 <0f> 0b 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00
[234620.923079] RSP: 0018:ffffaebb49007db8 EFLAGS: 00010202
[234620.923080] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[234620.923081] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000001
[234620.923081] RBP: ffffaebb49007db8 R08: 000000000000099c R09: 0000000000000004
[234620.923082] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[234620.923082] R13: 0000000000000002 R14: ffff9542263faf18 R15: ffff95421a55ffa0
[234620.923083] FS:  0000000000000000(0000) GS:ffff95424ec40000(0000) knlGS:0000000000000000
[234620.923084] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[234620.923085] CR2: 00007f4011424000 CR3: 0000000b8e80a000 CR4: 0000000000340ee0
[234620.923085] Call Trace:
[234620.923091]  ? show_regs.cold+0x1a/0x1f
[234620.923094]  ? __warn+0x98/0xe0
[234620.923111]  ? OSWarnOn+0xf/0x20 [mtgpu]
[234620.923114]  ? report_bug+0xd1/0x100
[234620.923118]  ? do_error_trap+0x9b/0xc0
[234620.923119]  ? do_invalid_op+0x3c/0x50
[234620.923135]  ? OSWarnOn+0xf/0x20 [mtgpu]
[234620.923137]  ? invalid_op+0x1e/0x30
[234620.923155]  ? OSWarnOn+0xf/0x20 [mtgpu]
[234620.923184]  PVRSRVDebugRequest+0x532/0x640 [mtgpu]
[234620.923209]  DevicesWatchdogThread_ForEachVaCb+0xc9/0x120 [mtgpu]
[234620.923229]  ? LinuxEventObjectWait+0x16e/0x1c0 [mtgpu]
[234620.923251]  ? InterruptTimeoutThread+0x60/0x60 [mtgpu]
[234620.923278]  List_PVRSRV_DEVICE_NODE_ForEach_va+0x52/0x70 [mtgpu]
[234620.923301]  DevicesWatchdogThread+0x90/0x200 [mtgpu]
[234620.923320]  OSThreadRun+0x24/0x50 [mtgpu]
[234620.923324]  kthread+0x104/0x140
[234620.923341]  ? OSTimerCallbackWrapper+0x20/0x20 [mtgpu]
[234620.923342]  ? kthread_park+0x90/0x90
[234620.923343]  ret_from_fork+0x35/0x40
[234620.923344] ---[ end trace 10e04332c5857602 ]---
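
For anyone triaging a similar dump, the log above can be condensed into the few numbers that matter: the addresses of the AMD-Vi `IO_PAGE_FAULT` events, the cores reported as `Guilty Lockup` together with the offending PID, and the free and total size of the `LMA` heap. The following is a minimal sketch that assumes the line formats shown above stay the same across driver versions; the function name `summarize` and the `SAMPLE` excerpt are illustrative only and are not part of mtgpu or torch_musa.

```python
import re

# A few lines copied verbatim from the log above. In practice you would
# feed the saved `dmesg` output to summarize() instead.
SAMPLE = """\
[234620.907129] mtgpu 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x7da779de00 flags=0x0000]
[234620.907297] AMD-Vi: Event logged [IO_PAGE_FAULT device=81:00.0 domain=0x0010 address=0x7da7790200 flags=0x0000]
[234620.919116] MTGPU:  550:   Recovery 1: Core = 0, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000200, Guilty Lockup
[234620.919572] MTGPU:  550:   Recovery 1: Core = 1, PID = 21057, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000000, Guilty Lockup
[234620.920515] MTGPU:  550: 0x000000007e129f9e -> Name: LMA, Type: LMA, CPU PA Base: 0x000001fc05220000, GPU PA Base: 0x05220000, Usage Flags: 0x00000002, Refs: 11, Free Size: 15716392960, Total Size: 17089388544
"""

# Patterns are assumptions derived only from the line shapes in this issue.
IOMMU_FAULT = re.compile(r"IO_PAGE_FAULT .*?address=(0x[0-9a-fA-F]+)")
LOCKUP = re.compile(r"Core = (\d+), PID = (\d+),.*Guilty Lockup")
HEAP = re.compile(r"Free Size: (\d+), Total Size: (\d+)")


def summarize(log_text):
    """Collect IOMMU fault addresses, guilty-lockup (core, pid) pairs and heap sizes."""
    summary = {"iommu_fault_addrs": [], "lockups": [], "heap": None}
    for line in log_text.splitlines():
        fault = IOMMU_FAULT.search(line)
        if fault:
            summary["iommu_fault_addrs"].append(int(fault.group(1), 16))
            continue
        lockup = LOCKUP.search(line)
        if lockup:
            summary["lockups"].append((int(lockup.group(1)), int(lockup.group(2))))
            continue
        heap = HEAP.search(line)
        if heap:
            # Stored as (free_bytes, total_bytes); the last heap line seen wins.
            summary["heap"] = (int(heap.group(1)), int(heap.group(2)))
    return summary


summary = summarize(SAMPLE)
free_bytes, total_bytes = summary["heap"]
used_bytes = total_bytes - free_bytes

print("IOMMU fault addresses:", [hex(a) for a in summary["iommu_fault_addrs"]])
print("Guilty lockups (core, pid):", summary["lockups"])
print("Heap used / total (GiB): %.2f / %.2f" % (used_bytes / 2**30, total_bytes / 2**30))
```

Note that the `Phys Heaps` section in this dump only covers Device ID 129, so the heap figures describe one card at the moment the dump was taken, not the combined usage of both cards.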
caizhi-mt commented 8 months ago

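We will be releasing an update to the driver and the torch_musa code soon, so please keep an eye out. The driver will be available at: https://developer.mthreads.com/sdk/download/musa?equipment=&os=&driverVersion=&version= (not released yet)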

dixyes commented 6 months ago

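> We will be releasing an update to the driver and the torch_musa code soon, so please keep an eye out. The driver will be available at: https://developer.mthreads.com/sdk/download/musa?equipment=&os=&driverVersion=&version= (not released yet)

The developer site is down. It's 2024 already; please put the downloads on a CDN or object storage (OSS), it doesn't cost much. If all else fails, even Baidu Netdisk would do. This site of yours is absurd.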

dixyes commented 6 months ago

Fixed in mtgpu 2.5.0