amd / xdna-driver


iohub register race condition fix #227

Closed mamin506 closed 1 month ago

mamin506 commented 1 month ago
  1. Add mailbox_res_record to mailbox_channel. The resource record can also collect statistics to help with debugging and analysis, so this is a worthwhile addition.

  2. Move mailbox_rx_worker before mailbox_irq_handler for better readability.

  3. In mailbox_irq_handler(), we are aware that clearing the iohub register on the host can race with the FW setting it. When that happens, the iohub register fails to trigger the MSI-X interrupt, and the application hangs. The idea is to fix this on the host side. In mailbox_irq_handler(), after clearing iohub and launching the worker, the handler reads iohub up to 4 more times. If every read returns 0, no race occurred during that window and the handler can exit safely. If any read returns 1, the FW wants to trigger an interrupt, so the handler clears iohub again and enqueues another work item. In theory this is not a perfect solution, because the race window is narrowed rather than closed. In practice the handler runs very fast and the FW is slower to trigger the next interrupt, so the change looks very promising.
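The re-check logic in item 3 can be sketched roughly as follows. This is a user-space simulation, not the driver code: `struct sim_channel`, `sim_read_iohub()`, `sim_clear_iohub()`, `sim_queue_rx_work()` and `sim_irq_handler()` are hypothetical names invented for illustration, and the `fw_sets_left` field only exists to imitate the FW setting the register behind the host's back. It also assumes the read count is not restarted after a re-clear, which the description does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOHUB_RECHECK_MAX 4 /* "up to 4 times", per the description */

/* Hypothetical stand-in for a mailbox channel; not the driver's struct. */
struct sim_channel {
	volatile uint32_t iohub;   /* simulated iohub interrupt register */
	unsigned int work_queued;  /* how many times rx work was enqueued */
	unsigned int fw_sets_left; /* simulation hook: pending FW "set" events */
};

/* Read the register; the simulated FW may set it just before the read. */
static uint32_t sim_read_iohub(struct sim_channel *ch)
{
	if (ch->fw_sets_left > 0) {
		ch->fw_sets_left--;
		ch->iohub = 1;
	}
	return ch->iohub;
}

static void sim_clear_iohub(struct sim_channel *ch)
{
	ch->iohub = 0;
}

/* Stands in for enqueueing the rx worker on a work queue. */
static void sim_queue_rx_work(struct sim_channel *ch)
{
	ch->work_queued++;
}

/*
 * Clear iohub, launch the worker, then poll the register a few times.
 * Any non-zero read is treated as a FW request that raced with the
 * clear, so the handler clears again and enqueues another work item.
 */
static void sim_irq_handler(struct sim_channel *ch)
{
	int i;

	assert(ch != NULL);

	sim_clear_iohub(ch);
	sim_queue_rx_work(ch);

	for (i = 0; i < IOHUB_RECHECK_MAX; i++) {
		if (sim_read_iohub(ch) == 0)
			continue; /* no race seen on this read */

		sim_clear_iohub(ch);
		sim_queue_rx_work(ch);
	}
}
```

Calling `sim_irq_handler()` on a channel with `fw_sets_left = 0` enqueues one work item; with `fw_sets_left = 1` the first re-read sees the register set again and a second work item is enqueued. A set that arrives after the last read is still missed, which is the remaining window the description acknowledges.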

In my stress test, with TDR disabled in the driver, it runs overnight without issue. Without this change, the same test hangs in less than half an hour.