antmicro / alkali-csd-fw

Apache License 2.0

existing race condition #19

Open twilfredo opened 1 year ago

twilfredo commented 1 year ago

Hey Guys!

I was wondering if we could get a bit more insight about the race condition that is mentioned here:

https://github.com/antmicro/alkali-csd-fw/blob/50f6c5f958527c4f81c463ca57306b6ef96a1210/rpu-app/src/cmd.c#L225

We are hitting a hard crash when IOs are run but not completed immediately (IOs are sent to the APU to be stored in a different backend). So completion of an IO might not happen directly after the RPU receives it from the host (immediate completion seems to be what happens when using the ramdisk for storage).

Would you expect to see races given the above conditions? I can get things to work with a thread handling all IOs and synchronizing them so that they run only one after another, but this cuts our transfer speeds down to about 20 MiB/s for writes and ~50-80 MiB/s for reads (depending on IO block size).

Any ideas on what may be happening? Could it be the race above? Why would it crash Zephyr?

Thanks!

twilfredo commented 1 year ago

Turns out the issue I was facing came from invoking rpmsg_send() from an interrupt context (the function is blocking). This is a bug that may be triggered by vendor_cb() -> send_cmd(), which is run in an interrupt context upon DMA completion; if rpmsg_send() blocks there, it will likely crash.

So far I have this fixed by offloading the work to a thread.
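
For anyone hitting the same thing, here is a minimal sketch of the deferral pattern, not the actual change. It is plain C11 so it builds on a host: the interrupt-side callback only enqueues a message without blocking, and a thread-context function drains the queue and calls the sender that may block. All names here (`io_msg`, `msg_queue`, `queue_put_from_isr`, `queue_drain_in_thread`, `fake_blocking_send`) are hypothetical and do not exist in the firmware; `fake_blocking_send` stands in for a wrapper around rpmsg_send().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 8u /* power of two so index wrap is a cheap mask */

/* Hypothetical message type; the real firmware's command layout differs. */
struct io_msg {
    uint32_t cmd_id;
    uint32_t len;
};

/* Single-producer (ISR) / single-consumer (worker thread) ring buffer. */
struct msg_queue {
    struct io_msg slots[QUEUE_DEPTH];
    atomic_uint head; /* next slot to write, advanced only by the ISR side */
    atomic_uint tail; /* next slot to read, advanced only by the worker */
};

/* A sender that is allowed to block, e.g. a wrapper around rpmsg_send(). */
typedef int (*blocking_send_fn)(const struct io_msg *msg);

/* Safe in interrupt context: never blocks, reports failure when full. */
static bool queue_put_from_isr(struct msg_queue *q, const struct io_msg *msg)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head - tail == QUEUE_DEPTH) {
        return false; /* full: the caller decides whether to drop or retry */
    }
    q->slots[head & (QUEUE_DEPTH - 1u)] = *msg;
    atomic_store_explicit(&q->head, head + 1u, memory_order_release);
    return true;
}

/* Thread context only: drains the queue and calls the blocking sender.
 * Returns the number of messages sent successfully. */
static unsigned queue_drain_in_thread(struct msg_queue *q, blocking_send_fn send)
{
    unsigned sent = 0;
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

    while (tail != atomic_load_explicit(&q->head, memory_order_acquire)) {
        struct io_msg msg = q->slots[tail & (QUEUE_DEPTH - 1u)];
        /* Release the slot before the (possibly slow) send. */
        atomic_store_explicit(&q->tail, ++tail, memory_order_release);
        if (send(&msg) == 0) {
            sent++;
        }
    }
    return sent;
}

/* Stand-in sender for illustration; records the last command it was given. */
static uint32_t last_sent_id;

static int fake_blocking_send(const struct io_msg *msg)
{
    last_sent_id = msg->cmd_id;
    return 0;
}
```

On Zephyr itself the hand-rolled ring buffer is probably unnecessary: a kernel message queue (k_msgq_put() with K_NO_WAIT from the interrupt side, k_msgq_get() with K_FOREVER in the worker thread) or a work item submitted with k_work_submit() should give the same split between interrupt and thread context. I have not tested either against this codebase.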

rw1nkler commented 1 year ago

It's good to hear you solved the problem. If you find a satisfactory solution or want to discuss your changes, please open a pull request.