Lack of locking - Githubissues

ozbenh commented 5 years ago

Hi !

I looked at your driver and I think quite a bit of your stability problems could relate to the absence of meaningful locking in there :)

You plan a dance with atomics here or there but that's really not going to fly. I reckon you should have a per-queue spinlocks and for general non-interrupt housekeeping, use a mutex. Esp. with USB, you will be called at random times including softirq, etc...

One interesting question is how often do you need to use the mailbox. It might be worthwhile considering making it a request queue with interrupt completions, up to you...

MCMrARM commented 5 years ago

The VHCI code does have all required locking as far as I am aware. All of the atomics were carefully placed. It might be possible that some of the BCE code might lack some locking (actually I am aware of one case of this - queue creation has no locking, this causes issues if one wanted to destroy a queue; a bigger issue is that I learned about spin_lock_irqsave only after using spin_lock caused me a global system freeze) but there's an important design decision there: it's up to the user (VHCI) to lock on submit and message polling (and you can indeed find some locks in the VHCI code). I only support processing a single MSI interrupt at a time, so generally the code does not require locking there (the queue code is like a low level implementation, it does minimal locking and uses more atomics as they are faster and there are no issues with concurrent access there). I got thru most of the stability issues other than some regarding the VHCI global command queue, where bad things (mismatched replies) happen after a command times out, and I think that happens because I originally thought the command queue is a queue but apparently it seems it's single-request-at-a-time (the firmware only seems to support cancelling the last command). The timeout in question happen during some specific device->host interactions, not sure about the details because it's not common. I'll have to verify if the queue is indeed single-request later by reversing the OS X driver. The biggest issue that I am facing right now, isn't stability, it's latency. The latency is just plain terrible. Enable the camera and the mouse lags by like 50ms. That's just terrible. I'm not even sure what am I doing wrong exactly, the code should be "fast".

Mailbox is virtually unused, no reason to optimize it. It's only used like 3 times during setup and it's designed to be synchronous.

MCMrARM commented 5 years ago

As for writing a message to the queue, this following pattern has to be used in multi-threaded code:

if (bce_reserve_submission(q, timeout))
  goto error;
lock()
s = bce_next_submission(q)
// write to s
bce_submit_to_device(q)
unlock()

The reason for that is because bce_reserve_submission can block (it uses an atomic+completion internally), and I didn't want to hurt code that's single-threaded. If bce_reserve_submission returns 0, either a message must be sent or the reservation be cancelled using bce_cancel_submission_reservation(). And while writing it I just realized bce_cancel_submission_reservation has a bug, will fix it later (doesn't notify code that waits for submissions if the queue had no empty items left).

MCMrARM commented 5 years ago

For handling IRQs, all code generally only has a single consumer thread currently. It doesn't have to be the IRQ thread, but for concurrent access a spinlock would have to be used. Concurrent access to the completions makes no sense whatsoever though, as the queues as ordered, so that'd bring along a huge disaster if done. Anyways, the api for consuming completions uses bce_next_completion. This lets you access the completion but it does not actually let the submission code reuse it. Code using it must ALWAYS also call bce_notify_submission_complete in order to let the code reuse the taken queue indexes.

MCMrARM commented 5 years ago

Another note: queue_dma is pretty much dead code, since it's never used anywhere.

MCMrARM commented 5 years ago

I'll also restructure the driver later probably so the VHCI is like a plugin to the BCE, currently it's hard-tied but that was just a hack.

MCMrARM commented 5 years ago

Anyways, yeah, any ideas regarding the insane latency would be welcome, I'll be probably trying to use a kernel performance tracker tomorrow, if the issue turns out to be the uvcvideo driver, I'll be really mad.

ozbenh commented 5 years ago

On Mon, 2019-07-22 at 11:37 -0700, MCMrARM wrote:

The VHCI code does have all required locking as far as I am aware. All of the atomics were carefully placed. It might be possible that some of the BCE code might lack some locking (actually I am aware of one case of this - queue creation has no locking, this causes issues if one wanted to destroy a queue; a bigger issue is that I learned about spin_lock_irqsave only after it caused me a global system freeze) but there's an important design decision there: it's up to the user (VHCI) to lock on submit and message polling (and you can indeed find some locks in the VHCI code). I only support processing a single MSI interrupt at a time, so generally the code does not require locking there (the queue code is like a low level implementation, it does minimal locking and uses more atomics as they are faster and there are no issues with concurrent access there). I got thru most of the stability issues other than some regarding the VHCI global command queue, where bad things (mismatched replies) happen after a command times out, and I think that happens because I originally thought the command queue is a queue but apparently it seems it's single-request-at-a-time (the firmware only seems to support cancelling the last command). The timeout in question happen during some specific device->host interactions, not sure about the details because it's not common. I'll have to verify if the queue is indeed single-request later by reversing the OS X driver. The biggest issue that I am facing right now, isn't stability, it's latency. The latency is just plain terrible. Enable the camera and the mouse lags by like 50ms. That's just terrible. I'm not even sure what am I doing wrong exactly, the code should be "fast". Mailbox is virtually unused, no reason to optimize it. It's only used like 3 times during setup and it's designed to be synchronous.

I don't have much time to reply in depth but atomics being faster than locks is generally untrue and hand-made synchronization with atomic is almost universally broken, especially on non-x86 platforms where it tend to miss important memory barriers (or on the contrary use to heavy ones thus impairing ... performance).

Locks are almost always the right thing to use. They also come with much more auditing and debugging tools (such as lockdep).

Effectively you need to ensure your queues have a lock for submission and interrupt processing (but drop it when calling completion callbacks).

Don't forget to irqsave anything that can be interrupted, though the interrupt itself often doesn't need to.

Keep in mind that timers are softirqs (so in effect interrupts but can be interrupted by HW interrupts).

As for latency, I don't know for sure, are you sharing queues ? Do you have per-queue interrupts ? Are you properly handling all pending completions in a single interrupt ? Are you queueing up enough requests in advance ? etc...

Cheers, Ben.

MCMrARM commented 5 years ago

I thought spinlocks enforce a memory barrier anyways? How would the sync work otherwise? Also, we're only targetting ia64 here. I could replace the atomics with spinlocks, but I believe my use of atomics is valid, I'll reconsider it later as I think I may be able to improve the performance if I use some more global locks.

Submission queues do have locks, they're just in the VHCI (user) code, not in the queue code.

There's only a single interrupt globally and there's not really anything I can do about it. This is just how the OS X driver works. Each VHCI device gets it's own queue set. I don't think it's an issue with actually queueing up anything, the latency happens as soon as you enable webcam, so it's like, the webcam seems to put either the device, the host, or both with so much traffic that some part of the code can't keep up. I think I can single the device out as using the webcam is OS X has no significant effects. So, it leaves my code and the uvc driver as the possible culprits. My handling code is not super fast though, in many cases it takes 0.1ms to fully process the interrupts (including the USB device drivers handling) so it's probably improve-able.

ozbenh commented 5 years ago

On Tue, 2019-07-23 at 02:10 -0700, MCMrARM wrote:

I thought spinlocks enforce a memory barrier anyways? How would the sync work otherwise? Also, we're only targetting ia64 here. I could replace the atomics with spinlocks, but I believe my use of atomics is valid, I'll reconsider it later as I think I may be able to improve the performance if I use some more global locks. Submission queues do have locks, they're just in the VHCI (user)

x86_64, not ia64 :-) But yeah, in this case the barriers don't matter as much as, for ex, on ARM. That said, locks are more efficient than hand-made-synchro-with-atomics most of the time. For example locks dont usually need an atomic op for unlocking. They make also the code clearer.

As for your perf issue, do you process all the data in the queue from the interrupt ? (ie, pass it on the "client" driver ?). It could be that's where uvc takes time and hogs other clients ? You could try delaying to tasklets or threads I suppose. All USB traffic goes through the same queue or is it per device ?

code, not in the queue code. There's only a single interrupt globally and there's not really anything I can do about it. This is just how the OS X driver works. Each VHCI device gets it's own queue set. I don't think it's an issue with actually queueing up anything, the latency happens as soon as you enable webcam, so it's like, the webcam seems to put either the device, the host, or both with so much traffic that some part of the code can't keep up. I think I can single the device out as using the webcam is OS X has no significant effects. So, it leaves my code and the uvc driver as the possible culprits. My handling code is not super fast though, in many cases it takes 0.1ms to fully process the interrupts (including the USB device drivers handling) so it's probably improve-able. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

MCMrARM commented 5 years ago

I have underestimated the expensiveness of atomics, I'll have to try to avoid them more.

The queues are separate between devices as far as queues go but all the deliveries are done from a single thread (there's a single interrupt). Delivering the completions from different threads could be attempted and I have thought about it but I am not 100% sure if that's what the issue.

MCMrARM commented 5 years ago

The driver seems stable now, and even if it's not, I don't think it's missing any locking.

ozbenh commented 5 years ago

No worries, i'll try to review it in more details when I get a chance (travelling at the moment)

MCMrARM commented 5 years ago

I think I'll be thinking about upstreaming this soon (probably at the end of this month), but first I want to fix suspend and latency issues.

MCMrARM commented 5 years ago

(I won't be upstreaming the audio part, only the VHCI)

MCMrARM / mbp2018-bridge-drv

Lack of locking #1