Rekeying failure on a busy link

MichelStam commented 8 years ago

Guys,

I've been testing an annoying bug I've been having on a mesh of (currently) 2 units for the past week or so, but I cannot seem to find what causes it.

It happens when I run an iperf test between the units. At or around the time the SAE lifetime expires, a rekey occurs, after which traffic between the units stops. Sometimes a packet arrives about a key lifetime later, but the link does not become stable again.

If I leave the link idle (no iperf test, just some pings), then this problem does not seem to occur.

Looking at the debug traces from meshd-nl80211, I can find no fault. I also looked at the key material sent down to the ath9k driver (printk's in the kernel driver), but even reading back those registers does not indicate to me that there's a fault.

Both units use an ath9k Atheros card; One is an AzureWave AR5B95, the other is a Compex WLE200N2-23. I have also observed the problem on Compex WLE350NX cards, so I am guessing this is not hardware related.

I set up both units with the attached config: meshd.txt

The kernel I use is 4.4.11, but I've seen the same problem with 3.10.49. The compat-wireless 2016-01-10 driver set used by OpenWRT seems to have the same problem with the old 3.10.34 kernel I run on that system.

The iperf setup is (using 2.0.5):

One system running iperf -s -u -p 6969 -i 5
One system running iperf -c SERVER_IP -u -p 6969 -i 5 -t 86400 -b 100M

I create the mesh interfaces by:

iw phy phy0 interface add mesh0 type mp
ifconfig mesh0 IP MASK up
meshd-nl80211 -c meshd.txt -i mesh0

Right now the key lifetime is at 60 seconds for problem reproduction, but I have seen the same problem on a link with a key lifetime of 3600 seconds; the link then dies at that time.

Can anyone give me a couple of pointers where to look, or maybe help me out?

Regards,

Michel Stam

fhuberts commented 8 years ago

I think this is related to #39

MichelStam commented 8 years ago

I have not seen the messages: 'confirm did not verify!' anywhere in the logs, so I'm not sure about this.

Reading back, I suppose I should have attached a log from one of the nodes as well. It's attached now: meshd-log.txt

I just noticed your commit, I will test again with the latest master tomorrow morning.

alexgrin commented 8 years ago

This appears to be a long-standing issue with ath9k. I submitted this as a bug to the ath9k mailing list last year ( https://www.mail-archive.com/ath9k-devel%40lists.ath9k.org/msg13595.html ) and there was an earlier report of very similar issues ( http://lists.shmoo.com/pipermail/hostap/2014-November/031377.html ). I was not able to get any traction. I've gone as far as reading the key back from the card registers and it matches what's expected. Our workaround was to have a unicast probe between the nodes that occurs right after rekey and, if it fails, rekey.

MichelStam commented 8 years ago

@alexgrin: I'm inclined to believe you on that, looking at the debugging you did. Do you have a patch of that workaround somewhere?

@fhuberts: I've tested with your patches, no changes as far as the bug is concerned.

fhuberts commented 8 years ago

Would you mind sharing the code of your workaround?

alexgrin commented 8 years ago

The code is pretty awful looking but I'll see if I can button it up this weekend and push up to my repo.

MichelStam commented 8 years ago

@alexgrin: I was looking at the workaround you described, I assume that you use the NL80211_CMD_PROBE_CLIENT netlink call for this? This would probably require a small patch in net/wireless/nl80211.c to make this work (this call is only allowed for AP and P2P interfaces, by default). Or do you probe differently?

alexgrin commented 8 years ago

Nope, it's nowhere near as awesome as you think. Authsae does a multicast ping (layer 3) after rekey and waits to hear a unicast response from a device with the MAC address of the peer we just rekeyed with. If there's no response, a rekey is triggered. You have to specify the interface for multicast to the daemon for this to work. It's a pretty nasty hackjob and I'll post the code as is (-ish) soon.
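
The probe-then-rekey loop described above could be sketched roughly like this. It is only a minimal model, not the code from the actual patch: `send_multicast_ping`, `wait_for_unicast_reply` and `trigger_rekey` are hypothetical callbacks standing in for whatever the daemon really does, and the timeout and attempt limit are assumptions.

```python
from typing import Callable


def verify_link_after_rekey(
    peer_mac: str,
    send_multicast_ping: Callable[[], None],
    wait_for_unicast_reply: Callable[[str, float], bool],
    trigger_rekey: Callable[[str], None],
    timeout_s: float = 1.0,
    max_attempts: int = 3,
) -> bool:
    """Probe the peer after a rekey and rekey again if nothing comes back.

    Returns True as soon as a unicast reply from peer_mac is seen,
    False if every probe goes unanswered.
    """
    for _ in range(max_attempts):
        send_multicast_ping()
        if wait_for_unicast_reply(peer_mac, timeout_s):
            return True
        # No reply: assume the new keys did not take and rekey once more.
        trigger_rekey(peer_mac)
    return False
```

Passing the three actions in as callbacks keeps the retry policy separate from the socket and netlink details, which is also what makes the loop easy to exercise without hardware.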

alexgrin commented 8 years ago

Here's the yuckyness - https://github.com/cococorp/authsae/commit/a1591d30b9eb8f9a31291a80d5b1a8e354666e74

fhuberts commented 8 years ago

thanks for sharing!

MichelStam commented 8 years ago

Thank you for the patch! I'll have a look at it.

Sorry for the radio silence, I was testing a patch which may help find a solution. It works by resetting the ath9k chip when a new key is installed. It does seem to cause a LOT of authsae renegotiation traffic, but the link does keep forwarding traffic. Although it is a start, I would hardly call this a nice patch; it's the equivalent of using buckshot to swat mosquitoes.

Would someone mind checking it out and maybe suggesting a better approach?

ath9k-install_key-buckshot.diff.txt

MichelStam commented 8 years ago

I've posted the issue on the ath9k-devel list as well, hopefully I can stir something up/get the help to address this.

https://lists.ath9k.org/pipermail/ath9k-devel/2016-July/014676.html

MichelStam commented 8 years ago

So far 0 response, tried to bump it one time with no effect.

For the foreseeable future, I've chosen to use software encryption.

chunyeow commented 8 years ago

Please note that your maximum achievable throughput will degrade if using software encryption.

MichelStam commented 8 years ago

Hello Chun-Yeow,

I agree, there is a measurable performance drop of about 2 Mbps. Luckily, for this particular application, high bandwidth is not the most important thing, but link stability is. I see it as a temporary solution so the project can move forward at my employer's side. When a fix is available, we're definitely switching back to hw encryption.

On a side note, I seem to have gotten a little traction on the ath9k-devel list; Adrian Chadd has taken a look at the Atheros reference driver, which seems to have a fix called ATH_SUPPORT_KEYPLUMB_WAR. This reinserts the key when there are Rx decryption errors. This fix may have to be ported into the ath9k driver.

Is it maybe an idea for those of you that have run into this issue as well at some point or the other to pitch in on the ath9k-devel list?

Cheers,

Michel

fhuberts commented 8 years ago

Can you point me to that thread? It's very relevant for our deployment as well

MichelStam commented 8 years ago

Of course:
https://lists.ath9k.org/mailman/listinfo/ath9k-devel - The list
https://lists.ath9k.org/pipermail/ath9k-devel/2016-July/014676.html - The thread

fhuberts commented 8 years ago

tnx

MichelStam commented 8 years ago

I recently got a mail from Sven Eckelmann about a patch which may solve the situation: https://lkml.kernel.org/r/20161018083552.28592-1-a@unstable.cc

I have not yet had time to take a look at this, so caveat emptor.

fhuberts commented 8 years ago

thanks. will try to see if this works in our setup. please let me know more if you have more information :-)

fhuberts commented 8 years ago

Is that patch being upstreamed?

MichelStam commented 8 years ago

No, it is not. Sven added the following to the message:

The patch itself has (at least) one big problem. It is using some mac80211 internals in ath_key_config_iter to make sure that the uploaded keys were actually programmed in the hardware. Without this check the keys could end up in the lower slots and thus break all connections.

So this patch could be a starting point for someone who wants to add a workaround which is acceptable by upstream.

Here is the original email: http://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg14458.html

fhuberts commented 8 years ago

I just contacted Antonio via email, offering help in upstreaming. Are you willing to help here? Maybe Bob and Alex can participate as well?

MichelStam commented 8 years ago

Sure, I was planning to do this somewhere in the coming days. Maybe I can re-use some of the kludge I wrote up to get around the use of internals (unless I am doing that myself as well).

fhuberts commented 8 years ago

shall we continue via direct email? mailings (at) hupie (dot) com

MichelStam commented 7 years ago

After spending 2 weeks on this issue together with Ferry Huberts, we did not get much further.

We tried:

Introducing a worker process that gets activated on a key_set or whenever a number of decrypt_errors occur. In the former case it does not seem to have any effect, and decrypt_errors do not seem to occur very often, so the decrypt_error count does not increment enough to trigger the worker. Setting the limit to 1 does not help either.
Introducing the KEYPLUMB_WAR as described in https://github.com/qca/qcamain_open_hal_public/blob/master/hal/ar9300/ar9300_keycache.c (see below). This does not work either: the write and subsequent check indicate that the data is written correctly at the first attempt. The extra xorKey argument does not solve the issue; it only adds an extra read/write cycle.
Silencing the chip completely by disabling the interrupts and cancelling the tasks (and waiting for completion). This does not solve the issue and only adds instability.
Lastly, scheduling a worker that replumbs the key 2, 5 or 10 seconds after a set_key. This does not help either.

It seems to me that the chip gets very confused when a key is installed while it is processing a lot of traffic. Quieting the chip does not seem to help, unless I did not get it quiet enough.

I have attached the various patches as an example of what was tried.

Replumb on decrypt_error and key_set: 0001-ath9k-Implement-key-cache-corruption-work-around.patch.txt
Open HAL key replumb: 0001-Rework-of-the-key-plumbing-using-ath-hal-code.patch.txt
Replumb x seconds after key_set: 0001-ath9k-rekey-after-10-seconds.patch.txt

I lost the patch which quiets the chip prior to keying (accidental delete), so here is a description instead. In ath9k_set_key (ath9k/main.c), just before the switch statement, add:

spin_lock_bh(&sc->sc_pcu_lock);
ath9k_hw_disable_interrupts(ah);
tasklet_disable(&sc->intr_tq);
spin_unlock_bh(&sc->sc_pcu_lock);

Then, after the switch statement, add:

spin_lock_bh(&sc->sc_pcu_lock);
tasklet_enable(&sc->intr_tq);
ath9k_hw_enable_interrupts(ah);
spin_unlock_bh(&sc->sc_pcu_lock);

Adrian Chadd has suggested in an email to try my original buckshot patch ath9k-install_key-buckshot.diff.txt, but this time reinstall keys after the reset. I will try to find some time and do this, see if it helps.

Cheers,

Michel

fhuberts commented 7 years ago

Yes, I have never seen hardware behave so bafflingly, and I suspect we will not get any further unless we get more information on what is really going on. Very, very unfortunate.

MichelStam commented 7 years ago

Ok. So after a few valiant attempts, I got a little further, but still nowhere close to a working patch. Resetting the chip, then replumbing the cache seems to be the way to go.

From a bug which is triggered on every rekey, I'm now at a situation where the error usually occurs every couple of minutes. The longest I got was 500 seconds with a rekey every 60 seconds. I'm getting the idea that the chip must be reset as quickly as possible after a key insertion, otherwise the chip gets confused.

Another thing which significantly delayed my progress is the sheer amount of locking in the ath9k driver. I needed to grab the rtnl_lock in order to access the key material in mac80211, but sc->mutex also seems to be required. Grabbing both is inviting all sorts of locking issues which usually result in the whole network stack hanging. As a quick hack, I stopped using sc->mutex and grabbed only rtnl_lock (not doing so will cause a BUG_ON every rekey).

Please take a close look at this patch. It is by no means complete or clean yet, so not ready for production. 0002-ath9k-reset-and-replumb-key.patch.txt

fhuberts commented 7 years ago

I do have a fix in authsae (rekeying) but need to test it further

fhuberts commented 7 years ago

My rekey code works well. The issues we had with ath5k systems were related to ath5k barfing over 'htmode=HT20'. I've re-opened my rekey PR (https://github.com/cozybit/authsae/pull/55)

xbing6 commented 7 years ago

One question, why do you use authsae rather than wpa_supplicant, I'd think wpa_supplicant is much more widely used?

bcopeland commented 7 years ago

At the time authsae was created, wpa_supplicant didn't support SAE.

Now it does -- actually cozybit contributed that support for wpa_supplicant.

MichelStam commented 7 years ago

Personally, I had some issues with wpa_supplicant in combination with OpenWRT. Some race condition which prevented either the AP or mesh function from working. Did not have this problem when starting everything manually, just when using the OpenWRT configuration system. Since AuthSAE did work, and seemed more stable at the time I settled for that.

xbing6 commented 7 years ago

Thanks for your answer.

I was having a race condition with dnsmasq (for Ethernet). Not sure if the link below is of any help: https://dev.openwrt.org/ticket/7423

zhejunli commented 6 years ago

Any final solutions on this? I have exactly the same issue in ath9k. Thanks!

MichelStam commented 6 years ago

Nope, sorry.

Michel Stam

erikarn commented 6 years ago

... I think I'm finally hitting it in ath9k at my current employer. Let's see how far down the rabbit hole I go.

bcopeland commented 6 years ago

FWIW I think the way authsae does rekeying could be reworked to avoid this -- it rekeys the PMK but we could/should rekey the MTK instead. Then there'd be a different key id so both keys could be present in hardware for a short time.

But for MGTK I don't think there's an equivalent solution.
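
To make the key id idea concrete, here is a toy model of a peer's unicast keys indexed by key id. It only illustrates the ordering (install the new key next to the old one, switch transmit, then retire the old key); the class and method names are made up for this sketch, and whether the ath9k key cache actually tolerates two pairwise keys for one peer is still an open question.

```python
class PairwiseKeySlots:
    """Toy model of one peer's unicast keys, indexed by key id (0 or 1)."""

    def __init__(self, initial_key: bytes) -> None:
        self.keys = {0: initial_key}
        self.tx_key_id = 0

    def install_next(self, new_key: bytes) -> int:
        """Install new_key under the unused key id; the old key stays usable for RX."""
        new_id = 1 - self.tx_key_id
        self.keys[new_id] = new_key
        return new_id

    def switch_tx(self, key_id: int) -> None:
        """Start transmitting with an already installed key."""
        if key_id not in self.keys:
            raise KeyError(f"key id {key_id} is not installed")
        self.tx_key_id = key_id

    def retire_old(self) -> None:
        """Drop every key except the one currently used for TX."""
        self.keys = {self.tx_key_id: self.keys[self.tx_key_id]}

    def can_decrypt(self, key_id: int) -> bool:
        """A frame can be decrypted only if its key id is still installed."""
        return key_id in self.keys
```

Between `install_next` and `retire_old`, frames sent under either key id can still be decrypted, which is the overlap window the suggestion relies on.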

erikarn commented 6 years ago

Hi,

I'm going to experiment with installing a second key and then blanking out the first one, or maybe blanking out the first one before adding the second. The challenge is figuring out whether the keycache will let you get away with such hijinx for the peer key.

I'll also see if stopping RX whilst programming unicast keycache slot updates helps. At least in net80211 you get a "i'm going to do a keycache update in a sec" so you can batch keycache updates behind say, stopping the TX/RX path. I dunno whether that's easy on mac80211 but I'll see.

-adrian

erikarn commented 6 years ago

interesting. if I replumb the keys on the receiver side then it doesn't fix things. That's ... odd.

erikarn commented 6 years ago

(so I wonder if there are two bugs here..)

zhejunli commented 6 years ago

Per my understanding:

It is not specific to Authsae. It is a common issue.
As someone mentioned, it is an ath9k chip h/w bug. I have read this: "https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwia6e2Q95LcAhVBGDQIHdWnDEMQFggoMAA&url=https%3A%2F%2Fmentor.ieee.org%2F802.11%2Fdcn%2F10%2F11-10-0018-00-000m-4-way-handshake-synchronization-issue.ppt&usg=AOvVaw2lqr2wOTS6MzRh8cob18jy" It looks like the two are related.

erikarn commented 6 years ago

Oh I know it's an ath9k chip bug. :-) That's what I have on my desk atm

zhejunli commented 6 years ago

I mean, the link shows a key installation deficiency in the 4-way handshake mechanism. Following that idea to solve that key re-installation problem may help to solve this "ath9k chip bug" too.

fhuberts commented 6 years ago

my rekeying patch (which was merged in #59 ) does that, and works well, except when the chip is under heavy load.

erikarn commented 6 years ago

Yeah, the "heavy load" issue is our issue here. I'll keep digging and see what I can find.

Is your patch focused on rekeying the transmit side key, or the remote/receiver key? (yes the CCMP keys are used for both TX/RX, I'm more interested in which side is doing the replumbing of keys to the HW.)

erikarn commented 6 years ago

ok, yeah. I'm seeing three separate bugs:

1. Somehow mac80211's PN tracking gets messed up on the receive side and it gets set to 1, which means all subsequent frames get rejected as CCMP replay until it reaches the old sequence counter;
2. The receiver sometimes needs replumbing; looks like about 10% of the time;
3. The /transmitter/ sometimes needs replumbing, about 50% of the time.
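
A simplified model of a CCMP-style replay check shows why the first bug stalls traffic for so long. This is one reading of the symptom, not mac80211 code: it assumes the receiver still holds a high PN from the old key while the sender's counter restarts at 1, and the `ReplayWindow` class and `count_dropped` helper are invented for the illustration.

```python
class ReplayWindow:
    """Simplified replay check: accept a frame only if its PN is strictly larger."""

    def __init__(self, last_pn: int = 0) -> None:
        self.last_pn = last_pn

    def accept(self, pn: int) -> bool:
        if pn <= self.last_pn:
            return False  # treated as a replay and dropped
        self.last_pn = pn
        return True

    def reset(self) -> None:
        """What should happen to the RX counter when a new key is installed."""
        self.last_pn = 0


def count_dropped(rx: ReplayWindow, first_pn: int, frames: int) -> int:
    """Feed consecutive PNs to the receiver and count the rejected frames."""
    dropped = 0
    for pn in range(first_pn, first_pn + frames):
        if not rx.accept(pn):
            dropped += 1
    return dropped


# Receiver kept PN 1000 from the old key, sender restarts at 1 after the rekey.
stale = ReplayWindow(last_pn=1000)
print(count_dropped(stale, first_pn=1, frames=1500))

# Same traffic when the RX counter is reset together with the key.
fresh = ReplayWindow(last_pn=1000)
fresh.reset()
print(count_dropped(fresh, first_pn=1, frames=1500))
```

In the stale case every frame is dropped until the sender's PN passes the old value, which would match a link that only recovers after a long stretch of traffic.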

I'm doing a UDP iperf of a few tens of mbit from an ath9k AP -> ath9k STA (both Peacock / AR9580) to reproduce this. In all cases the right keys make it through to the keycache code. When it breaks then at least whenever I've caught it the STA can still send frames to the AP which the AP can decrypt, but the frames from the AP can't be decrypted by the STA.

I wonder if the sender side hardware bug will suck less if we just completely pause transmit before doing a rekey (because that's what the rekeying patch seems to detect). The RX side rekey thing that QCA does is a different beast and I think fixes another bit of the problem - I'm not actively TXing (besides ACKs, obviously) from the STA -> AP during the failure mode, so if there's a hardware bug it's likely due to having a packet in receive flight during rekeying.

(Maybe I can experiment with pausing the MAC TX/RX whilst replumbing the key, which would have the added benefit of not ACKing anything during that window..)

erikarn commented 6 years ago

Ok, so bug 1 here was fixed by just disabling PTK rekey. It turns out the data/control path for transmit and mac80211 key management is not really set up for doing seamless PTK rekey, at least on the transmit side, and you can't guarantee to not drop frames on the receive side either. So, I'm not going to do it.

Which means the second and third don't happen.

Now, if someone has some spare time (and maybe me too) I think we should experiment with draining the ath9k station queue so we aren't transmitting anything that can use that keycache entry before we plumb in the new key. It's tricky because mac80211/ath9k aren't set up for that. But I /think/ that'll work around the TX keycache bug.
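
The drain-before-plumb ordering could look something like the toy model below. It is a single-threaded sketch with invented names (`StationTxPath`, `hw_queue`, `held`), not ath9k or mac80211 code; it only shows the intended sequence: hold back new frames, let everything already queued go out under the old key, swap the key while the slot is idle, then resume.

```python
from collections import deque


class StationTxPath:
    """Toy TX path for one station: frames are sent with whatever key is installed."""

    def __init__(self, key: bytes) -> None:
        self.key = key
        self.hw_queue = deque()  # frames already handed to the "hardware"
        self.held = deque()      # frames held back while the path is paused
        self.paused = False
        self.sent = []           # (frame, key used) pairs, in transmit order

    def submit(self, frame: str) -> None:
        """Queue a frame, holding it back if a rekey is in progress."""
        (self.held if self.paused else self.hw_queue).append(frame)

    def drain(self) -> None:
        """Transmit everything that is currently in the hardware queue."""
        while self.hw_queue:
            self.sent.append((self.hw_queue.popleft(), self.key))

    def rekey(self, new_key: bytes) -> None:
        self.paused = True   # stop feeding the hardware queue
        self.drain()         # in-flight frames leave under the old key
        self.key = new_key   # key slot is idle while it is rewritten
        self.paused = False
        self.hw_queue.extend(self.held)
        self.held.clear()
```

The point of the ordering is that no frame is ever in the hardware queue while its key slot is being rewritten; in a real driver the pause and drain steps are where the locking difficulty mentioned earlier in the thread would come in.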

The RX keycache bug shouldn't be triggered if you're not actively receiving packets to decrypt whilst you're changing the key - which you can guarantee if your AP is not doing stupid crap (ie, has this bug fixed) but you can't otherwise; that particular one will benefit from the keycache plumb hack from QCA. However, and here's the rub - it doesn't seem to work reliably if you're constantly hitting it with a stream of packets. It really needs to sneak in when no active RX is being done for that keycache entry.

alexw65500 commented 6 years ago

I've found the same issue with rekeying a PTK under load and am currently trying to upstream a fix for that. In fact there are different ways normal kernels can mess up the PN, but I think we found them all now. I've not yet tested it with mesh networks but I'm quite confident the current version of the patch will fix the issue for ath9k mesh also. And I would like to get more feedback on the patch, so if you want to test the current version here it is: https://patchwork.kernel.org/project/linux-wireless/list/?series=8045&state=*

You will get warnings that the userspace (wpa_supplicant) is requesting rekeys while it should not which will need patches to either ath9k or wpa_supplicant which simply are not available, yet. But the "fallback" path printing this warning is working quite nicely with an ath9k AP and I expect it to do the same for ath9k in a mesh.

When testing the patch I suggest you also make sure you have this fix applied: https://patchwork.kernel.org/patch/10399613/ While I believe ath9k will not trigger this race, the obvious symptoms are pretty much the same, so better make sure it's fixed also.

Edit: updated link

cozybit / authsae

Rekeying failure on a busy link #42