RFC8613: Handling re-transmission requests

mrdeep1 commented 2 years ago

If a CON request (or its ACK response) is lost, the CON request will get re-transmitted. If there are NON requests following the initial CON request, the sequence number of the re-transmission may then fall outside the replay window, as in:

A -x B: CON with sequence number n
A -> B: NON with sequence number n + 1
A -> B: NON with sequence number n + 2
...
A -> B: CON with sequence number n
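
The scenario above can be reproduced with a small simulation. This is only a sketch of a sliding replay window, not the data structure of any particular OSCORE implementation; the class `ReplayWindow`, the helper `retransmit_accepted` and the starting value n = 100 are made up for illustration.

```python
class ReplayWindow:
    """Sliding replay window over sequence numbers (simplified sketch)."""

    def __init__(self, size=32):
        self.size = size
        self.highest = -1   # highest sequence number accepted so far
        self.seen = set()   # accepted numbers that are still inside the window

    def accept(self, seq):
        """Record seq and return True if it is fresh, otherwise return False."""
        lowest = self.highest - self.size + 1
        if seq < lowest or seq in self.seen:
            return False    # older than the window, or already seen
        self.seen.add(seq)
        if seq > self.highest:
            # Slide the window forward and forget numbers that fell out of it.
            self.highest = seq
            lowest = self.highest - self.size + 1
            self.seen = {s for s in self.seen if s >= lowest}
        return True


def retransmit_accepted(later_messages, size=32):
    """Lose message n, deliver `later_messages` newer ones, then retry n."""
    window = ReplayWindow(size)
    n = 100  # arbitrary starting sequence number (assumption)
    for seq in range(n + 1, n + 1 + later_messages):
        assert window.accept(seq)
    return window.accept(n)


if __name__ == "__main__":
    print(retransmit_accepted(31))  # retransmit still inside the window
    print(retransmit_accepted(32))  # retransmit has fallen out of the window
```

With a window of 32, the late retransmission is still accepted after 31 newer messages and rejected after 32.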

A way forward here could be to increment the sequence number for each transmission (including re-transmission) so that everything stays within the replay window.

However, if the ACK response to a CON request is received by the client after the CON request has been re-transmitted, then when decoding the ACK, the external_aad will contain the updated Partial IV of the re-transmitted packet rather than that of the original packet, and decryption will fail.

A -> B: CON with sequence number n
A -- B: ACK initiated 
A -> B: CON re-transmit with sequence number n + 1
A <- B: ACK received for sequence n
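
A toy model of that mismatch follows. It is a sketch under loud assumptions: an HMAC tag from the Python standard library stands in for the AEAD, the key is a placeholder, and `external_aad` here carries only the request kid and request Partial IV, whereas the real structure in RFC 8613 has more fields. All function names are hypothetical.

```python
import hashlib
import hmac

KEY = b"hypothetical-shared-key"  # placeholder, not derived as OSCORE does


def external_aad(request_kid, request_piv):
    # Only the two fields that matter for this illustration.
    return b"kid=" + request_kid + b";piv=" + request_piv.to_bytes(5, "big")


def protect_response(payload, request_kid, request_piv):
    """Server side: bind the response to the Partial IV of the request it saw."""
    aad = external_aad(request_kid, request_piv)
    tag = hmac.new(KEY, aad + payload, hashlib.sha256).digest()[:8]
    return payload, tag


def verify_response(message, request_kid, request_piv):
    """Client side: verify against the Partial IV the client has on record."""
    payload, tag = message
    aad = external_aad(request_kid, request_piv)
    expected = hmac.new(KEY, aad + payload, hashlib.sha256).digest()[:8]
    return hmac.compare_digest(tag, expected)


if __name__ == "__main__":
    n = 7
    ack = protect_response(b"2.04", b"\x01", n)   # server answers request n
    print(verify_response(ack, b"\x01", n))       # client still tracks n
    print(verify_response(ack, b"\x01", n + 1))   # client moved on to n + 1
```

The response verifies only while the client still associates the exchange with sequence number n; once it has replaced n with n + 1, the same bytes fail verification.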

What is the correct way to handle a mix of CON and NON requests for OSCORE in a lossy environment?

chrysn commented 2 years ago

> increment the sequence number for each transmission (including re-transmission)

That would be dangerous, as now the server couldn't do cryptographically checked deduplication any more: if it receives both the original request and a retransmit, it processes them both even though they're intended to be processed once. (Plus, it negates a small performance advantage that OSCORE has over DTLS.)

> may then fall outside of the replay window

The default replay window is 32. The rate at which a client may send without seeing feedback from the server defaults to PROBING_RATE = 1 byte / second, and EXCHANGE_LIFETIME (maximum between start and when we expect a retransmit to come in) is 247 seconds. You can't send 32 OSCORE requests in 247 bytes.
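
The arithmetic behind that claim can be written out as follows. This is a back-of-the-envelope sketch: the per-request size is a deliberately optimistic lower bound (fixed CoAP header plus the 8-byte tag of AES-CCM-16-64-128, the default OSCORE AEAD) and ignores the token, the OSCORE option and the rest of the ciphertext. The function name is hypothetical.

```python
PROBING_RATE = 1         # bytes per second (RFC 7252 default)
EXCHANGE_LIFETIME = 247  # seconds (RFC 7252 default)
REPLAY_WINDOW = 32       # OSCORE default replay window size

COAP_HEADER = 4          # fixed CoAP header, in bytes
AEAD_TAG = 8             # tag length of AES-CCM-16-64-128, in bytes


def max_unanswered_requests(min_request_size=COAP_HEADER + AEAD_TAG):
    """Upper bound on requests that fit into the PROBING_RATE byte budget."""
    budget = PROBING_RATE * EXCHANGE_LIFETIME  # 247 bytes
    return budget // min_request_size


if __name__ == "__main__":
    print(max_unanswered_requests())                  # 20
    print(max_unanswered_requests() < REPLAY_WINDOW)  # True
```

Even with this generous lower bound of 12 bytes per request, at most 20 requests fit into the 247-byte budget, which is below the default window of 32.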

(I don't think that PROBING_RATE is super tight, so one might send 3 or 4 or maybe even 10 NONs in a burst (e.g. Q-Block, even though it uses different parameters), but as the average PROBING_RATE is still limiting, that'd cause quite some silence later.)

That's not a completely waterproof argument -- the NONs might get responses whereas all the CON retransmits fail, but that still requires an amount of unluckiness that borders on a malicious agent on the network deliberately slowing down CONs -- and in that case, getting an "unprotect failed" sounds like the right behavior.

One option is always to increase the replay window. Beware that the size of the replay window (at least its worst case behavior, or best case behavior depending on point of view) needs to be agreed between the parties, so if you plan to alter the window on any but preconfigured OSCORE contexts, you might need to register parameters for choosing the replay window added to OSCORE_Input_Material of ace-oscore-profile (or however the contexts are configured).

mrdeep1 commented 2 years ago

It was someone else that raised this as an issue, and I am not convinced that increasing the sequence number is the way to go.

> If it receives both the original request and a retransmit, it processes them both even though they're intended to be processed once

The same will happen if the returning ACK goes missing. The unwrapped message would be an identical duplicate, so there should be no issues there.

> That's not a completely waterproof argument ...

If someone updated NSTART to be > 1, then it is very easy to send a further 32 CON requests before the failing one gets re-transmitted.

Maybe we have to make all OSCORE requests CON and only allow NSTART to be 1.

chrysn commented 2 years ago

> and I am not convinced that increasing the sequence number is the way to go.

I don't understand; do you mean "increasing the replay window"?

> Same will happen if the returning ACK goes missing. The unwrapped message would be an identical duplicate, so there should be no issues there.

Not exactly: If an ACK goes missing, the retransmit (in the non-OSCORE case) has the same MID, and message deduplication steps in to prevent duplicate action.

Message deduplication in OSCORE cannot rely on MIDs (as they are not end-to-end, and deliberately not protected against a malicious or broken proxy), and deduplicates on sequence numbers instead.

> If someone updated NSTART to be > 1, then it is very easy to send a further 32 CON requests before the failing one gets re-transmitted.

If someone updated one parameter, they likely need to update others (as the original parameters are, to some extent, tuned to each other). Tuning the replay window along with it sounds pretty straightforward to me.

The mechanism that allows NSTART > 1 (we still don't have one, and that makes me sad) might also make new statements on the interleaving of retransmissions. I wouldn't be surprised if it allowed additional retransmissions of CONs that were long overdue on "tickets" of later successful requests, and then the sender (especially one that is aware of sequence numbers and replay window size) would preferably retransmit pending requests rather than sending yet another new one, also ensuring good use of buffers and good backpressure.

> Maybe we have to make all OSCORE requests CON and only allow NSTART to be 1.

I think that'd be excessive on two independent accounts:

- Tuning parameters needs to be done with care; you can't expect to change one and keep all the system's properties. (For comparison, there have been two attempts (CoCoA and FASOR) so far to tune the retransmission parameters that are IMO easier than NSTART to change, and these have been dragging out for longer than I'd like -- and both had considerable research done before "just setting that parameter differently").

  The analogy of flanging a power drill to a bike and then complaining that the brakes are too weak comes to mind, if you'll pardon a ridiculous comparison.

- That failure mode is not specific to OSCORE. With plain CoAP and interleaved requests you can wind up with one just "timing out" even though others around it go through. With OSCORE, this just happens differently (in that a decryption error comes back) and a bit earlier (not after EXCHANGE_LIFETIME but if and when one of the 'hard' requests' retransmissions does go through and there have been 32 others in between).

(Going all CON helps against the OSCORE-window error, but not against the bad-luck-with-retransmissions error. Even a TCP connection can be terminated by bad luck under these conditions, and then the application ends up there.)

In both cases, the application cannot tell whether or not the request was processed, and has to fall back to deciding whether and when another attempt can be made at application level (for non-idempotent requests, possibly after inspecting the current state of affairs).

mrdeep1 commented 2 years ago

> > and I am not convinced that increasing the sequence number is the way to go.

> I don't understand; do you mean "increasing the replay window"?

I meant increasing the sequence number on a re-transmit - I am likewise not convinced that this is right. The replay window size is the thing to adjust.

Thanks for confirming with all the other information - I know which way to proceed short term.

kkrentz commented 2 years ago

Unfortunately, the issue mentioned by mrdeep is not the only one that arises when not incrementing the sequence number of a retransmitted CON. Another consequence will be that if an ACK gets lost, retransmitted CONs get ignored. So, clients may retransmit pointlessly. I also wonder if this behavior can be turned into a denial-of-sleep attack as dropping a single ACK will cause a high energy consumption. Nevertheless, I now see the problem with the unprotected MIDs.

chrysn commented 2 years ago

If the ACK is lost, the retransmitted CON hits the replay protection (b/c POST is not idempotent), and the response stored in the replay protection gets sent again.

(It may be possible to safely implement things without storing the response, but that's just some ideas in https://datatracker.ietf.org/doc/draft-amsuess-lwig-oscore/ )
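
The behavior described above, where a retransmitted request hits the replay protection and gets the stored response again, might look roughly like this. It is a minimal sketch and not taken from any CoAP or OSCORE library: the class name, the response strings and the eviction policy are assumptions, and the "4.01 Replay detected" reply for requests older than the window follows the flow discussed in this thread.

```python
class OscoreServerSketch:
    """Deduplicates on sequence numbers and replays cached responses."""

    def __init__(self, window_size=32):
        self.window_size = window_size
        self.highest = -1
        self.responses = {}   # seq -> cached response, for seqs inside the window
        self.executions = 0   # how often the (non-idempotent) handler really ran

    def handle(self, seq, request):
        if seq in self.responses:
            # Retransmission of a request already processed: resend, do not re-run.
            return self.responses[seq]
        if seq <= self.highest - self.window_size:
            # Older than the window: nothing cached, so it cannot be told apart
            # from a replay.
            return "4.01 Replay detected"
        self.executions += 1
        response = f"2.04 Changed (run {self.executions})"
        self.responses[seq] = response
        self.highest = max(self.highest, seq)
        # Drop cached responses that have fallen out of the window.
        floor = self.highest - self.window_size
        self.responses = {s: r for s, r in self.responses.items() if s > floor}
        return response


if __name__ == "__main__":
    server = OscoreServerSketch()
    first = server.handle(5, "POST /counter")
    again = server.handle(5, "POST /counter")   # ACK was lost, client retries
    print(first == again, server.executions)    # True 1
```

The cost mrdeep1 raises in the next comment is visible here: the cache holds up to `window_size` responses, and the server has to decide when to evict them.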

mrdeep1 commented 2 years ago

Storing the set of responses for the whole replay window in case they are needed for the unlikely re-transmit case (with all the necessary garbage collection) seems to be overkill.

How about making use of the separate response?

Request CON ->
<- Empty ACK (gets lost)
<- Response CON
Empty ACK ->
Request CON retry 1 ->
<- Response 4.01 (Replay detected)

The retry will not take place if the Response CON is received as there will be a token match. If the Response CON fails, then that will get retried.

The OSCORE encrypt logic can take the response from the server application and inject the empty ACK along with selecting a new MID for the actual response.

Obviously, an unsolicited response is not preceded by an initial empty ACK, and neither is a response from an application that already does separate responses (i.e. a proxy).

chrysn commented 2 years ago

Storing the set of responses is a general requirement of CoAP for non-idempotent requests. Sending a non-piggybacked response is generally a thing that a CoAP implementation that handles non-idempotent requests can do to keep the storage time low. OSCORE is just a special case of that, and the encrypt logic does not need to be involved. It's a choice of the CoAP stack below OSCORE to make.

The only input that an OSCORE implementation could give is to inform the CoAP stack that it does regard the request as an idempotent one, freeing the CoAP stack from all obligations to deduplicate or store responses. But to become idempotent it needs a replay optimization that'd need more review.

core-wg / corrclar

RFC8613: Handling re-transmission requests #24