dtaht / sch_cake

Out of tree build for the new cake qdisc

Ingress host/flow isolation difficulties #145

Open xnoreq opened 3 years ago

xnoreq commented 3 years ago

Hello,

I have a router with

eth0.2 has an ingress qdisc and a tc filter with action mirred egress redirect dev ifb0.

ifb0 has a cake qdisc with a 100 Mbps bandwidth limit and the options ingress dual-dsthost nat besteffort.

Now, a heavy UDP flow (one that consumes all available bandwidth) and a light one (a few kB/s), both from a single LAN host to two different hosts on the Internet, result in packet drops in the light flow. Why? Shouldn't dual-dsthost provide flow isolation and fairness across those two flows, and therefore drop from the heavy flow instead of the light one?

Is there a better workaround than using iptables hashlimit to detect heavy flows (basically doing what cake should do), marking them, and then using something like diffserv3 with another tc filter to override the tin and "de-prioritize" those flows?

Or is something broken here?

chromi commented 3 years ago

On 1 Nov, 2020, at 9:07 pm, xnoreq notifications@github.com wrote:

I have a router with

• eth0.2 ... connects to WAN
• eth0.1, wifi0 ... bridged to br-lan ... bridge for LAN

eth0.2 has an ingress qdisc and a tc filter with action mirred egress redirect dev ifb0.

ifb0 has a cake qdisc with a 100 Mbps bandwidth limit and the options ingress dual-dsthost nat besteffort.

Now, a heavy UDP flow (one that consumes all available bandwidth) and a light one (a few kB/s), both from a single LAN host to two different hosts on the Internet, result in packet drops in the light flow.

You don't mention having any egress qdisc on eth0.2 - that is what would control traffic in that direction. That's what you would get with "tc qdisc replace dev eth0.2 root cake …"

xnoreq commented 3 years ago

Sorry, my wording was bad. The flows are downloads and in my example only ingress is relevant.

xnoreq commented 3 years ago

To clarify again: two flows which are basically "downloads". I'm looking at ingress only because that's where packets are discarded when the bandwidth limit is hit. (Of course there's also egress, and I do have another cake instance set up there as well, but this is irrelevant for the example because data is sent well below the egress limit.)

One "connection" is greedy in the sense that it will use the full 100 Mbps if available while the other is fairly constant and uses less than ~0.5 Mbps.

I am seeing packets getting dropped from the <0.5 Mbps connection. Why is this happening?

xnoreq commented 3 years ago

Looking through the code, I do have some questions. For example, why is there no difference between one or zero bulk flows in terms of deficit? On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented? Same question for cake_drop function. Why is deficit only added to in case it was negative and not for every flow for every time we dequeue/drop? Doesn't this cause unfairness? Could a bursty flow constantly go into decaying state or be removed and re-added as a new sparse flow, effectively ruining fairness/isolation? If that is happening then it also doesn't help that the deficit is reset for every "new" flow.

Thanks.

tohojo commented 3 years ago

xnoreq notifications@github.com writes:

Looking through the code, I do have some questions. For example, why is there no difference between one or zero bulk flows in terms of deficit?

Eh? What do you mean?

On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented? Same question for cake_drop function.

You have this backwards: deficit and shaper are only advanced on drop if the ingress flag is set. The counters track how much data the flow actually transmits, and the logic is that on ingress, the packet has already traversed the bottleneck, so it's counted against what the flow has transmitted even though we drop it.

Why is deficit only added to in case it was negative and not for every flow for every time we dequeue/drop? Doesn't this cause unfairness?

No, it doesn't. The refilling of the deficit is done on each rotation of the queue order; this is how DRR scheduling works. See https://en.wikipedia.org/wiki/Deficit_round_robin
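For reference, the textbook DRR loop can be sketched in a few lines of Python (a toy model of the algorithm from the Wikipedia article, not cake's kernel code):

```python
from collections import deque

def drr(queues, quantum):
    """Toy Deficit Round Robin: each non-empty queue gets `quantum` bytes of
    credit once per rotation and may dequeue packets as long as the credit
    covers them. `queues` is a list of packet-size lists; returns the
    (queue_index, packet_size) service order."""
    qs = [deque(p) for p in queues]
    deficit = [0] * len(qs)
    order = []
    while any(qs):
        for i, q in enumerate(qs):
            if not q:
                deficit[i] = 0       # empty queues keep no credit
                continue
            deficit[i] += quantum    # refilled once per rotation
            while q and q[0] <= deficit[i]:
                deficit[i] -= q[0]
                order.append((i, q.popleft()))
    return order
```

With quantum 1500, a flow of two 1500-byte packets and a flow of six 500-byte packets each move 1500 bytes per rotation, so byte counts stay balanced per rotation regardless of packet size.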

Could a bursty flow constantly go into decaying state or be removed and re-added as a new sparse flow, effectively ruining fairness/isolation? If that is happening then it also doesn't help that the deficit is reset for every "new" flow.

Yes, that's the idea. Each time a flow goes away it'll be 'sparse' the next time it comes around. This is a feature; and the decay makes sure that a flow can only do this if it uses less than its fair share of the link, so it doesn't ruin fairness.

xnoreq commented 3 years ago

xnoreq notifications@github.com writes: Looking through the code, I do have some questions. For example, why is there no difference between one or zero bulk flows in terms of deficit?

Eh? What do you mean?

host_load = 1;
host_load = max(host_load, dsthost->dsthost_bulk_flow_count);
flow->deficit += (b->flow_quantum * quantum_div[host_load] + (prandom_u32() >> 16)) >> 16;

Therefore zero and one bulk_flow_count result in the same host_load.
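A quick userspace check of that arithmetic (a Python sketch; CAKE_QUEUES = 1024 and the omission of the prandom_u32() rounding term are assumptions here, not the kernel code):

```python
# Sketch of cake's per-flow quantum arithmetic in userspace Python.
CAKE_QUEUES = 1024  # assumed value of the kernel constant

quantum_div = [0] * (CAKE_QUEUES + 1)
for i in range(1, CAKE_QUEUES + 1):   # mirrors the loop that starts at i = 1
    quantum_div[i] = 65535 // i

def deficit_credit(flow_quantum, bulk_flow_count):
    """Credit added to flow->deficit on refill, random term dropped."""
    host_load = max(1, bulk_flow_count)   # the clamp in question
    return (flow_quantum * quantum_div[host_load]) >> 16

# 0 and 1 bulk flows on the host are indistinguishable:
assert deficit_credit(1514, 0) == deficit_credit(1514, 1)
```

From two bulk flows upward the credit scales down as expected; only the 0-vs-1 case is collapsed by the clamp.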

Why not start from zero and change the initialization code of quantum_div from: https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2894-L2896 to something like this:

    for (i = 0; i <= CAKE_QUEUES; i++)
        quantum_div[i] = 65535 / (i+1);

Also, looking at the access to quantum_div with an index based on bulk_flow_count, what prevents an out-of-bounds access here if bulk_flow_count goes above CAKE_QUEUES?

On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented? Same question for cake_drop function.

You have this backwards: deficit and shaper are only advanced on drop if the ingress flag is set. The counters track how much data the flow actually transmits, and the logic is that on ingress, the packet has already traversed the bottleneck, so it's counted against what the flow has transmitted even though we drop it.

I understand that and I guess that it would be even better if that accounting could be done as soon as the packet is received in ingress mode rather than later when the packet is dropped/transmitted. But back to my point, where is deficit supposed to be touched here? https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L1906-L1917 cake_advance_shaper does not touch deficit either, hence why it's decremented separately e.g. in cake_dequeue after calling cake_advance_shaper: https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2295-L2300 And it's not touched in cake_drop either: https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L1650-L1652

My point is that this seems to have been overlooked for the ACK-filtering.

Also speaking of ACK-filtering, I assume that there is a good technical reason why ACK-filtering doesn't happen in the GSO case?

Why is deficit only added to in case it was negative and not for every flow for every time we dequeue/drop? Doesn't this cause unfairness?

No, it doesn't. The refilling of the deficit is done on each rotation of the queue order; this is how DRR scheduling works. See https://en.wikipedia.org/wiki/Deficit_round_robin

Reading through the code more carefully I think I now get how this is working. Adding to the deficit counter only after it's become <= 0 is a smart way to not have to remember whether the deficit was increased between consecutive visits of the flow, as would be necessary if it was implemented like the pseudo-code on Wikipedia. So what's left to ensure for fairness is that all flows are visited in order, which may or may not be the case. I have to think this through some more.
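That bookkeeping trick can be sketched as follows (a toy Python model of the refill-on-exhaustion variant just described, not cake's actual code; that each flow starts with one quantum of credit is an assumption):

```python
from collections import deque

def cake_style_drr(queues, quantum):
    """Toy refill-on-exhaustion DRR: credit is topped up only when the flow
    at the head has run its deficit to <= 0, at which point the flow takes
    one trip to the back of the rotation. Returns the (flow, size) order."""
    flows = deque([i, deque(p), quantum] for i, p in enumerate(queues))
    order = []
    while flows:
        f = flows[0]
        if not f[1]:
            flows.popleft()      # empty flow leaves the rotation
        elif f[2] <= 0:
            f[2] += quantum      # refill only once the deficit is exhausted
            flows.rotate(-1)     # ...and go to the back of the rotation
        else:
            pkt = f[1].popleft()
            f[2] -= pkt
            order.append((f[0], pkt))
    return order
```

For two flows (two 1500-byte packets vs. six 500-byte packets, quantum 1500) this yields the same alternating per-rotation service as the textbook formulation, without tracking whether a refill happened between visits.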

Could a bursty flow constantly go into decaying state or be removed and re-added as a new sparse flow, effectively ruining fairness/isolation? If that is happening then it also doesn't help that the deficit is reset for every "new" flow.

Yes, that's the idea. Each time a flow goes away it'll be 'sparse' the next time it comes around. This is a feature; and the decay makes sure that a flow can only do this if it uses less than its fair share of the link, so it doesn't ruin fairness.

But if that's the case, what's preventing such a bursty flow from starving a much more well-behaved and lower rate flow? That scenario would actually fit the behavior that I observed.

tohojo commented 3 years ago

xnoreq notifications@github.com writes:

xnoreq notifications@github.com writes: Looking through the code, I do have some questions. For example, why is there no difference between one or zero bulk flows in terms of deficit?

Eh? What do you mean?

host_load = 1;
host_load = max(host_load, dsthost->dsthost_bulk_flow_count);
flow->deficit += (b->flow_quantum * quantum_div[host_load] + (prandom_u32() >> 16)) >> 16;

Therefore zero and one bulk_flow_count result in the same host_load.

Why not start from zero and change the initialization code of quantum_div from: https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2894-L2896 to something like this:

  for (i = 0; i <= CAKE_QUEUES; i++)
      quantum_div[i] = 65535 / (i+1);

*shrugs*, because that is not how the code was written?

Also, looking at the access to quantum_div with an index based on bulk_flow_count, what prevents an out-of-bounds access here if bulk_flow_count goes above CAKE_QUEUES?

How would bulk_flow_count go above CAKE_QUEUES?

On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented? Same question for cake_drop function.

You have this backwards: deficit and shaper are only advanced on drop if the ingress flag is set. The counters track how much data the flow actually transmits, and the logic is that on ingress, the packet has already traversed the bottleneck, so it's counted against what the flow has transmitted even though we drop it.

I understand that and I guess that it would be even better if that accounting could be done as soon as the packet is received in ingress mode rather than later when the packet is dropped/transmitted. But back to my point, where is deficit supposed to be touched here? https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L1906-L1917 cake_advance_shaper does not touch deficit either, hence why it's decremented separately e.g. in cake_dequeue after calling cake_advance_shaper: https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2295-L2300 And it's not touched in cake_drop either: https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L1650-L1652

My point is that this seems to have been overlooked for the ACK-filtering.

Why would you combine ingress mode and ACK filtering? That makes no sense...

Also speaking of ACK-filtering, I assume that there is a good technical reason why ACK-filtering doesn't happen in the GSO case?

The ACK filter only drops pure ACKs; those are not GSO packets...

Why is deficit only added to in case it was negative and not for every flow for every time we dequeue/drop? Doesn't this cause unfairness?

No, it doesn't. The refilling of the deficit is done on each rotation of the queue order; this is how DRR scheduling works. See https://en.wikipedia.org/wiki/Deficit_round_robin

Reading through the code more carefully I think I now get how this is working. Adding to the deficit counter only after it's become <= 0 is a smart way to not have to remember whether the deficit was increased between consecutive visits of the flow, as would be necessary if it was implemented like the pseudo-code on Wikipedia.

So what's left to ensure for fairness is that all flows are visited in order, which may or may not be the case. I have to think this through some more.

Yes, they are.

Could a bursty flow constantly go into decaying state or be removed and re-added as a new sparse flow, effectively ruining fairness/isolation? If that is happening then it also doesn't help that the deficit is reset for every "new" flow.

Yes, that's the idea. Each time a flow goes away it'll be 'sparse' the next time it comes around. This is a feature; and the decay makes sure that a flow can only do this if it uses less than its fair share of the link, so it doesn't ruin fairness.

But if that's the case, what's preventing such a bursty flow from starving a much more well-behaved and lower rate flow? That scenario would actually fit the behavior that I observed.

The decay is the thing that prevents this, by delaying the removal of the flow state...

xnoreq commented 3 years ago

How would bulk_flow_count go above CAKE_QUEUES?

... You could have just pointed me to cake_hash and how it caps src/dsthost_bulk_flow_count by decrementing those counters in case of hash collisions with bulk queues. Which also seems to imply that a host with 1x CAKE_QUEUES flows is not treated differently from a host with e.g. 5x CAKE_QUEUES flows. Though as cake seems to be designed for (powerful) home routers this probably doesn't matter.

Why would you combine ingress mode and ACK filtering? That makes no sense...

I don't, and I don't know. But it is partly implemented. Why is it in the code like that?

The ACK filter only drops pure ACKs; those are not GSO packets...

Of course, thanks.

Yes, they are. The decay is the thing that prevents this, by delaying the removal of the flow state...

Are you talking about CAKE_SET_DECAYING here? In cake_enqueue, such decaying flows are treated like new (sparse) flows and their deficit is reset. That's also something I was asking about earlier. I see that when the flow is set to decaying, the goto begin restarts dequeuing, so another flow might get a packet submitted or dropped, but if a new packet arrives for the decaying flow in the meantime then it's again processed first. Since the deficit was reset the flow can now also send a full quantum of bytes again. Doesn't this give an unfair advantage to such flows?

tohojo commented 3 years ago

xnoreq notifications@github.com writes:

How would bulk_flow_count go above CAKE_QUEUES?

... You could have just pointed me to cake_hash and how it caps src/dsthost_bulk_flow_count by decrementing those counters in case of hash collisions with bulk queues.

Well, you seemed perfectly capable of looking at the code yourself :)

Which also seems to imply that a host with 1x CAKE_QUEUES flows is not treated differently from a host with e.g. 5x CAKE_QUEUES flows. Though as cake seems to be designed for (powerful) home routers this probably doesn't matter.

Erm, not sure what you mean here...

Why would you combine ingress mode and ACK filtering? That makes no sense...

I don't and I don't know. But it is implemented partly. Why is it in the code like that?

Because we didn't think that anyone would turn on both ingress mode and ack filter at the same time, I suppose?

Yes, they are. The decay is the thing that prevents this, by delaying the removal of the flow state...

Are you talking about CAKE_SET_DECAYING here? In cake_enqueue, such decaying flows are treated like new (sparse) flows and their deficit is reset. That's also something I was asking about earlier. I see that when the flow is set to decaying, the goto begin restarts dequeuing, so another flow might get a packet submitted or dropped, but if a new packet arrives for the decaying flow in the meantime then it's again processed first. Since the deficit was reset the flow can now also send a full quantum of bytes again. Doesn't this give an unfair advantage to such flows?

Yeah, this one is a bit subtle: The thing that prevents starvation is that the state is not cleared immediately; once a queue runs empty, it'll be put into the set of decaying flows, and stay there until its codel 'count' has gone back down to zero. So if another packet for that flow arrives before that happens, it'll keep the same deficit and won't be able to gain any advantage.

xnoreq commented 3 years ago

Are you talking about CAKE_SET_DECAYING here? In cake_enqueue, such decaying flows are treated like new (sparse) flows and their deficit is reset. That's also something I was asking about earlier. I see that when the flow is set to decaying, the goto begin restarts dequeuing, so another flow might get a packet submitted or dropped, but if a new packet arrives for the decaying flow in the meantime then it's again processed first. Since the deficit was reset the flow can now also send a full quantum of bytes again. Doesn't this give an unfair advantage to such flows?

Yeah, this one is a bit subtle: The thing that prevents starvation is that the state is not cleared immediately; once a queue runs empty, it'll be put into the set of decaying flows, and stay there until its codel 'count' has gone back down to zero. So if another packet for that flow arrives before that happens, it'll keep the same deficit and won't be able to gain any advantage.

I understand what you're saying but I don't see this reflected in the code. Let me repeat: When a new packet arrives for a decaying flow, it is treated almost exactly like a new flow:

1. it's added/moved to the end of new_flows
2. its deficit is reset

1 & 2 mean that in cake_dequeue it gets into the AQM part first, before any other old_flows. If the flow then sends a packet and keeps decaying the whole process repeats. As I wrote before, after the flow is set to decaying, there is a goto that will result in another flow being processed, but this is potentially another such "new" flow

I took openwrt's sch_cake and made a couple of modifications to the ingress mode. I will probably upload this to a new repo after more testing on a home router. With the modifications I can run in besteffort and without any iptables rules (to classify the traffic into different tins), which reduces the CPU load (which was actually an overload) to a small fraction. Plus a couple of other hacks.

One thing that is somewhat of a mystery to me is q->failsafe_next_packet. I see that this was introduced with ingress mode, to help in scenarios where a lot of packets are dropped? Wouldn't this just advance time_next_packet a bit further into the future? Could you please explain in which situation this failsafe operates? Thanks.

tohojo commented 3 years ago

I understand what you're saying but I don't see this reflected in the code. Let me repeat: When a new packet arrives for a decaying flow, it is treated almost exactly like a new flow.

1. it's added/moved to the end of `new_flows`

2. its deficit is reset

1 & 2 mean that in cake_dequeue it gets into the AQM part first, before any other old_flows. If the flow then sends a packet and keeps decaying the whole process repeats. As I wrote before, after the flow is set to decaying, there is a goto that will result in another flow being processed, but this is potentially another such "new" flow

Huh, I think it's actually worse than that: By assumption, a sparse flow doesn't trigger the AQM at all. So a small packet (size < quantum) will get enqueued, the flow will go to new_flows, and on dequeue that packet will get dequeued immediately, but the flow will stay at the head of new_flows. On the next dequeue, the flow will actually be empty, so cake_dequeue_one() will return NULL, and the logic will hit the else branch:

https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2263-L2284

This will take the flow straight back to empty, causing a new arriving packet to move it back to the front. So a sparse flow that times its packets right (to arrive just after the flow state expires), can potentially cause persistent starvation of the other flows.

@chromi any reason why the above is wrong?

As for the decaying mode, I'm not so worried about that clearing the deficit on re-activation. I mean, it's a bit weird (why are we keeping all the other state but clearing the deficit?), but at least it won't cause starvation because that flow has already been through the old flows rotation at least once...

I took openwrt's sch_cake and made a couple of modifications to the ingress mode. I will probably upload this to a new repo after more testing on a home router. With the modifications I can run in besteffort and without any iptables rules (to classify the traffic into different tins), which reduces the CPU load (which was actually an overload) to a small fraction. Plus a couple of other hacks.

Feel free to open pull requests with any changes you want to share.

One thing that is somewhat of a mystery to me is q->failsafe_next_packet. I see that this was introduced with ingress mode, to help in scenarios where a lot of packets are dropped? Wouldn't this just advance time_next_packet a bit further into the future? Could you please explain in which situation this failsafe operates? Thanks.

I think this was added because of some potential starvation issues. @chromi ?

xnoreq commented 3 years ago

@tohojo I think that's correct and that it's possible to construct malicious traffic patterns that cause total starvation, but I was just concerned with a practical case in my application where I saw a low bandwidth flow (e.g. from an online game with periodic updates in form of small UDP packets) dropping packets even though total bandwidth was not even remotely exhausted.

I do have some other hacks like differentiation of bulk or sparse decaying flows, doing deficit accounting asap (in enqueue) in ingress mode, not resetting the deficit etc. running on an openwrt router, but as I'm not doing any scientific measurements (wouldn't know where to start) I cannot make concrete statements about the effects of any of those. All I can say is "works better for me on my home router". I'll maybe upload something later in the evening.

I guess that the deficit (counter) is cleared because that's what the DRR algorithm does when a queue is empty. Yeah, I know, that's another one of those "because" answers.

chromi commented 3 years ago

On 10 Nov, 2020, at 11:52 am, Toke Høiland-Jørgensen notifications@github.com wrote:

I understand what you're saying but I don't see this reflected in the code. Let me repeat: When a new packet arrives for a decaying flow, it is treated almost exactly like a new flow.

  1. it's added/moved to the end of new_flows

  2. its deficit is reset

1 & 2 mean that in cake_dequeue it gets into the AQM part first, before any other old_flows. If the flow then sends a packet and keeps decaying the whole process repeats. As I wrote before, after the flow is set to decaying, there is a goto that will result in another flow being processed, but this is potentially another such "new" flow

Huh, I think it's actually worse than that: By assumption, a sparse flow doesn't trigger the AQM at all. So a small packet (size < quantum) will get enqueued, the flow will go to new_flows, and on dequeue that packet will get dequeued immediately, but the flow will stay at the head of new_flows. On the next dequeue, the flow will actually be empty, so cake_dequeue_one() will return NULL, and the logic will hit the else branch:

https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2263-L2284

This will take the flow straight back to empty, causing a new arriving packet to move it back to the front. So a sparse flow that times its packets right (to arrive just after the flow state expires), can potentially cause persistent starvation of the other flows.

@chromi any reason why the above is wrong?

If there is time for the state to expire between packets, then the flow is sparse, and it is appropriate to prioritise it. The scenario you describe would actually require multiple flows to each have their timing fall in a pretty small window of opportunity, if it exists at all.

I'll freely admit that the logic here is more complex and harder to follow than I'd like. However, it's in a state that works well in practice, and which would require a clean-slate rework to improve appreciably from this point without risking counter-intuitive effects. I would have to prototype that elsewhere in order to establish that a new design worked at least as well in a wide variety of conditions.

As for the decaying mode, I'm not so worried about that clearing the deficit on re-activation. I mean, it's a bit weird (why are we keeping all the other state but clearing the deficit?), but at least it won't cause starvation because that flow has already been through the old flows rotation at least once...

The purpose of the "decaying" state is to ensure that COBALT returns to a quiescent state soon after a flow goes away, rather than requiring more work when a new flow starts up in the same slot. It tries to take time that would normally be used to process bulk flows.

I took openwrt's sch_cake and made a couple of modifications to the ingress mode. I will probably upload this to a new repo after more testing on a home router. With the modifications I can run in besteffort and without any iptables rules (to classify the traffic into different tins), which reduces the CPU load (which was actually an overload) to a small fraction. Plus a couple of other hacks.

Feel free to open pull requests with any changes you want to share.

One thing that is somewhat of a mystery to me is q->failsafe_next_packet. I see that this was introduced with ingress mode, to help in scenarios where a lot of packets are dropped? Wouldn't this just advance time_next_packet a bit further into the future? Could you please explain in which situation this failsafe operates? Thanks.

I think this was added because of some potential starvation issues. @chromi ?

Yes, without this feature, it's possible to get into a state where the AQM is dropping all the packets and making no actual deliveries. That can't happen without ingress mode, because normally a fresh packet would be chosen to fill the same downstream timeslot. But in ingress mode, the assumption is that a packet in the queue has already occupied some upstream timeslot, which cannot be taken back.

So the failsafe is just that - a way to keep some traffic moving in the case where the above might otherwise occur. NB: with ECN you can have congestion control without dropping packets, but there's still a lot of traffic that doesn't understand ECN yet.

xnoreq commented 3 years ago

Regarding the next_packet timestamps in ingress mode: As far as I understand the global time_next_packet is making a step proportional to the bandwidth used (by transmitting or dropping - it's the same in ingress mode). As long as something is queued up, it is not reset or synchronized to the current time. As this is used for scheduling, it should result in dequeuing at a rate that matches the configured global rate.

failsafe_next_packet is making a larger (1.5x) step, but only when packets are transmitted. So if very little or nothing is dropped for a while then failsafe_next_packet will be ahead of time_next_packet. Theoretically this advance could grow to hundreds of ms.

On the other hand, if packets are mostly dropped then time_next_packet will happily step forward, which would mean nothing would get dequeued (transmitted) for a potentially long timespan, but failsafe_next_packet will only make a single step on the final transmission in cake_dequeue.

As min(time_next_packet, failsafe_next_packet) is used for scheduling, this supposedly fixes the problem of nothing getting dequeued for a potentially long timespan.

The issue I see here is that if the situation changes from mostly transmitting for a while to dropping a lot of packets. If failsafe was ahead by a lot (let's say tens of ms) and we start dropping lots of packets then you are exactly in the same situation that this is supposed to fix. min(time_next_packet, failsafe_next_packet) could be tens of ms ahead.

Assuming my understanding of all of this is correct, why is failsafe_next_packet not simply set to now + some max permissible inactivity duration?
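A toy Python model of the two clocks as described in this thread (the exact constants and update points in sch_cake may differ; the 1.5x factor and the min() are the parts under discussion):

```python
def shaper_step(state, pkt_ns, dropped):
    """Toy model of the two shaper clocks (not the kernel code; simplified).
    In ingress mode every packet, sent or dropped, advances
    time_next_packet, while failsafe_next_packet advances only on an
    actual transmission, with a 1.5x allowance."""
    state['time_next'] += pkt_ns
    if not dropped:
        state['failsafe_next'] += int(pkt_ns * 1.5)

def can_dequeue(state, now):
    # Scheduling uses the earlier of the two deadlines.
    return now >= min(state['time_next'], state['failsafe_next'])

st = {'time_next': 0, 'failsafe_next': 0}
for _ in range(100):
    shaper_step(st, 10_000, dropped=True)   # a burst of drops

# time_next_packet has raced far ahead, but the failsafe keeps traffic moving:
assert st['time_next'] == 1_000_000
assert can_dequeue(st, now=0)
```

It also makes the concern above concrete: if failsafe_next has previously been pushed tens of ms ahead by a transmit-heavy period, min() offers no rescue during a subsequent drop-heavy period.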

tohojo commented 3 years ago

If there is time for the state to expire between packets, then the flow is sparse, and it is appropriate to prioritise it. The scenario you describe would actually require multiple flows to each have their timing fall in a pretty small window of opportunity, if it exists at all.

I'll freely admit that the logic here is more complex and harder to follow than I'd like. However, it's in a state that works well in practice, and which would require a clean-slate rework to improve appreciably from this point without risking counter-intuitive effects. I would have to prototype that elsewhere in order to establish that a new design worked at least as well in a wide variety of conditions.

I think something like this patch would help with the starvation issue (which I don't think is quite as benign as you make it sound; there's a reason fq_codel safeguards against this):

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 7d37638ee1c7..f79ed99a5166 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -2129,11 +2129,17 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
                                        b->decaying_flow_count++;
                                }
                                flow->set = CAKE_SET_DECAYING;
+                       } else if (flow->set == CAKE_SET_SPARSE) {
+                               /* sparse flows need a round on the old_flows
+                                * list before removal, to avoid the possibility
+                                * of starvation
+                                */
+                               list_move_tail(&flow->flowchain, &b->old_flows);
+                               flow->set = CAKE_SET_SPARSE_WAIT;
                        } else {

Will test this and submit a proper patch :)
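In Python pseudocode terms (a sketch of the patch's intent, not the kernel code; the state names mirror the flags above but the surrounding logic is simplified), the change makes an emptied sparse flow take one trip through old_flows instead of being deleted outright:

```python
def on_queue_empty(flow, old_flows):
    """Sketch of the patch above: when a sparse flow's queue runs dry, park
    it on old_flows as SPARSE_WAIT for one rotation instead of deleting its
    state, so an immediately re-arriving packet cannot restart it as a
    brand-new sparse flow."""
    if flow['set'] == 'SPARSE':
        flow['set'] = 'SPARSE_WAIT'
        old_flows.append(flow)     # one round in the bulk rotation first
    else:
        flow['set'] = 'EMPTY'      # other flows leave the rotation as before

flow, old_flows = {'set': 'SPARSE'}, []
on_queue_empty(flow, old_flows)
assert flow['set'] == 'SPARSE_WAIT' and old_flows == [flow]
```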

xnoreq commented 3 years ago

@tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk rotation". If a packet is enqueued for this flow after hitting your code then it will be categorized as a BULK flow. I'm not sure that's what you want for a sparse flow that potentially didn't even enter the bulk rotation.

tohojo commented 3 years ago

xnoreq notifications@github.com writes:

@tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk rotation". If a packet is enqueued for this flow after hitting your code then it will be categorized as a BULK flow. I'm not sure that's what you want for a sparse flow that potentially didn't even enter the bulk rotation.

Yeah, that is what we want: if a packet arrives for that flow while it is going through the rotation, then it will continue being scheduled as a bulk flow, because then it wasn't really sparse.

BTW, none of all of the above explains your original issue with packet drops on the sparse flow, so I'll be interested to know which of your changes actually fixed that... Feel free to just open a pull request with a bunch of commits, and we can sort through them and pick out those which appear to warrant more testing :)

chromi commented 3 years ago

On 11 Nov, 2020, at 1:06 am, Toke Høiland-Jørgensen notifications@github.com wrote:

@tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk rotation". If a packet is enqueued for this flow after hitting your code then it will be categorized as a BULK flow. I'm not sure that's what you want for a sparse flow that potentially didn't even enter the bulk rotation.

Yeah, that is what we want: if a packet arrives for that flow while it is going through the rotation, then it will continue being scheduled as a bulk flow, because then it wasn't really sparse.

Actually, no. For that to be true, you need the additional condition that the deficit has expired. In DRR++, that is what moves a flow from sparse to bulk status. So your suggested fix materially makes the sparseness condition more stringent than intended.

This is exactly the sort of unintended effect that I warned against earlier.

tohojo commented 3 years ago

Jonathan Morton notifications@github.com writes:

On 11 Nov, 2020, at 1:06 am, Toke Høiland-Jørgensen notifications@github.com wrote:

@tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk rotation". If a packet is enqueued for this flow after hitting your code then it will be categorized as a BULK flow. I'm not sure that's what you want for a sparse flow that potentially didn't even enter the bulk rotation.

Yeah, that is what we want: if a packet arrives for that flow while it is going through the rotation, then it will continue being scheduled as a bulk flow, because then it wasn't really sparse.

Actually, no. For that to be true, you need the additional condition that the deficit has expired. In DRR++, that is what moves a flow from sparse to bulk status. So your suggested fix materially makes the sparseness condition more stringent than intended.

That would only impact sparse flows that send several tiny packets with a combined size less than the quantum within one rotation. Which I suppose could be a concern, actually, but it would also be fairly straightforward to fix, I think: just re-promote the flow out of the old queue rotation on enqueue if it still has a positive deficit.

Whereas with the existing code you can use careful timing to get half of the link bandwidth regardless of the number of flows, like so:

Assume N bulk flows all using quantum-size packets, constantly backlogged. Flow S packet comes in, goes into the sparse rotation, is immediately dequeued. On the next dequeue, the state for flow S is cleared, and a bulk packet is dequeued and will start transmission. While that packet is dequeued, flow S sends another packet, which will be put to the head of the queue and dequeued next.

So with this, flow S can time its packets so that it is serviced on every other dequeue, which will give it a bandwidth share of s/(s+n) where s is the sparse flow packet size and n is the bulk flow packet size - regardless of the number of bulk flows.

The only pieces of information needed to do this are the wire time of its own packet and that of a bulk-flow packet. Which really means all it needs to know is the physical link (or shaper) rate; not too difficult to guess. And especially at low rates, the timing doesn't even need to be that precise (ms granularity).

Going to try and prototype a packet generator that can exploit the above; if it's as straightforward as I think, we really need to fix it :)

This is exactly the sort of unintended effect that I warned against earlier.

It is not 'unintended', it's a deliberate tradeoff; what I proposed above is literally what fq_codel does to protect against starvation:

https://elixir.bootlin.com/linux/latest/source/net/sched/sch_fq_codel.c#L308

As I said above, though, I don't think it's too hard to fix even this issue...

xnoreq commented 3 years ago

My remark wasn't about moving the flow to the end of old_flows; that's fine. It was only about reusing CAKE_SET_SPARSE_WAIT, which is not the right status. I think that's what Toke referred to as well.

It's simple to fix. Just add another state and, on enqueue, add a case for this new state that sets the state back to SPARSE. The same case also needs to be added where empty queues are deleted. I've added this to my hacks and it seems to work fine.

xnoreq commented 3 years ago

Jonathan Morton notifications@github.com writes: Whereas with the existing code you can use careful timing to get half of the link bandwidth regardless of the number of flows, like so:

Assume N bulk flows all using quantum-size packets, constantly backlogged. Flow S packet comes in, goes into the sparse rotation, is immediately dequeued. On the next dequeue, the state for flow S is cleared, and a bulk packet is dequeued and will start transmission. While that packet is dequeued, flow S sends another packet, which will be put to the head of the queue and dequeued next.

So with this, flow S can time its packets so that it is serviced on every other dequeue, which will give it a bandwidth share of s/n where s is the sparse flow packet size and n is the bulk flow packet size - regardless of the number of bulk flows.

Yeah, as mentioned earlier I think it's also easy to starve old flows: simply loop over some port or IP range and send a packet to each IP:port at a high enough rate, then repeat. That should do it.

lynxthecat commented 2 years ago

Was the fix ever implemented?

BTW is there an easy way to see which packets CAKE actually drops? I'd love to be able to 'tcpdump' the dropped packets.

tohojo commented 2 years ago

Was the fix ever implemented?

Hmm, no, guess not :/

BTW is there an easy way to see which packets CAKE actually drops? I'd love to be able to 'tcpdump' the dropped packets.

There's a kfree_skb tracepoint you can attach to. Theoretically that should allow you to inspect the packet contents as well, but you may have to write some BPF code to do it. I haven't played around with it much, but it looks like the skbtrace utility here might be able to do this (though I'm not sure it's feasible to get it running on an OpenWrt router): https://github.com/yandex-cloud/skbtrace