apache / trafficserver

Apache Traffic Server™ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0
1.82k stars 807 forks source link

negative_revalidating_lifetime seems not respected #7425

Closed lukenowak closed 3 years ago

lukenowak commented 3 years ago

Having configuration:

CONFIG proxy.config.http.negative_revalidating_enabled INT 1
CONFIG proxy.config.http.negative_revalidating_lifetime INT 10

I have such scenario:

It happens with 7.1.11, 8.1.0 and 8.1.1.

Note: Same happens if origin is just down, where I'd expect 5xx code from trafficserver after waiting for enough of time.

lukenowak commented 3 years ago

With:

CONFIG proxy.config.http.negative_revalidating_enabled INT 0

I start to got 503 replies after waiting for max-age time to pass, so it seems generally the behaviour is impacted by enabling that option.

fdiary commented 3 years ago

@lukenowak According to the source code, ATS can return a stale cache if is_stale_cache_response_returnable() returns true, that refers proxy.config.http.cache.max_stale_age but does NOT refer proxy.config.http.negative_revalidating_lifetime.

https://github.com/apache/trafficserver/blob/8.1.1/proxy/http/HttpTransact.cc#L5741

But I am not sure the difference between the purposes of these two variables.

masaori335 commented 3 years ago

Descriptions of these configs seem ambiguous when and which config should be used.

As @fdiary pointed out, max_stale_age is used to check the age of cached contents. https://github.com/apache/trafficserver/blob/08fe521a3974a05b01545c68fe1bcd5162dd4fc4/proxy/http/HttpTransact.cc#L5925 OTOH, negative_revalidating_lifetime is only used to set the Expires header of stale contents. https://github.com/apache/trafficserver/blob/08fe521a3974a05b01545c68fe1bcd5162dd4fc4/proxy/http/HttpTransact.cc#L4350 If we take the doc literary, it seems like we should refer to negative_revalidating_lifetime too when the negative_revalidating is enabled.

lukenowak commented 3 years ago

I have some cases (note: I am using version patched with https://github.com/apache/trafficserver/pull/7422)

Cases

Case 1 - origin returns 502, negative_revalidating_enabled = 0

negative_revalidating_enabled = 0 negative_revalidating_lifetime = 0 max_stale_age = 30

  1. configure origin to reply 200 with max-age = 15
  2. GET results with 200 from origin
  3. sleep 2
  4. configure origin to reply 502 with max-age = 15
  5. GET results with 200 from origin
  6. sleep 15 (so it is 2 seconds more > max-age < max_stale_age)
  7. GET results with 502 from origin, I see response being updated as max-age passes (due to patch)

This case, TrafficServer does not simulate "stale-if-error", expected.

Setting negative_revalidating_lifetime to some value impacts nothing.

Case 2 - origin not available anymore, negative_revalidating_enabled = 0

negative_revalidating_enabled = 0 negative_revalidating_lifetime = 0 max_stale_age = 30

  1. configure origin to reply 200 with max-age = 15
  2. GET results with 200 from origin
  3. sleep 2
  4. origin stopped
  5. GET results with 200 from origin
  6. sleep 15 (so it is 2 seconds more > max-age < max_stale_age)
  7. GET results with 200 from ATS with Warning: 111 ApacheTrafficServer/8.1.1
  8. sleep 15+2 (so it is > max_stale_age)
  9. 502 from Apache Traffic Server

This case, TrafficServer simulates a bit "stale-if-error", but only when backend is down (replies nothing).

Setting negative_revalidating_lifetime to some value impacts nothing.

Case 3 - origin returns 502, negative_revalidating_enabled = 1

negative_revalidating_enabled = 1 negative_revalidating_lifetime = 0 max_stale_age = 30

  1. configure origin to reply 200 with max-age = 15
  2. GET results with 200 from origin
  3. sleep 2
  4. configure origin to reply 502 with max-age = 15
  5. GET results with 200 from TrafficServer
  6. sleep 15 (so it is 2 seconds more, > max-age < max_stale_age)
  7. GET results with 200 from TrafficServer
  8. sleep 15+2 (so it is > max_stale_age)
  9. GET results with 502 from the origin, I see response being updated as max-age passes (due to patch)

In this case, TrafficServer simulates "stale-if-error", expected. max_stale_age is the switch to control the time frame of the feature.

Setting negative_revalidating_lifetime to some value impacts case of step 7 with Expires header becoming Date + negative_revalidating_lifetime, but in step 9 there is no more Expires header.

Case 4 - origin not available anymore, negative_revalidating_enabled = 1

negative_revalidating_enabled = 1 negative_revalidating_lifetime = 0 max_stale_age = 30

  1. configure origin to reply 200 with max-age = 15
  2. GET results with 200 from origin
  3. sleep 2
  4. configure origin to reply 502 with max-age = 15
  5. GET results with 200 from TrafficServer
  6. sleep 15 (so it is 2 seconds more, > max-age < max_stale_age)
  7. GET results with 200 from ATS with Warning: 111 ApacheTrafficServer/8.1.1
  8. sleep 15+2 (so it is > max_stale_age)
  9. GET results with 502 from the origin, I see response being updated as max-age passes (due to patch)

In this case, TrafficServer simulates "stale-if-error", expected. max_stale_age is the switch to control the time frame of the feature.

Setting negative_revalidating_lifetime to some value impacts nothing

Conclusion

So currently the documentation (in records.config) could state:

Add something like "proxy.config.http.cache.max_stale_age can be used to configure how long the stale response will be given for available backend returning 5xx response."

"...during revalidation, by setting Expire: header to value of Date: + value of the option"

"...cannot be cached. Is enabled by proxy.config.http.negative_revalidating_enabled" Well, I am not sure about this, maybe such documentation shall be kept in cache-basics as some kind of scenario, anyway this is how I understood and configure TrafficServer "negative revalidation" now.

bryancall commented 3 years ago

@bneradt Can you please document this? Thank you.

bneradt commented 3 years ago

I'm working on this and adding an AuTest. In case this is helpful to others, I'll make some initial observations.

My guess at the initial intention of this code was that setting the Expires header, observed earlier in this ticket, would set the freshness lifetime to the desired value. This doesn't work however, for at least three reasons:

First, the Expires header is set to the value of negative_revalidating_lifetime in the future from the current time:

https://github.com/apache/trafficserver/blob/08fe521a3974a05b01545c68fe1bcd5162dd4fc4/proxy/http/HttpTransact.cc#L4350

If that actually was effective, then when proxy.config.http.negative_revalidating_enabled was enabled, every single cached response for a down server would always be considered fresh and served out of the cache because its Expires time would always be in the future. This is clearly not the intention. Rather, for this calculation to work, we'd want to set the Expires value to the time past the calculated age of the resource, not the offset from the current time.

The second reason this doesn't work is that, per RFC 7234 section 4.2.1, the max-age directive takes precedence over the Expires header field. Thus in all responses that has max-age, the Expires header would never even be inspected. Notice that @lukenowak 's examples all use max-age, and indeed most tests would because that's the easiest way to test freshness behavior. (Fixing this by setting the max-age directive, however, would not be a sufficient fix because of the following paragraph.)

Thirdly, by the time the Expires header is set in the code, we've already completed freshness calculations. Otherwise we wouldn't be in the code determining whether the stale response can be served. Recall that the negative revalidating feature influences behavior for stale cached responses in which the origin is unreachable. Thus at this point, setting the Expires header is too late to influence ATS's freshness calculation. We would have to set it then re-perform the freshness calculation. This is not done.

In order to make proxy.config.http.negative_revalidating_lifetime behave like it is documented, the three above issues would need to be addressed.

For the sake of completeness, if that were done, the intention is that things should work like this:

  1. A client makes a request to a server which responds with some cacheable positive response, such as a 200 OK.
  2. At a point in the future after the cached resource is stale, the client requests the resource.
  3. ATS notices that the resource is stale and attempts to revalidate it against the origin.
  4. The origin is either unreachable by ATS or it replies with some 5xx response.
  5. ATS notices that proxy.config.http.negative_revalidating_enabled is enabled.
  6. ATS recalculates the freshness of the resource using the adjusted freshness calculation as set in proxy.config.http.negative_revalidating_lifetime. If the resource is fresh enough given this adjustment, reply with it.
  7. If, after this adjusted freshness calculation, the object is still stale, then ATS will respect max_stale_age. If its staleness is less than max_stale_age it will reply with it, otherwise it will return a 5xx response.

For now, with the bug described in this ticket, step 6 is broken and never results in the object's freshness being recalculated. Thus a user can only influence negative revalidating responses via max_stale_age.

mlibbey commented 3 years ago

Thanks Brian -- fwiw, I tried a bash script a long while back https://github.com/apache/trafficserver/issues/3211 -- don't think it covered all your cases :)