Dash-Industry-Forum / DASH-IF-IOP

DASH-IF Interoperability Points issue tracker and document source code

Can we be smarter than 404s? #205

Open haudiobe opened 5 years ago

haudiobe commented 5 years ago

Gentlemen,

Following last week's low latency meeting in Amsterdam, the idea of this thread is to study how we can lessen the side effects of client-side timing problems on service logs. Indeed, aggressive or misaligned players requesting segments in the future generate a lot of 404 errors, which makes it difficult for service providers to isolate real errors from noise in the logs. The origin might also indirectly be the source of the 404s in the case of chunked CMAF content, if it adds an extra buffer on top of the packager's AvailabilityTimeOffset.

Assuming that the origin has some knowledge of the stream structure and can assess whether a segment is coming in the future or will never come (the case of a rogue request), I see a few options that we could leverage:

The ideal would be to engage W3C to introduce millisecond precision in time formats, but I guess that is a 10-year journey, and we need something in the short term :-)

Thoughts?

TobbeEdgeware commented 5 years ago

The text above is a great piece from Nicolas Weil at AWS, following up on the discussion from the joint DASH-IF/DVB meeting.

The only thing I don’t get is how one can specify milliseconds in the intermediate solution.

nicoweilelemental commented 5 years ago

Response Headers:
Retry-After: 2018-09-25T11:11:44.715Z
Cache-Control: max-age=0, no-cache, no-store
ETag: 1537873904715

We can also put the millisecond date in the response body, in addition to the 'Retry-After' header.

All requests should be coming back to the origin, but the use of ETag could allow CDNs to do a lightweight revalidation. That's a point to verify.
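
As a rough illustration, an origin could build such a response along these lines (a Python sketch; the function is hypothetical, the values mirror the example above, and note that an ISO date in Retry-After is not standard HTTP):

```python
# A minimal sketch of the proposed early-request response. The ISO timestamp
# in Retry-After follows the example above, not the HTTP specification.
from datetime import datetime, timezone

def early_request_headers(availability_time: datetime) -> dict:
    """Build response headers telling the client exactly when to retry."""
    # ISO 8601 with millisecond precision, e.g. 2018-09-25T11:11:44.715Z
    retry_at = availability_time.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    # Epoch milliseconds double as a cheap revalidation token for CDNs.
    etag = str(int(availability_time.timestamp() * 1000))
    return {
        "Retry-After": retry_at,
        "Cache-Control": "max-age=0, no-cache, no-store",
        "ETag": etag,
    }

print(early_request_headers(
    datetime(2018, 9, 25, 11, 11, 44, 715000, tzinfo=timezone.utc)))
```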

poolec commented 5 years ago

For sub-second delays, what advantage do you see of this approach over and above having the origin just accept the request with a 200 and begin a chunked transfer, waiting for the first chunk to become available? A delay of less than a second is similar to the likely delay between chunks of a segment.

In either case (doing that, or sending a non-200 response) the origin still needs to know about the segment being requested, in the non-200 case to indicate when the request could be retried.

If a client requests a segment much earlier, this approach wouldn't work very well but I guess I'm not seeing a big problem for sub-second hold-ups.

wilaw commented 5 years ago

I find the proposals for caching something for one second to be quite fragile. They seem to hark back to Smooth Streaming days, when the segment duration was always 2s. Caching a 1s segment for 1s would be very detrimental to overall latency.

I would like to counter-propose that we do not invent new response codes and instead go with something simpler. At a given point in time, a segment is either available (200) or not available (404). We should keep this clear signaling but add in some timing information via response headers. Smart players and CDNs can then use this timing data to improve their functionality.

Origins have the option of holding a chunked-transfer response open if they know that the data will shortly be available (per Chris's comment above). How long they choose to do this is a function of the segment duration (SD) and the origin's ability to handle concurrent connections. I would suggest that SD/2 is a reasonable period to wait for data, but we should not enforce this and instead leave it up to the origin to configure.
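
A minimal sketch of that hold-open decision, assuming the origin can look up a segment's availability time (the SD/2 cap is only the suggested default, not mandated):

```python
# Sketch of the origin-side hold-open logic described above; not a full server.
import time

SEGMENT_DURATION = 2.0            # seconds; stream-specific
MAX_HOLD = SEGMENT_DURATION / 2   # suggested default, left configurable

def handle_segment_request(availability_time: float) -> int:
    """Decide the HTTP status for a request arriving around availability time."""
    wait = availability_time - time.time()
    if wait <= 0:
        return 200        # segment already available: serve it
    if wait > MAX_HOLD:
        return 404        # too far in the future: reject rather than hold
    time.sleep(wait)      # hold the connection open until the data exists,
    return 200            # then begin the chunked-transfer response
```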

I’d also like us to move to a future where all segment responses (200 and 404) contain a standardized response header which indicates the earliest wall-clock time at which that segment would have been available at the origin. This has the following benefits:

  1. For clients requesting too early, it tells them how much they should correct their timing model in order to correctly time the request for the segment.
  2. For clients requesting too late, it tells them how much they could reduce their latency.
  3. It allows us to map latency across the distribution chain, which is the first step in controlling and removing latency from the system.

A client starting playback could make a HEAD request for what it thinks is the latest available segment. It could use the response header to figure out the delta between its own timing and that of the origin and then make well-timed GET requests for all subsequent segments.
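
A minimal client-side sketch of that probe, assuming the header carries an ISO millisecond timestamp (the name x-dash-originavailtime is borrowed from later in this thread; nothing is standardized yet):

```python
# Sketch: measure the offset between the client's timing model and the origin.
import urllib.request
from datetime import datetime

def measure_clock_delta(segment_url: str, expected_availability: datetime) -> float:
    """Seconds by which the client's timing model is off versus the origin.

    expected_availability must be timezone-aware (UTC).
    """
    req = urllib.request.Request(segment_url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        raw = resp.headers["x-dash-originavailtime"]  # e.g. 2018-12-04T19:34:12.277Z
    origin_avail = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    # Positive result: the segment became available later than the client
    # assumed, so subsequent GETs should be delayed by this amount.
    return (origin_avail - expected_availability).total_seconds()
```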

In addition, a CDN edge server will want to protect the origin from the flood of 404s. It could do this by using the timing data coming back in the first 404 response to intelligently adjust the TTL of the cached 404 response, so that it serves 404s without going back to the origin until the moment at which the content becomes available.
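
The edge-side TTL calculation could be as simple as this (a sketch, again assuming the hypothetical availability-time header on the origin's 404):

```python
# Sketch: derive the negative-caching TTL for a 404 from the origin's header.
import time
from datetime import datetime

def negative_cache_ttl(origin_404_headers: dict) -> float:
    """Seconds the edge may serve the cached 404 without contacting the origin."""
    raw = origin_404_headers.get("x-dash-originavailtime")
    if raw is None:
        return 1.0   # fallback: a conventional short negative-caching TTL
    avail = datetime.fromisoformat(raw.replace("Z", "+00:00")).timestamp()
    return max(0.0, avail - time.time())
```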

Cheers Will


nicoweilelemental commented 5 years ago

It makes sense, Will, but the initial problem statement is to get rid of all the 404s generated at the edge by early requests. That's why the 202 was an interesting alternative. If a CDN can intelligently adjust a caching TTL on a 202 as it could on a 404 (and cache with millisecond granularity rather than seconds), then we're done.

wilaw commented 5 years ago

@nicoweilelemental - couple of comments:

  1. The Retry-After and Date headers cannot carry millisecond precision. Per https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Date, the seconds field is a 2-digit integer, so sub-second values cannot be expressed.
  2. I also want to challenge the notion that our goal should be to "... get rid of all 404s generated at the edge by early requests". In correct operation, those 404s would not be there. They indicate a timing mismatch between origin and client, so their presence is a signal that something in the system is not correct. Changing them to 202s when they are "close enough" is a bit like sweeping dust under the carpet and declaring the room clean.

nicoweilelemental commented 5 years ago

  1. Not a problem: we can instead create a custom header like x-dash-retry with millisecond precision. That's not going to break anything but will still do the job. This is also how we would implement your interesting suggestion to "indicate the earliest wall-clock time at which that segment would have been available at the origin".

  2. I agree that we should aim for perfection, but the fact is that we won't be able to fix all the DASH player implementations in the wild. At the same time, from a backward-compatibility standpoint, it's true that transforming most of the 404s into 202s would probably cause problems with existing implementations, as players won't understand what to do with a 202 augmented with an x-dash-retry header. The only solution to isolate false-positive 404s would then be to filter them when logging at the edge (a rough sketch follows).
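
For instance, the edge could classify 404 log entries along these lines (Python; the log-entry shape and the presence of the x-dash-retry header on early-request responses are assumptions from this thread, nothing standardized):

```python
# A minimal sketch of edge-side log filtering, assuming early-request 404s
# carry the hypothetical x-dash-retry header while genuine misses do not.
def classify_404(log_entry: dict) -> str:
    """Label a 404 log entry as timing noise or a genuine error."""
    if log_entry["status"] != 404:
        return "not-a-404"
    if "x-dash-retry" in log_entry.get("response_headers", {}):
        return "early-request"   # timing noise: keep out of the error dashboard
    return "genuine-error"       # rogue URL, missing content, packager failure

entry = {"status": 404, "response_headers": {"x-dash-retry": "2018-09-25T11:11:44.715Z"}}
print(classify_404(entry))       # -> early-request
```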

Assuming we keep returning 404s, here is a recap of the ideas so far:

haudiobe commented 5 years ago

Discussion during Live TF call, on headers:

On the second aspect:

haudiobe commented 5 years ago

Discussion during Live TF call:

- Complementary discussion in SVA; Ali tried to link, not sure we need to merge this at this stage
- SVA: How should a player follow redirection? Does it map to our guidelines?

acbegen commented 5 years ago

Ori in the Open Caching working group is leading the discussion in SVA. I asked him to update this thread here so we can avoid duplicate work and create a single guidelines document.

orifinkelman-zz commented 5 years ago

In the discussion held in HTTPbis (IETF), we have reached the following conclusions:

nicoweilelemental commented 5 years ago

Two different topics are blended on this page; I would suggest moving the HTTP redirection one to a different issue, if necessary.

As regards the initial topic, we came to the conclusion that we need to add a guideline and a recommendation. They could look like this:

1. Guideline: When responding to a segment request with a 200 return code, an origin shall add an x-dash-originavailtime header whose value is the timestamp at which the segment actually became available for consumption on the origin, using the [TBD] timestamp format with millisecond precision. Clients could then use this information to adjust the timing of their subsequent requests, and therefore avoid generating further 404s because of wall-clock misalignment (see the sketch after the notes below).

Note: we have two options for the timestamp format, ISO and unix timestamp; we need to discuss whether we allow both or only one of them.
- ISO example (https://time.akamai.com/?iso&ms): 2018-12-04T19:34:12.277Z
- Unix timestamp example (https://time.akamai.com/?ms): 1543952391.386

2. Recommendation: In order to compensate for [PROBLEM(s)], an origin shall respond with a 200 and keep the connection with the CDN open for half of the segment duration, even if the segment is not yet available on the origin, instead of responding with a 404.

Note: I'm not fully convinced that half a segment duration is the right value to recommend. If the main objective is just to compensate for wall-clock misalignments, then the value may be quite high. If the objective is to compensate for the variability of segment durations (which the current wording seems to indicate), then the motivation is slightly different and the value is correct. However, a side effect of this recommendation is that an attacker could generate a lot of segment requests and open many pending connections on the origin side. This security consideration shall be discussed before we finalize the recommendation.
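
To illustrate the client side of the guideline, here is a minimal Python sketch that accepts either of the two candidate timestamp formats and derives the correction to apply to subsequent request times (names and fallback behaviour are illustrative only):

```python
# Sketch: client-side use of the proposed x-dash-originavailtime header.
from datetime import datetime, timezone

def parse_origin_avail(raw: str) -> datetime:
    """Accept either candidate format from the note above."""
    try:
        # unix timestamp, e.g. 1543952391.386
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        # ISO 8601, e.g. 2018-12-04T19:34:12.277Z
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))

def request_offset(header_value: str, computed_availability: datetime) -> float:
    """Seconds to add to the client's computed availability times (may be negative).

    computed_availability must be timezone-aware (UTC).
    """
    return (parse_origin_avail(header_value) - computed_availability).total_seconds()
```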

poolec commented 5 years ago

Regarding the proposed guideline, I have some concerns. Two particular issues are:

I think the client needs to base its request timing on the calculated segment availability time. Content providers will arrange this to be a time when clients can safely request the segments and they shouldn't request early.

If the client drifts, there is an issue. We have UTCTiming elements to help there, but maybe more guidance on client implementation is needed. However, if there is clock drift, I can't immediately see how signalling an origin availability time on individual segments helps: it's still a UTC time that the client needs to interpret, and if its clock is wrong, isn't it still going to do the wrong thing?

Regarding the recommendation, I think this kind of thing is good for low latency services, particularly with chunked responses. Considering the 404 caching issue above, I wonder if one part of the recommendation should address that specifically, e.g.:

To allow for caching of origin 404 responses (normally necessary to protect the origin against clients requesting early) whilst not introducing additional latency, an origin server for a low latency service should respond with a 200 response at least 404_TO seconds prior to the published chunk or segment availability time, keeping the connection open until the payload becomes available, where 404_TO is the period of time for which an origin 404 response may be cached in the CDN.
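
A rough sketch of that rule, assuming the origin knows both the published availability time and the CDN's 404 caching period (all names are illustrative, nothing standardized):

```python
# Sketch: switch from 404 to a held-open 200 once within 404_TO of availability.
import time

NEGATIVE_CACHE_404_TO = 1.0   # seconds a CDN may cache an origin 404

def origin_status(published_availability: float) -> int:
    """200 (hold open) once within 404_TO of availability, else 404."""
    if time.time() >= published_availability - NEGATIVE_CACHE_404_TO:
        return 200   # accept and hold the connection until the payload exists
    return 404       # early enough that a cached 404 cannot add latency
```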

Other similar recommendations could be written linked to variability in segment or chunk publication, but always written in terms of the availability time that the client is told about and specific properties that might cause the variability.