Dash-Industry-Forum / DASH-IF-IOP

DASH-IF Interoperability Points issue tracker and document source code

Can we be smarter than 404s? #205

Open haudiobe opened 5 years ago

haudiobe commented 5 years ago

Gentlemen,

Following last week's low latency meeting in Amsterdam, the idea of this thread is to study how we can lessen the side effects of client-side timing problems on service logs. Indeed, aggressive or misaligned players requesting segments in the future generate a lot of 404 errors, which makes it difficult for service providers to isolate real errors from noise in the logs. The origin might also indirectly be the source of the 404s in the case of chunked CMAF content, if it adds an extra buffer on top of the packager's AvailabilityTimeOffset.

Assuming that the origin has some knowledge of the stream structure and can assess whether a segment is coming in the future or will never come (the case of a rogue request), I see a few options that we could leverage:

The ideal would be to engage W3C to introduce millisecond precision in time formats, but I guess that is a 10-year journey, and we need something in the short term :-)

Thoughts?

TobbeEdgeware commented 5 years ago

The text above is a great piece from Nicolas Weil at AWS, following up on the discussion from the joint DASH-IF/DVB meeting.

The only thing I don’t get is how one can specify milliseconds in the intermediate solution.

nicoweilelemental commented 5 years ago

Response Headers:
Retry-After: 2018-09-25T11:11:44.715Z
Cache-Control: max-age=0, no-cache, no-store
ETag: 1537873904715

We can also put the millisecond date in the response body, in addition to the 'Retry-After' header.

All requests should be coming back to the origin, but the use of ETag could allow CDNs to do a lightweight revalidation. That's a point to verify.
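
As a rough illustration, an origin could build such a response along these lines (a Python sketch; the function is hypothetical, the values mirror the example above, and note that an ISO date in Retry-After is not standard HTTP):

```python
# A minimal sketch of the proposed early-request response. The ISO timestamp
# in Retry-After follows the example above, not the HTTP specification.
from datetime import datetime, timezone

def early_request_headers(availability_time: datetime) -> dict:
    """Build response headers telling the client exactly when to retry."""
    # ISO 8601 with millisecond precision, e.g. 2018-09-25T11:11:44.715Z
    retry_at = availability_time.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    # Epoch milliseconds double as a cheap revalidation token for CDNs.
    etag = str(int(availability_time.timestamp() * 1000))
    return {
        "Retry-After": retry_at,
        "Cache-Control": "max-age=0, no-cache, no-store",
        "ETag": etag,
    }

print(early_request_headers(
    datetime(2018, 9, 25, 11, 11, 44, 715000, tzinfo=timezone.utc)))
```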

poolec commented 5 years ago

For sub-second delays, what advantage do you see of this approach over and above having the origin just accept the request with a 200 and begin a chunked transfer, waiting for the first chunk to become available? A delay of less than a second is similar to the likely delay between chunks of a segment.

In either case (doing that, or sending a non-200 response) the origin still needs to know about the segment being requested, in the non-200 case to indicate when the request could be retried.

If a client requests a segment much earlier, this approach wouldn't work very well but I guess I'm not seeing a big problem for sub-second hold-ups.

wilaw commented 5 years ago

I find the proposals for caching something for one second to be quite fragile. They seem to hark back to Smooth Streaming days, when the segment duration was always 2s. Caching a 1s segment for 1s would be very detrimental to overall latency.

I would like to counter-propose that we do not invent new response codes and instead go with something simpler. At a given point in time, a segment is either available (200) or not available (404). We should keep this clear signaling but add in some timing information via response headers. Smart players and CDNs can then use this timing data to improve their functionality.

Origins have the option of holding a chunked-transfer response open if they know that the data will shortly be available (per Chris's comment above). How long they choose to do this is a function of the segment duration (SD) and the origin's ability to handle concurrent connections. I would suggest that SD/2 is a reasonable period to wait for data, but we should not enforce this and instead leave it up to the origin to configure.
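
A minimal sketch of that hold-open decision, assuming the origin can look up a segment's availability time (the SD/2 cap is only the suggested default, not mandated):

```python
# Sketch of the origin-side hold-open logic described above; not a full server.
import time

SEGMENT_DURATION = 2.0            # seconds; stream-specific
MAX_HOLD = SEGMENT_DURATION / 2   # suggested default, left configurable

def handle_segment_request(availability_time: float) -> int:
    """Decide the HTTP status for a request arriving around availability time."""
    wait = availability_time - time.time()
    if wait <= 0:
        return 200        # segment already available: serve it
    if wait > MAX_HOLD:
        return 404        # too far in the future: reject rather than hold
    time.sleep(wait)      # hold the connection open until the data exists,
    return 200            # then begin the chunked-transfer response
```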

I’d also like us to move to a future where all segment responses (200 and 404) contain a standardized response header which indicates the earliest wall-clock time at which that segment would have been available at the origin. This has the following benefits:

  1. For clients requesting too early, it tells them how much they should correct their timing model in order to correctly time the request for the segment.
  2. For clients requesting too late, it tells them how much they could reduce their latency.
  3. It allows us to map latency across the distribution chain, which is the first step in controlling and removing latency from the system.

A client starting playback could make a HEAD request for what it thinks is the latest available segment. It could use the response header to figure out the delta between its own timing and that of the origin and then make well-timed GET requests for all subsequent segments.
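
A minimal client-side sketch of that probe, assuming the header carries an ISO millisecond timestamp (the name x-dash-originavailtime is borrowed from later in this thread; nothing is standardized yet):

```python
# Sketch: measure the offset between the client's timing model and the origin.
import urllib.request
from datetime import datetime

def measure_clock_delta(segment_url: str, expected_availability: datetime) -> float:
    """Seconds by which the client's timing model is off versus the origin.

    expected_availability must be timezone-aware (UTC).
    """
    req = urllib.request.Request(segment_url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        raw = resp.headers["x-dash-originavailtime"]  # e.g. 2018-12-04T19:34:12.277Z
    origin_avail = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    # Positive result: the segment became available later than the client
    # assumed, so subsequent GETs should be delayed by this amount.
    return (origin_avail - expected_availability).total_seconds()
```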

In addition, a CDN edge server will want to protect the origin from the flood of 404s. It could do this by using the timing data coming back in the first 404 response to intelligently adjust the TTL of the cached 404 response, so that it serves 404s without going back to the origin until the moment at which the content becomes available.
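
The edge-side TTL calculation could be as simple as this (a sketch, again assuming the hypothetical availability-time header on the origin's 404):

```python
# Sketch: derive the negative-caching TTL for a 404 from the origin's header.
import time
from datetime import datetime

def negative_cache_ttl(origin_404_headers: dict) -> float:
    """Seconds the edge may serve the cached 404 without contacting the origin."""
    raw = origin_404_headers.get("x-dash-originavailtime")
    if raw is None:
        return 1.0   # fallback: a conventional short negative-caching TTL
    avail = datetime.fromisoformat(raw.replace("Z", "+00:00")).timestamp()
    return max(0.0, avail - time.time())
```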

Cheers Will


nicoweilelemental commented 5 years ago

It makes sense, Will, but the initial problem statement is to get rid of all the 404s generated at the edge by early requests. That's why the 202 was an interesting alternative. If a CDN can intelligently adjust a caching TTL on a 202 as it could on a 404 (and cache with millisecond granularity rather than seconds), then we're done.

wilaw commented 5 years ago

@nicoweilelemental - couple of comments:

  1. The Retry-After and Date headers cannot carry millisecond precision. Per https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Date, the seconds field is a 2-digit integer, so sub-second values cannot be expressed.
  2. I also want to challenge the notion that our goal should be to "... get rid of all 404s generated at the edge by early requests". In correct operation, those 404s would not be there. They indicate a timing mismatch between origin and client, so their presence is a signal that something in the system is not correct. Changing them to 202s when they are "close enough" is a bit like sweeping dust under the carpet and declaring the room clean.

nicoweilelemental commented 5 years ago

  1. Not a problem: we can instead create a custom header like x-dash-retry with millisecond precision. That's not going to break anything but will still do the job. This is also how we would implement your interesting suggestion to "indicate the earliest wall-clock time at which that segment would have been available at the origin".

  2. I agree that we should aim for perfection, but the fact is that we won't be able to fix all the DASH player implementations in the wild. At the same time, from a backward-compatibility standpoint, it's true that transforming most of the 404s into 202s would probably cause problems with existing implementations, as players won't understand what to do with a 202 augmented with an x-dash-retry header. The only solution to isolate false-positive 404s would then be to filter them when logging at the edge (a rough sketch follows).
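
For instance, the edge could classify 404 log entries along these lines (Python; the log-entry shape and the presence of the x-dash-retry header on early-request responses are assumptions from this thread, nothing standardized):

```python
# A minimal sketch of edge-side log filtering, assuming early-request 404s
# carry the hypothetical x-dash-retry header while genuine misses do not.
def classify_404(log_entry: dict) -> str:
    """Label a 404 log entry as timing noise or a genuine error."""
    if log_entry["status"] != 404:
        return "not-a-404"
    if "x-dash-retry" in log_entry.get("response_headers", {}):
        return "early-request"   # timing noise: keep out of the error dashboard
    return "genuine-error"       # rogue URL, missing content, packager failure

entry = {"status": 404, "response_headers": {"x-dash-retry": "2018-09-25T11:11:44.715Z"}}
print(classify_404(entry))       # -> early-request
```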

Assuming we keep returning 404s, here is a recap of the ideas so far:

haudiobe commented 5 years ago

Discussion during Live TF call, on headers:

On the second aspect:

haudiobe commented 5 years ago

Discussion during Live TF call:

- Complementary discussion in SVA; Ali tried to link, not sure we need to merge this at this stage
- SVA: How should a player follow redirection? Does it map to our guidelines?

acbegen commented 5 years ago

Ori in the Open Caching working group is leading the discussion in SVA. I asked him to update this thread here so we can avoid duplicate work and create a single guidelines document.

orifinkelman-zz commented 5 years ago

In the discussion held in HTTPbis (IETF), we have reached the following conclusions:

nicoweilelemental commented 5 years ago

Two different topics are blended on this page; I would suggest moving the HTTP redirection one to a different issue, if necessary.

As regards the initial topic, we came to the conclusion that we need to add a guideline and a recommendation. They could look like this:

1. Guideline: When responding to a segment request with a 200 return code, an origin shall add an x-dash-originavailtime header whose value is the timestamp at which the segment actually became available for consumption on the origin, using the [TBD] timestamp format with millisecond precision. Clients could then use this information to adjust the timing of their subsequent requests, and therefore avoid generating further 404s because of wall-clock misalignment (see the sketch after the notes below).

Note: we have two options for the timestamp format, ISO and unix timestamp; we need to discuss whether we allow both or only one of them.
- ISO example (https://time.akamai.com/?iso&ms): 2018-12-04T19:34:12.277Z
- Unix timestamp example (https://time.akamai.com/?ms): 1543952391.386

2. Recommendation: In order to compensate for [PROBLEM(s)], an origin shall respond with a 200 and keep the connection with the CDN open for half of the segment duration, even if the segment is not yet available on the origin, instead of responding with a 404.

Note: I'm not fully convinced that half a segment duration is the right value to recommend. If the main objective is just to compensate for wall-clock misalignments, then the value may be quite high. If the objective is to compensate for the variability of segment durations (which the current wording seems to indicate), then the motivation is slightly different and the value is correct. However, a side effect of this recommendation is that an attacker could generate a lot of segment requests and open many pending connections on the origin side. This security consideration shall be discussed before we finalize the recommendation.
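
To illustrate the client side of the guideline, here is a minimal Python sketch that accepts either of the two candidate timestamp formats and derives the correction to apply to subsequent request times (names and fallback behaviour are illustrative only):

```python
# Sketch: client-side use of the proposed x-dash-originavailtime header.
from datetime import datetime, timezone

def parse_origin_avail(raw: str) -> datetime:
    """Accept either candidate format from the note above."""
    try:
        # unix timestamp, e.g. 1543952391.386
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        # ISO 8601, e.g. 2018-12-04T19:34:12.277Z
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))

def request_offset(header_value: str, computed_availability: datetime) -> float:
    """Seconds to add to the client's computed availability times (may be negative).

    computed_availability must be timezone-aware (UTC).
    """
    return (parse_origin_avail(header_value) - computed_availability).total_seconds()
```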

poolec commented 5 years ago

Regarding the proposed guideline, I have some concerns. Two particular issues are:

I think the client needs to base its request timing on the calculated segment availability time. Content providers will arrange this to be a time when clients can safely request the segments and they shouldn't request early.

If the client drifts, there is an issue. We have UTCTiming elements to help there, but maybe more guidance on client implementation is needed. However, if there is clock drift, I can't immediately see how signalling an origin availability time on individual segments helps: it's still a UTC time that the client needs to interpret, and if its clock is wrong, isn't it still going to do the wrong thing?

Regarding the recommendation, I think this kind of thing is good for low latency services, particularly with chunked responses. Considering the 404 caching issue above, I wonder if one part of the recommendation should address that specifically, e.g.:

To allow for caching of origin 404 responses (normally necessary to protect the origin against clients requesting early) whilst not introducing additional latency, an origin server for a low latency service should respond with a 200 response at least 404_TO seconds prior to the published chunk or segment availability time, keeping the connection open until the payload becomes available, where 404_TO is the period of time for which an origin 404 response may be cached in the CDN.
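
A rough sketch of that rule, assuming the origin knows both the published availability time and the CDN's 404 caching period (all names are illustrative, nothing standardized):

```python
# Sketch: switch from 404 to a held-open 200 once within 404_TO of availability.
import time

NEGATIVE_CACHE_404_TO = 1.0   # seconds a CDN may cache an origin 404

def origin_status(published_availability: float) -> int:
    """200 (hold open) once within 404_TO of availability, else 404."""
    if time.time() >= published_availability - NEGATIVE_CACHE_404_TO:
        return 200   # accept and hold the connection until the payload exists
    return 404       # early enough that a cached 404 cannot add latency
```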

Other similar recommendations could be written linked to variability in segment or chunk publication, but always written in terms of the availability time that the client is told about and specific properties that might cause the variability.