Support cached data clean up for published names

suhasHere commented 2 years ago

This probably needs the following

Support protocol api that will indicate to caches that publisher explicitly closes the connection
Have such indicate to cleanup names and associated cached data in the relay

huitema commented 2 years ago

What happen if a connection breaks, and the publisher a few seconds later establishes a new one?

We can partially mitigate that with an "explicit close" message -- indeed, the current FIN message has very much that meaning. If the publisher has sent a FIN message, and then the publisher closes the connection, the relays will know that there is nothing more to receive. They could add a "time of fin" property, and then purge caches when that time is too old.

Of course, there is no explicit support for re-establishing publisher connections in the current code. That would be a separate issue. As long as there is no such support, the loss of the connection is an implicit "FIN", and can be treated as such. Once we have such support, the loss of the connection should start another timer, long enough to wait for the new connection. If the connection does not happen.

In a full system, we might also see the client setup the new connection through a different relay. The previous distribution tree and the new one will intersect at some common relay, at worst at the origin. The distribution tree should be updated, etc. That too would be a separate issue.

So maybe we should:

Treat the FIN message as a permission to clean the cache once all messages have been propagated.
Treat the loss of the feeder connection as a starting a longer timer, and then treat it as if FIN had been received.
If there are active readers, and if the FIN message has not been sent to these readers, send it before cleaning the cache.
Maybe, add a "close reason" attribute to the FIN message. (Could be a separate issue.)

suhasHere commented 2 years ago

Agree with all the points .. how about we group it this way ?

Publisher sending FIN message or closing the source context, along with timer value (can be zero as well, implying immediate cleanup) -> will start a timer and clears the caches after the timer expiry
On detecting the connection loss, - you proposal seems totally fine. Also if we allow for the publisher to reappear before the timer expires. [ This is assuming that there is authz complete and for now we can skip over it ]

Also agree that having a close reason will be handy for the subscribers to know what happened.

huitema commented 2 years ago

Do we want the client to set the timer for the network? I am a bit concerned about the increased attack surface. What if the client specified very long timers and cause exhaustion of the relay resource? Logically, I would prefer these parameters to come down from the origin. Something like:

1) Client POST origin/example.video 2) Relay receives that, open its own POST to the origin. 3) Origin sends ACCEPT to the relay, specifies timer, etc. 4) Relay sends ACCEPT to the client, notify client with value of reconnect timer. 5) media flows from client to relay 6) relay caches the media, forwards to origin 7) origin caches the media.

In a production version (not a prototype), the POST would include a security token that is passed all the way to the client to verify, and the relay would not accept the media from the client without that.

Of course, this is not quite what is implemented today. The relay does not wait for an origin response to start accepting the media, there is no lifetime parameter yet. So we don't have a real negotiation. I would just use a constant for now.

suhasHere commented 2 years ago

I should have been clear above, what the client provides is an hint and it can always be overriden and we need a max-lifetime constant that should be very similar to what we do in shallow caching today. A Client's value can never go beyond that.

On the security and authz, yes those details wil need to be production. The QuicR spec does touch upon how and when caches are allowed to accept the publish data : https://www.ietf.org/id/draft-jennings-moq-quicr-proto-00.html#name-publish_intent-and-publish_

However, there are few more details needs to be worked out as we build the prototype.

TimEvens commented 1 year ago

From a media perspective, we run into many challenges with IP mobility, where both producer (sender) and subscriber (receiver) are changing networks. Media today handles this via RTP and IP mobility (handled by the client and servers at the app level). IMO, we need something to resume an existing session (subscriber and/or producer) via a different IP/port/etc. flows. For example, switching from Verizon 5g to WIFI. Normally this involves auth and a session ID to facilitate transitioning to different IP flows. A connection level FIN will likely be after the network change (e.g., Verizon 5g --> Wifi). The client (producer or subscriber) may not be aware of the change till it's already happened. A graceful FIN may not be seen. Instead, it'll be a reconnect to resume, if at all. If we don't have a session ID, the reconnect will result in an additional cache and/or conflict with previous flow(s).

I'm leaning towards producer cache expiring purely based on TTL of messages or just FIFO and queue length. The caches can be updated real-time with the active session ID that has been authorized. A producer on reconnect would of course have to authz before it can send anything, otherwise hello DoS. Circular buffer queues are great here because it reduces the requirement to have garbage maintenance to purge out old data in cache.

Subscribers are a bit different. Subscriptions could be enhanced to be faster with cached data based on previous session. For example, subscriber changes network providers. It reconnects and authorizes its flow via the same session id. That session then inherits all previous subscriptions and resumes where it left off, unless it's greater than subscription max replay messages value (e.g., tail number). This will directly improve mobility by not requiring re-subscription while also keeping track of session ID cache position. Again, if the cache position for a subscriber is larger than subscribed tail back value, it would be moved forwarded and messages between disconnect and reconnect will be dropped to that subscriber. For example, say I'm on Verizon 5g and then lose signal for 5 minutes and reconnect using Comcast. I can resume my session, but the replay of 5 minutes is too great for some of the cached named flows. The flows that are too great will simply replay up to tail number. Not all names/queues/streams are equal, so some will still be able to resume the past 5 minutes. This is a bit tricky where time-sync between names/flows are not maintained. Many folks are of the opinion that it's too much for the relay to be involved with application time-sync between flows.. Instead leave it to the apps to deal with it and have the relays be super high performing with shallow cache.

Deep cache, or even persistent cache, can be offloaded.

huitema commented 1 year ago

@TimEvens the prototype pretty much manages the cache based on TTL. Group of blocks are removed from the cache some time T after they were received. That time was set to 2 minutes in the previous builds, but we just changed that to 10 seconds, which should improve the memory footprint a bit.

The decision to purge the cache or not is per media source -- a flag set when publishing the media. This is clearly a prototype thing: we want to be able to experiment with a variety of options. In "production", I would expect that to have a different effect:

if a media is "real time", purge media frames based on TTL
if a media is streaming, purge only if the memory is running short. In that case, keep a short cache for serving current connections, but stop advertising that cache as available for new connections. We might consider a second level persistent cache, but for a prototype that's complicated.

The cache purge is per-GOB. That may mean keeping several seconds of video, which requires memory. We do that because if we restart video from arbitrary points, users will likely see a mosaic of squares until the next GOB. We may want to purge all frames based on a shorter TTL, but if we want to avoid the mosaics of squares that requires changing the logic so that new client can only start at the next GOB, which may mean waiting a couple seconds. Or maybe we introduce an end-to-end signal to ask for "please rev a new GOB asap".

On the reconnect issue, we need to develop scenarios and test them. We have a subscribe intent option to "reconnect at GOB X, Object Y", which would do pretty much what you describe. But until we have tested it, we won't know whether it works as expected.

huitema commented 1 year ago

PR #112 set the cache clean up timer to 10 seconds. This appear to have fixed the original issue.

Quicr / old-quicrq

Support cached data clean up for published names #79