boaks commented 4 years ago

Though some issues around the SessionCachehave been raised, it may be a good idea to start a discussion about this SessionCache and the intended usage.

Some related issues of the past as starting point:

948 1163

1343

From #1343: The SessionCache seems for me to be not completely "specified". I don't know, if it is assumed to be "consistent" or "eventually consistent". The "removes" in case of a missing session id in the cache seems to point to the first, but intended "remote implementations" would then slow down too much. So, FMPOV, there is even specification work left, which then may change the implementations. That will be "time consuming".

I currently don't use this SessionCache and currently I prefer to improve the connection id usage in a cluster. Therefore my plan to invest time is more to make the "connection id cluster" working, than a clustered session resumption. With that it depends on others (including you :-) ), how much time you want to spend into the SessionCache.

sbernard31 commented 4 years ago

I don't know, if it is assumed to be "consistent" or "eventually consistent".

Could you explain what you mean by "consistent" or "eventually consistent" ?

Therefore my plan to invest time is more to make the "connection id cluster" working

Could you describe what doesn't work for now ? And what would mean exactly "connection id cluster" working ? Are you talking about a share/persistent connection store ?

boaks commented 4 years ago

Could you explain what you mean by "consistent" or "eventually consistent" ?

"eventually consistent"

Generally I didn't invest too much time in understanding the SessionCache. One of the outcomes of this discussion may be, that it gets clear, which "consistent" model is used and how the connection store then uses the session cache (e.g. no remove in find).

Could you describe what doesn't work for now ?

There is simply no adoption of the DTLS-1.2-connection-ID-based-load-balancer. With that, I would like to experiment to forward the messages to the right node within scandium. e.g. check the cid, if it's not the own, forward the message to the right scandium node. Add the original source address to the udp-message, so that a java implementation is possible. That will not be that performant as a "external load balancer solution", but hopefully it works good enough to gain more attraction, more user, and so the chance to get the adoption for a real cid-load-balancer.

Are you talking about a share/persistent connection store ?

No, in my opinion the "record sequence numbers" will prevent such a implementation.

sbernard31 commented 4 years ago

forward the message to the right scandium node

I don't know if we are representative, but if one day we use connection-Id, I'm pretty sure we will choose the load-balancer solution.

About a load-balancer dedicated to CoAP/LWM2M/Scandium use cases, I'm still working on it based on XDP. It's a very young project but we begin to use it in production.

I'm currently working on ipv6 support but I still have the connectionID support in mind. (but this is not consider as a priority by Sierra Wireless for now :-/)

No license was chosen for now but we want to make it open source.

If we want to make connection-ID more accessible, maybe we also need a wireshark support.

sbernard31 commented 4 years ago

I read the "eventually consistent" but still not crystal clear to me.

Anyway my 2 cents, We have a cluster Server based on Scandium/Californium behind some LVS or SBULB load-balancer. We currently don't use SessionCache but we could use it to increase our number of abbreviated handshake (instead of full handshake). The needs would be :

share the session between nodes
persist session to keep it after failure or re-deployment.

My feeling is this is acceptable to lost few session in cache meaning that some time we will not be able to do an abbreviated handshake. About the remove, I understand it as : "This session MUST NOT be resumed anymore" (lifetime expire, credentials compromised, ...) and a remove must be safe/guaranty.

boaks commented 4 years ago

About the remove, I understand it as : "This session MUST NOT be resumed anymore" (lifetime expire, credentials compromised, ...) and a remove must be safe/guaranty.

Or, the session is not already completely transferred to the cache. In that case, a connection will be accidentally removed.

sbernard31 commented 4 years ago

Or, the session is not already completely transferred to the cache. In that case, a connection will be accidentally removed.

I was talking about my needs as user, not how it behave currently.

sbernard31 commented 4 years ago

As a user, Removing a session from the cache should only prevent to resume a session but not remove an existing connection. I talk about DTLS concept, I know DTLS concept does not really match Scandium Classes)

sbernard31 commented 3 years ago

Just to let you know, we begin to use SessionCache in a cluster environment in production. First impression is pretty good. Thanks to it more sessions can be resumed (limiting the number of full handshake).

sbernard31 commented 3 years ago

I should maybe ask this in another thread ? :think tell me if this is better.

Since we add SessionCache, we see a significant increase of CPU utilization. (from 10% cpu usage to 15%) For now we have no real explanation of this. We use setVerifyPeersOnResumptionThreshold(0)

Reading the code about SessionCache, I can not see what could generate this. I see a lot of Byte arrays.copy/clear (because of secretKey on SessionTicket.encode/decode) but I would be very surprised if this was related to that.

erikmell commented 3 years ago

Just to let you know, we begin to use SessionCache in a cluster environment in production. First impression is pretty good. Thanks to it more sessions can be resumed (limiting the number of full handshake).

Would you like to elaborate on your setup using SessionCache. We have just recently set SessionCache up for redundancy/load balancing but still have some obstacles to fix.

To survive server restart we use the ClientSessionCache. This can take pretty long to start up when there is a lot of Sessions. What strategy do you use to clean sessions? For NAT:ed devices IP/port mapping will be outdated pretty quick but connectionId could make it last longer. For fixed ip:s the lifetime can be longer. But the sessionCache does not know when the session was used, so its hard to update/clear a TTL any thoughts on this?
Normally our loadbalancer will be sticky based on ip/port, if a node(1) goes down traffic will be routed to another node(2), for a new connection using the same session this works, but ongoing connections are broken. In get() there is a log "{}connection: missing connection for{}!". I guess that node(2) would need to sync with the cache before get to support this?

boaks commented 3 years ago

but ongoing connections are broken.

TLS/DTLS is based on two stage. The first negotiates the crypto parameter and the master secret. That's called session. The second contains then the keys and the record sequence numbers, that's called connection (TLS) or association (DTLS). For DTLS, the received record sequence numbers must be kept in a "window buffer". The data in the session is negotiated on full handshakes and are long lasting. The data in the association is split into middle lasting (keys) and short lasting (record sequence numbers), they change with every exchange records.

In Californium the data is split into the DTLSSession and the DTLSConnectionState. "Historical" (and maybe changing in a future version :-) ), the DTLSSession maintains also the record sequence numbers.

Californium supports a SessionTicket ( a serialized session), but that doesn't contain the association state nor the record sequence numbers. Also some of the later added crypto parameter are missing, I hope, I can add them with a 3.0.

Currently you can just persist the session, which requires a "resumption handshake" to get the association data. To persist the association data, I consider in the future a graceful shutdown in order to be able to save the "record sequence numbers" (they must not change afterwards). But for now, it's not implemented. If you consider/plan to implement it, I would welcome a contribution. But even an experience report will be really helpful.

This can take pretty long to start up

Can you give us a impression, how many session take which time? My feeling is, this may be more related to the persistent mechanism.

erikmell commented 3 years ago

Currently you can just persist the session, which requires a "resumption handshake" to get the association data. To persist the association data, I consider in the future a graceful shutdown in order to be able to save the "record sequence numbers" (they must not change afterwards). But for now, it's not implemented. If you consider/plan to implement it, I would welcome a contribution. But even an experience report will be really helpful.

Do you think It could be feasible to experiment with loading the session from SessionCache if the get() on a failover-node does not find a local matching connection, forcing a resumption handshake?

This can take pretty long to start up

Can you give us a impression, how many session take which time? My feeling is, this may be more related to the persistent mechanism.

When I experienced problems we had about 120000 sessions. On the server it took 12+hours, but I have not tracked the problem down yet, when loading from a local copy of the DB (redis) it only took a couple of minutes so I guess the problem is within the server environment and not an californium issue.

boaks commented 3 years ago

Do you think It could be feasible to experiment with loading the session from SessionCache if the get() on a failover-node does not find a local matching connection, forcing a resumption handshake?

Your dtls-client has three possibilities:

starting a full handshake (without available valid session ID)
starting a resumption handshake (with valid session ID). The dtls-server may refuse the resumption and switch into a full handshake.
sending an application record. The dtls-server either decrypt the record successfully, or drops it silently.

I'm not sure, if you want to use a DTLS role exchange (means the server will act as dtls-client). I guess not. If not, it's not the dtls-server, which chose to resume a session, it's the dtls-client. The dtls-server can only refuse that. But even more, without a handshake from the dtls-client, the dtls-server is intended to drop that message!

The "session cache" is only involved, if a dtls-client want's to resume such a session. Otherwise the resulting session is just stored their. You chose to use the ClientSessionCache for the dtls-server. But that will only be able to restore connection with available SessionTickets. The intended usage is a dtls-client, which use that ticket to resume a session. On the dtls-server side, this doesn't work. For that dtls-server side the SessionCache is used to lookup for sessions, only when a dtls-client wants to resume such a session.

I assume, you want a scenario, where a "application record" is received by a failover-node. The would require a really "high frequently update of the association states (record sequence numbers)". FMPOV, that's not really performant. The "load" to execute a coaps request is such small, that the load for sync the states would be relevant. It may be also a wrong assumption, that other technique works like that (that is rare). Basically the most detect the failure and (re-)connect to the failover-node.

When I experienced problems we had about 120000 sessions. On the server it took 12+hours, but I have not tracked the problem down yet, when loading from a local copy of the DB (redis) it only took a couple of minutes so I guess the problem is within the server environment and not an californium issue.

I would wonder, if loading 120000 session tickets would take more than a few seconds for californium. But @sbernard31 has sure more experience on what to expect from redis.

boaks commented 3 years ago

By the way: the javadoc of the SessionCache and ClientSessionCache are misleading and not really maintained for long. Too much other work ...

sbernard31 commented 3 years ago

Some explanation about how and why we use it. We are deploying several server (Leshan-LWM2M) in cluster behind loadbalancer.

The use case is mainly about devices behind NAT (dynamic ip environment). Unfortunately, we are not using connection id, and so to resolve this issue all devices start communication by a handshake. A full one if device has no session else an abbreviated try.

Without SessionCache, session are not persisted and not shared meaning :

in case of server reboot, all "abbreviated handshake try" will turn into "full handshake"
in normal use case, you have 1/x chance (x is the number of instance in the cluster) to succeed your abbreviated handshake.

The idea to use SessionCache is just to maximize the number of abbreviated handshake (or more precisely to decrease the number of abbreviated handshake which will turn into full handshake)

I guess the benefits should be :

less resources usage at device side. (cpu ? bandwith?)
less resources usage at server side (cpu ? bandwidth?) (for bandwidth this is true mainly for RPK/x509 not so true for PSK)

On our side we did that because of some devices which behaves badly when abbreviated turn into full handshake... We expected also some cpu benefits at server side => currently this is pretty much the opposite :sweat_smile: but anyway devices are happier now.

We are using redis and implements only SessionCache(not ClientSessionCache as currently we don't care about initiated abbreviated handshake at server side). Our keys expire after a while (something like 7days but could be less in the future) without r/w access. If session is lost, it's not a big deal as we see that as an optimization. I mean the device must support the case where the session doesn't exist anymore. To avoid to create too many Session in store, when a session is created we remove previous session from the same peer. (in our case we use PSK ID as peer identifier)

Our RedisSessionCache could be added in Leshan at midterm.

boaks commented 3 years ago

=> currently this is pretty much the opposite sweat_smile

It's mainly a question, how much cpu a full handshake takes more than a resumption handshake, compared to the "remote call" of the session cache. As you wrote, it's more for the clients, than for the performance.

sbernard31 commented 3 years ago

I tried to do some tests about this. My current procedure : Using sc-dtls-example-client, I modified it to do a handshake (full or abbreviated) for each message sent. Using sc-dtls-example-server, I modified it to use RedisSessionCache (or not). I use only PSK. DTLS client, server and redis run on the same machine.

I did 4 runs for each case and ignore result of first run. Each run, I send 1 client which send 30000 messages. (each message = 1 handshake)

As result, I used message per seconds returns by sc-dtls-example-client and cpu cummulated usage (TIME) from ps command.

(I didn't find a best way for now :thinking: )

The result:

Full Handshake	message per s	cpu TIME	cpu TIME/run
1rst run	1522	00:24:00
2nd run	1835	00:36:00	00:12:00
3rd run	1765	00:50:00	00:14:00
4th run	1898	01:02:00	00:12:00
mean (without 1st run)	1833		00:12:40

Abbreviated Handshake	message per s	cpu TIME	cpu TIME/run
1rst run	2262	00:19:00
2nd run	2514	00:28:00	00:09:00
3rd run	2503	00:38:00	00:10:00
4th run	2913	00:45:00	00:07:00
mean (without 1st run)	2643		00:08:40

Abbreviated + session Cache	message per s	cpu TIME	cpu TIME/run
1rst run	1381	00:24:00
2nd run	1650	00:36:00	00:12:00
3rd run	1555	00:54:00	00:18:00
4th run	1594	01:07:00	00:13:00
mean (without 1st run)	1600		00:14:20

I tried a little optimization where I check is session exist before to put it in Redis. Abbreviated + session Cache ()	message per s	cpu TIME	cpu TIME/run
1rst run	1537	00:23:00
2nd run	1688	00:36:00	00:13:00
3rd run	1650	00:51:00	00:15:00
4th run	1885	01:02:00	00:11:00
mean (without 1st run)	1741		00:13:00

"Message per s" and "cpu TIME/ run" sounds inversely proportional which makes sense :thinking:

I had a look with visualVM profiler, and hotspot seems to be around put/get access of sessionCache. In particular, all redis call. About Deserialization/Serialization this sounds less impacting than redis call but decoding SessionTicket sounds more expensive than encoding it. This sounds to confirm previous result.

There is maybe some solution to improve this :

Put session in cache only for full handshake. (If I well understand session does not change after resume so we don't need to add it again ?)
Do not try to look in SessionCache if session is already in InMemoryConnectionStore (this changes the behavior a bit meaning that removing from SessionCache does not guaranty that the session is totally removed => This is opposite to my previous "needs description" but maybe it make sense if it improves performance)
we could add a contain method to SessionCache which could avoid some useless TicketSession decoding.

boaks commented 3 years ago

Put session in cache only for full handshake.

I'm not sure, if such sessions then may expire. But that may depend more on the intended session-management of the cache.

we could add a contain method to SessionCache which could avoid some useless TicketSession decoding.

Sure, but above you wrote, that it is "less impacting", so I wouldn't expect a large improvement.

I guess, if you adapt your test with a certain amount of application records and then do the handshakes in parallel, then you may also see the "blocks", as mentioned in #1343

So FMPOV, one of the important decisions is, if the PUT/GET/DEL of the connection store must be in atomic with these operation in the session-cache, or if they could be decoupled. But, as I wrote at the begin, I still busy with other stuff. So, everybody is welcome to help.

sbernard31 commented 3 years ago

Here is my current understanding.

Firstable, wording is bit confusing because in fact SessionCache is a kind of Persistent/Shared Session Store and InMemoryConnectionStore is a connection store + a kind of local session cache. So below I will use store(SessionCache) and local cache(map in ConnectionStore) wording.

If we want to focus on performance : 1) We should look in store only if value is not in local cache. (this will decrease drastically the access to store) 2) Local cache should expire some time to force to recheck value in store.

In fact all of this could be achieve in a SessionCache implementation with an internal map as cache + requesting redis store when needed. But the internal map will be a kind of duplicate of local session cache in InMemoryConnectionStore.

So I'm currently thinking that maybe a right way to do will be to :

Remove the local session cache in InMemoryConnectionStore.
provide a default inMemorySessionCache.

PUT/GET/DEL of the connection store must be in atomic with these operation in the session-cache, or if they could be decoupled.

"Decoupled" means executed in another thread ? For GET, I'm not sure it's worth creating an async API. (especially if we have a local cache which prevent access to store) For PUT/DEL, If think it could be decoupled but maybe it depends of use cases. Note that this decoupling (for PUT/DEL) could be done internally in SessionCache, I mean implementation can create new task to put/delete data in store by its own thread pool. (Or I missed something?)

boaks commented 3 years ago

For GET, I'm not sure it's worth creating an async API.

No, but a async PUT will change the behavior of the GET. At least the current implementation, without "caching" the GET, will behave different, if the PUT is async, because a new GET may have a different result, if the PUT is not completed.

could be done internally

Yes, but it changes the details of the GET and maybe DEL (see above). So define a async API would in my opinion also need to define the interaction of both.

For changing the names: I think, that after a 2.5 starting over with a 3.0 would make sense, because I would like to get rid of the deprecates. That would be then also a chance to adapt the names.

boaks commented 3 years ago

Just for information:

The #1424 gets more and more available.

built-in CID-load-balancer

Feedback is very welcome.

erikmell commented 3 years ago

Currently you can just persist the session, which requires a "resumption handshake" to get the association data. To persist the association data, I consider in the future a graceful shutdown in order to be able to save the "record sequence numbers" (they must not change afterwards). But for now, it's not implemented. If you consider/plan to implement it, I would welcome a contribution. But even an experience report will be really helpful.

Thank you all for your detailed answers and all the effort you put in to Californium. We are about to release a series of battery powered devices with with lifetime >7-10 years. Hence the need for service-windows on the server side with graceful shutdown restart come up, Every handshake is shorter lifetime for these devices. It is very likely that we will start laborate on this quite soon. Should I file a separate ticket to continue discussions on that? I'm pretty certain we will get a OK from our managers to make a contribution.

boaks commented 3 years ago

to make a contribution.

Welcome.

Just to mention: If your device comes up with "different" ip-endpoints (address:port) at the server, you may need DTLS CID implemented on the client as well. That is required even without graceful-startover just in order to prevent from handshake after a quiet period. See Bypassing-NATs.

My current plan was to start over with 3.0 after the 2.5.0, removing the deprecates and then clean up the Connection/DTLSSession/DTLSConnectionState in order to prepare for the graceful-shutdown-restart. My feeling is, that will take until Christmas (the preparation not the graceful-startover :-) ). I'm not sure, if a faster timeline is realistic.

erikmell commented 3 years ago

If your device comes up with "different" ip-endpoints (address:port) at the server, you may need DTLS CID implemented on the client as well. That is required even without graceful-startover just in order to prevent from handshake after a quiet period. See Bypassing-NATs

Thank you, we have CID enabled on the devices using mbed tls. We have both scenarios, devices on private-apn with static ip, and devices coming via public APN:s that frequently changes IP.

boaks commented 3 years ago

For the "graceful shutdown" I prepared a first PoC in PR #1468

Checkout that PR, the "demo-apps/cf-extplugtest-server" contains a section Benchmarks - Graceful Shutdown with instructions. That very simple demo just serializes all connections and removes them from the connector. afterwards they can be loaded again into the connector.

boaks commented 3 years ago

I started to test some ideas about the "distributed session cache":

full handshake:

after success, transfer the session (async PUT).

resumption handshake:

same node/session may be found locally. FMPOV, a (async GET) request is required to ensure, that the session is still valid (means not aborted by an alert). That requires also the "state" of the transfer of that session on the full-handshake.
session is not found locally, then it requires a (async GET) request.

With that, I think, deployments, which use a "distributed" session store, MUST always use a HelloVerifyRequest, because they always require to do an potential "expensive" GET request.

(async) requests are required, to relax the thread pool (reduce the influence of handshakes according other appl-records).

Alternative it may be assumed, that there are no alert ... (I don't feel too comfortable with that).

boaks commented 3 years ago

A question about a corner-case:

Full-handshake:

successful, but the transfer of the session fails. What should happen?

sbernard31 commented 3 years ago

successful, but the transfer of the session fails. What should happen?

It seems to me that persisting the session is just best effort if we lost some It's OK. I mean when this happen this is just a full handshake instead of abbreviated one.

Currently, I have more concern about how to be able to remove all connection + session for given compromised credentials in a way we can be sure there is no race condition. (I didn't think too much of this, but I feel that we can maybe face some problem here ? :thinking:)

same node/session may be found locally. FMPOV, a (async GET) request is required to ensure, that the session is still valid.

Please read this https://github.com/eclipse/californium/issues/1345#issuecomment-700732372 again (if this is not already done :grin:), I feel we should maybe change this idea about local cache / Session cache and replace this by a SessionStore with a default in memory implementation and so it's up to the user to create a persistent one with eventually local memory cache. This way it's up to him to check or not in persistent store when there is something in local memory cache. (I hope this is not so confusing)

boaks commented 3 years ago

Session cache and replace this by a SessionStore with a default in memory implementation.

FMPOV, the current InMemoryConnectionStore should be at last very close to that. It is possible to have connections only with a session or ticket, but no address. There maybe some unseen pitfalls, but generally, that should work.

boaks commented 3 years ago

Currently, I have more concern about how to be able to remove all connection + session for given compromised credentials in a way we can be sure there is no race condition. (I didn't think too much of this, but I feel that we can maybe face some problem here ? thinking)

FMPOV, it usually takes a long time until such compromised credentials are detected. With that, it's more important, that "over the time", all connections in the "cluster" are removed. But maybe I misinterpret the "race condition".

sbernard31 commented 3 years ago

FMPOV, the current InMemoryConnectionStore should be at last very close to that. It is possible to have connections only with a session or ticket, but no address. There maybe some unseen pitfalls, but generally, that should work.

I'm not sure if you agree or not with my idea ?

sbernard31 commented 3 years ago

FMPOV, it usually takes a long time until such compromised credentials are detected. With that, it's more important, that "over the time", all connections in the "cluster" are removed. But maybe I misinterpret the "race condition".

I try to explain a bit more. :-) Imagine you don't want to allow a given credential to be used anymore (E.g. a given psk ID because it is compromised or just changed) Currently my idea (maybe I'm wrong), is to remove all in memory connection in the cluster and remove session in share store. My concern is that maybe as those 2 operations are not done at same time, we could face some race condition which could end with some session or connection which are not deleted.

Eg. I remove connection, then I remove session. (but between a new connection was created)

boaks commented 3 years ago

I'm not sure if you agree or not with my idea ?

I don't see anything, which holds someone up from trying that. Therefore my hint, that the current implementation should already be able to store such "sessions only". Or do I oversee something?

sbernard31 commented 3 years ago

I think about maybe removing : protected final ConcurrentMap<SessionId, Connection> connectionsByEstablishedSession; from InMemoryConnectionStore This way when we resume a session we only search in SessionStore (or SessionTicketStore) and no more in this local store :point_up:.

The idea, user implements its SessionStore and use local memory cache if needed and chose this local cache behavior.

(This refactoring could maybe be a chance to also refactor Connection/DTLSession class in a way it better reflect DTLS concept : https://github.com/eclipse/californium/issues/1163#issuecomment-564692966 by maybe removing DTLSSession)

So a connectionStore which store "dtls connection" and a SessionTicketStore which stores SessionTicket.

boaks commented 3 years ago

My feeling is, remove the "SessionCache" and try out, if the current connection store does it.

sbernard31 commented 3 years ago

I'm lost I have no idea how this could be possible with current connection store ? Unless I don't get what you call "current connection store".

boaks commented 3 years ago

My understanding of such a in memory solution: If a session gets established, transfer that to other nodes. On other nodes, add the transferred session to the connection store. My point is, if it's a "direct in memory solution", the sychronization seems to be critical and so I would remove the SessionCache callbacks.

sbernard31 commented 3 years ago

I think you don't understand me at all and I don't know how to explain it better :-/ My idea is about refactoring the code to make all those concept clearer and allow following usecase :

in memory sessionTicketStore (default)
persistent SessionTicketStore (for failover)
shared persistent SessionTicketStore (for cluster)
shared persistent with local memory cache (for cluster and optimized access to shared store)

(unless it's me who doesn't understand you :sweat_smile: )

boaks commented 3 years ago

My idea would be, Start with a PoC for such a in memory session store. Demonstrates, that it works to distribute the session ahead. To do that "fast", I would just use the connection store. It possible to store connections with tickets or sessions only there. At least, I think it should be possible. I don't think, that define (theoretical) APIs first works better. At least not in my experience.

boaks commented 3 years ago

Let me add:

The idea of a "pre-distributed session in memory solution" seems to be the simplest, if it works.

If not, my impression so far is, that a "async remote API", will not really fit into that session cache nor connection store. Such a API requires a very different redesign. I started to think on also mark session ids with node-ids (similar to the cid), and then just on resumption try to load the session from such a specific node (using the introduced cluster internal communication). Such an approach will unfortunately require a lot of development and test time.

sbernard31 commented 3 years ago

I don't think, that define (theoretical) APIs first works better. At least not in my experience.

Clarify concept, separate concern and define API first is how I work on Leshan trying to keep the code as understandable as possible.

boaks commented 3 years ago

@erikmell

PR #1486 gets more and more "usable". If you're interested in first experience, your feedback will be welcome.

erikmell commented 3 years ago

@erikmell

PR #1486 gets more and more "usable". If you're interested in first experience, your feedback will be welcome.

Sorry for not answering in a long time. Had a long christmas leave and after coming back there's been too much to do and some organisational changes. This will definitely be a good thing for our battery powered NbIoT devices, not forcing a handshake after server restart.

boaks commented 3 years ago

The "wip feature" is on master.

Some docu how to use it for the extended plugtest demo server is here:

Benchmarks - Graceful Shutdown

If you want to test it on your own server's, CoapServer.saveAllConnectors is a good point to start.

Just to mention: It's wip, this means especially, californium may change the data format. As long as you use it for updates of your parts keeping Californium at the "commit", you should be able to use it.

boaks commented 3 years ago

The blue/green update with DTLS graceful restart is available for testing!

Checkout PR #1547 !

boaks commented 3 years ago

@sbernard31

The master (upcoming 3.0.0-M1, when ever :-) ), contains now the redesigned DTLS-state. The previous DTLSSession is split into DTLSSession and DTLSContext. I'm still not sure, if renaming "connection" into "relation" really clarifies more than it mixes up, so I kept that term "connection".

With that, it seem for me:

if a "session cache" (or renamed to "SessionStore") is used, the dtls-connection-store may use that instead of the own connectionsByEstablishedSession (the change in the implementation is pending).
if a "session cache" is NOT used, I would prefer to keep using the current approach with connectionsByEstablishedSession.

Therefore I think, removing the connectionsByEstablishedSession is too much, but disable the use, if a "Session Cache/Store" is used, that should work.

FMPOV, with the current interface SessionCache (maybe renamed to Store), it should be possible to implement the solution you proposed. On "established" would then require not only to store it, it mus also deploy it to other nodes.

So:

Rename SessionCache into SessionStore?
disable the use of connectionsByEstablishedSession, if the SessionCache/Store is provided?

sbernard31 commented 3 years ago

The previous DTLSSession is split into DTLSSession and DTLSContext

:+1:

I'm still not sure, if renaming "connection" into "relation" really clarifies more than it mixes up, so I kept that term "connection".

I understand this could be a lot of renaming. What currently make the current Connection concept from Californium differ from the DTLS concept ? Regarding https://github.com/eclipse/californium/issues/1163#issuecomment-564692966, I guess that :

DTLS Session concept matches DTLSSession
DTLS connection concept matches DTLSContext ?
and Connection is a container for information of the relation which is not so clear to me.

Therefore I think, removing the connectionsByEstablishedSession is too much, but disable the use, if a "Session Cache/Store" is used, that should work.

Considering :
"if a "session cache" (or renamed to "SessionStore") is used, the dtls-connection-store may use that instead of the own connectionsByEstablishedSessionconnectionsByEstablishedSession" If we provide an InMemorySessionStore we should have the same behavior than without SessionStore at all. If this is true then we should be able to remove connectionsByEstablishedSession attribute from InMemoryConnectionStore ? I guess I totally missed something ? :sweat_smile:

On "established" would then require not only to store it, it mus also deploy it to other nodes.

I don't get this part.

Rename SessionCache into SessionStore?

If you agree this is more a store than a cache, I think renaming makes sense.

disable the use of connectionsByEstablishedSession, if the SessionCache/Store is provided?

I sill think the connectionsByEstablishedSession should be deleted but probably because I missed something.

boaks commented 3 years ago

Basically, I prefer to have the "SessionCache/Store" optional. With that, I don't plan an implementation of that interface. And therefore keeping "connectionsByEstablishedSession" as "default" is for me the preferred way. (An alternative would be to add a simple in-memory-hashmap based implementation by default, if no other is provided.)

On "established" would then require not only to store it, it mus also deploy it to other nodes. I don't get this part.

Hm, I considered that as the solution you proposed. If not, simply ignore it.

boaks commented 3 years ago

If you agree this is more a store than a cache

The difference of these terms is for me:

a store has similar access times, if the same items are requested.
a cache may speed up, if the same items are requested.

Though a "time consuming get" would require a change in the synchronization scope (with that, the issues started), I think, a "cache" behavior is not , what should be the defined behavior.

sbernard31 commented 3 years ago

An alternative would be to add a simple in-memory-hashmap based implementation by default, if no other is provided.

I feel it would be better. This way there is only 1 behavior and just several implementation, it seems more straightforward to me. I can not the the benefits of the other alternative(keeping connectionsByEstablishedSession attributes) ? Do you have a specific one in mind ?

eclipse-californium / californium

Evolving the SessionCache #1345

948

1163

1343