Handling observe request when a thread is stuck

Raveesh commented 6 years ago

I have been trying to observe and understand the threading behavior within Californium. This issue is more a request for clarification than a bug report. I am using the 2.0.9 Milestone release.

Consider a scenario where we have 1000's of devices sending data to the cloud. The CoapClient runs on a cloud server and sends an Observe request to every device in order to receive its data. When I look at the CoapClient.java code, I see that a single instance of executor and a singleThreadPoolExecutor is used to perform an observe on all the devices.

Q1. Which part of the code can I look into to understand the behavior of the thread mechanism that is used? I am not able to wrap my head around how a single thread is able to open so many connections to so many different devices at the same time and keep receiving data.

The second part of the question concerns a behavior we observed. We ran some tests with a 15-thread pool executor that we passed to the CoapClient.java class. For some particular devices, while the client is receiving data, the data from the device suddenly stops arriving at the client, due to a stuck thread or some unknown reason. For lack of a better understanding, I am assuming it's because some thread is stuck. This stuck thread also occurs if we use a SingleThreadExecutor continuously for 24-48 hours.

I am not really sure how to debug this issue, because once the stuck thread is observed, I do not see any tcpdump logs either, which hints at Californium not requesting the next block at all.

Q2. Can you please guide me as to which part of the code I should debug to better understand the thread pool behavior during observe requests?

boaks commented 6 years ago

> When I see the CoapClient.java code, I see that a single instance of executor and a singleThreadPoolExecutor is used to perform an observe on all the devices.

I'm not sure, but maybe you are referring to CoapClient.useExecutor(). If so, there are other ways to provide Executors that match your requirements better. You may also use more than one CoapClient. In fact, most performance considerations are based on a CoapServer in the cloud and multiple CoapClients as devices. LWM2M swaps these roles in its "information reporting interface", but from my point of view this brings a lot more pain, especially if your cloud CoAP client has to be clustered. There has been some discussion about that (e.g. https://github.com/OpenMobileAlliance/OMA_LwM2M_for_Developers/issues/171), so just be aware that performance may be the smaller issue :-).

Q1.
Californium organizes the processing in three layers:

Connector
Protocol / CoapStack
Application

The Connector offers (unfortunately different) ways to configure the executor behaviour. You will find methods to define the number of receiving and sending threads according to your requirements (which means: test which setup performs best on your OS and platform). Encryption/decryption is also done in the DTLSConnector and may require a different setup (again, test it). The CoapStack / Protocol is also executed by its own Executor, either in the Endpoint or in the CoapServer, where it is then shared by all Endpoints. You may also change the default thread numbers according to your requirements (test it).

The application layer is defined either by the CoapClients and their executors or, on the server side, by the CoapResource, which may also have its own executor (see https://github.com/eclipse/californium/blob/2.0.x/californium-core/src/main/java/org/eclipse/californium/core/server/resources/ConcurrentCoapResource.java).
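On the question of how one thread can serve so many devices: CoAP normally runs over UDP, which is connectionless, so there are no per-device connections to keep open. A single socket can receive datagrams from any number of peers, and the sender address of each datagram tells them apart. The following is a minimal plain-JDK sketch of that idea only; the class `SingleSocketDemux` and its method are made up for illustration and are not Californium code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration in plain JDK; this is not Californium code.
class SingleSocketDemux {

    // Lets `peers` simulated devices each send one datagram to ONE receiving
    // socket and returns payload -> sender port as seen by the receiver.
    static Map<String, Integer> receiveFromPeers(int peers) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        List<DatagramSocket> devices = new ArrayList<>();
        try (DatagramSocket receiver = new DatagramSocket(0, loopback)) {
            receiver.setSoTimeout(2000); // fail with a timeout instead of blocking forever
            for (int i = 0; i < peers; i++) {
                DatagramSocket device = new DatagramSocket(0, loopback);
                devices.add(device); // kept open so every device has its own port
                byte[] data = ("device-" + i).getBytes(StandardCharsets.UTF_8);
                device.send(new DatagramPacket(data, data.length,
                        receiver.getLocalSocketAddress()));
            }
            Map<String, Integer> senderPortByPayload = new TreeMap<>();
            byte[] buffer = new byte[64];
            for (int i = 0; i < peers; i++) {
                // One thread reads every peer's datagram from the same socket.
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                receiver.receive(packet);
                String payload = new String(packet.getData(), packet.getOffset(),
                        packet.getLength(), StandardCharsets.UTF_8);
                senderPortByPayload.put(payload, packet.getPort());
            }
            return senderPortByPayload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (DatagramSocket device : devices) {
                device.close();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(receiveFromPeers(3));
    }
}
```

A real endpoint would additionally match each datagram to its request or observation (in CoAP, by token and peer address) before passing it up to the protocol layer.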

Q2. I'm not sure I understand your scenario. You use notifies and blockwise transfer. At a certain point in time, you don't see any messages in the tcpdump? But the devices should at least send their notifies (if the observation was not cancelled). So, do you have any logs from the clients? Why do they not continue to send notifies?

As mentioned above, that LWM2M "information reporting interface" scenario has its pitfalls. Assuming you have NATs and load balancers in between, neither the observes nor the blockwise transfers will work without "cheating". Both may cause a "change of the effective client endpoint visible to the server", and that is a pain :-).

boaks commented 6 years ago

I tested that "notifies with exchanged roles" scenario using "cf-extplugtest-client and server". My system processed 20 million notifies in about 3h and didn't get stuck. That test doesn't use the CoapClient right now, nor does it run for 48h. But I'm still puzzled by

> i do not see any tcpdump logs as well.

and, with 24-48h and a setup of "client in the cloud" and "(device) server somewhere else", I would assume that your device simply gets reattached to the internet (say, through a DSL router that gets a new address), which breaks your observe relation.

So to continue: Can you test that in a local "intranet" setup?

Can you provide more detailed information about your setup? Do you use DTLS? (IP address changes may cause messages to be dropped because of an invalid MAC.) Do you use CON notifies? (Timeouts or other communication problems will "terminate" the observe relation. Therefore the cloud server/CoapClient has to "reregister" from time to time if no new notify arrives.)

How are the devices attached to the internet? DSL? GSM? Does this cause IP address changes? Does this cause temporary disconnections?
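The "reregister if no new notify arrives" idea can be sketched as a small watchdog. This is a plain-JDK illustration under stated assumptions, not Californium's own re-registration mechanism: `ObserveWatchdog`, `onNotify`, `check` and the `reregister` callback are all hypothetical names, and the callback stands for whatever the application does to send a fresh observe request.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: re-registers an observation after too much silence.
class ObserveWatchdog {
    private final long maxSilenceMillis;
    private final Runnable reregister; // assumption: sends a new observe request
    private long lastNotifyMillis;

    ObserveWatchdog(long maxSilenceMillis, long nowMillis, Runnable reregister) {
        this.maxSilenceMillis = maxSilenceMillis;
        this.lastNotifyMillis = nowMillis;
        this.reregister = reregister;
    }

    // Call this from the notification handler.
    synchronized void onNotify(long nowMillis) {
        lastNotifyMillis = nowMillis;
    }

    // Re-registers if nothing arrived for maxSilenceMillis; returns true if it did.
    boolean check(long nowMillis) {
        synchronized (this) {
            if (nowMillis - lastNotifyMillis < maxSilenceMillis) {
                return false;
            }
            lastNotifyMillis = nowMillis; // restart the silence window
        }
        reregister.run(); // run outside the lock so a slow callback blocks nobody
        return true;
    }

    // One shared scheduler can drive the watchdogs of all devices.
    ScheduledFuture<?> schedule(ScheduledExecutorService scheduler, long periodMillis) {
        return scheduler.scheduleWithFixedDelay(
                () -> check(System.currentTimeMillis()),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        ObserveWatchdog watchdog = new ObserveWatchdog(60_000L, 0L,
                () -> System.out.println("re-registering observe"));
        watchdog.onNotify(30_000L);
        System.out.println(watchdog.check(80_000L)); // 50s of silence: nothing to do
        System.out.println(watchdog.check(90_000L)); // 60s of silence: re-registers
    }
}
```

Passing the time in as a parameter keeps the decision logic testable without sleeping; when driven by `schedule`, the caller would pass `System.currentTimeMillis()` to the constructor and to `onNotify` as well.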

boaks commented 6 years ago

Any news on this?

boaks commented 6 years ago

Without additional information, I'm not able to work on this issue. If you have more information, please add it here or create a new issue.

Raveesh commented 6 years ago

Hi, Apologies for the late response.

@boaks Thanks for the detailed response, and apologies on my part for not responding earlier.

I just want to answer your questions as well as I can.

> Can you provide more detailed informations about your setup?

We wanted to simulate a scenario where we have deployed 1000's of sensors across diverse locations which continuously send data to the cloud, and where the data/resources change quite frequently (assume once every 2 seconds). (I just want to point out that, although this is a pub-sub pattern, we are testing with different protocols and for now we are looking at CoAP.)

The amount of data sent per resource is quite big, as we need to comply with a certain format, and requires around 8-12 blocks per transfer. The device CoAP stack was written by a team here on TinyOS, based on the specifications. The cloud side uses the Californium client to observe these resources. So the scaling that we need to test is that of the CoapClient rather than the CoapServer. The devices and a controller communicate over a 6LoWPAN stack, and the controller has GSM, over which it sends the data to the cloud.

So the stack will kind of resemble

    Cloud Coap Client (californium)
                +
                |
                |   ( Repeated across 1000's of devices)
                v
            Controller
                +
                |
                |
                v
    Device (server .. proprietary stack)

Because of GSM, network connectivity is sometimes quite poor, so the maximum retry limit in the Reliability layer is reached quite easily and the observes are refreshed.

To shed a little more light on this: since we wanted to test the setup with 1000's of devices, we wanted a CoapClient that scales easily. But since this is not a GET, which is closer to a request-response model, but rather an OBSERVE, there is a background thread that keeps listening for notifications. Each CoapClient in our case is created within an Akka actor (Akka was used for scaling the clients, on the assumption that all GET observes are handled by different threads). But we quickly hit a maximum limit, because the number of threads increased linearly with the default CoapClient. To tackle this, we created one executor and shared it across all CoapClients, but this model does not suit our purpose because it does not give fine-grained control over the individual devices. I am not sure whether this is really an issue, but according to our analysis the stack architecture looks as follows:

    CoapClient 1   CoapClient 2   CoapClient 3   ....   CoapClient n
                +
                |
                |   ( Repeated across 1000's of devices)
                v
    Reliability Layer Executor thread pool   # threadpool count << n
                +
                |
                |
                v
    UDP Connector (probably 1).

So this is more like an inverted pyramid architecture. I would therefore like to rephrase my earlier question: there appears to be a bottleneck when we scale up and create 1000's of CoapClients, and because these are long-running threads in the backend, I am not sure how I can scale them correctly. Is it right to create one CoapClient object per device? How should I decide on the number of threads in my executor to scale to 1000's or even millions of devices?

Apologies for the long post, but I hope some things are now clearer so that you can help us with this. Awaiting your response.

Thanks

boaks commented 6 years ago

> Hi, Apologies for the late response.

Collecting so much information takes time :-).

Yes, the current CoapClient implementation doesn't scale too well, maybe "even less" since the introduction of the "NotificationListener" 2 years ago.

But I'm not sure whether this issue is now about scaling and performance or about a "stuck thread". Can you please clarify?

> But since this is not a GET which is more similar to a request response model, but rather a OBSERVE, There is a background thread that keeps listening for notifications .

There is no difference between GET and OBSERVE with regard to threading! Generally, Californium tries to enable you to use separate executors for the different layers (connector, coap-stack, application), but in principle it's not required to do so. So it's up to you to find out what works best for your requirements :-). If you don't call setExecutor or useExecutor on the CoapClient, the processing of responses and notifications is not handed over to an application executor and is instead executed by the coap-stack's executor. See CoapClient:

    protected void execute(Runnable job) {
        ExecutorService executor;
        synchronized (this) {
            executor = this.executor;
        }
        if (executor == null) {
            job.run();
        } else {
            try {
                // use thread from the client executer
                executor.execute(job);
            } catch (RejectedExecutionException ex) {
                if (!executor.isShutdown()) {
                    LOGGER.warn("failed to execute job!");
                }
            }
        }
    }

And, as mentioned above, what's best for you depends on your requirements.
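To make the effect of the excerpt above concrete, here is a minimal plain-JDK sketch of the same decision: without an executor the job runs on the calling thread, with one it is handed over. `HandOffSketch` and the thread names "coap-stack" and "app" are made up for illustration; they only stand in for the protocol thread and an application executor.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the hand-off decision; not Californium code.
class HandOffSketch {

    // Same decision as in the excerpt: no executor means "run on the caller's thread".
    static void execute(ExecutorService executor, Runnable job) {
        if (executor == null) {
            job.run();
        } else {
            executor.execute(job);
        }
    }

    // Delivers one fake notification from a thread named "coap-stack" and
    // returns the name of the thread that actually ran the handler.
    static String handlerThread(ExecutorService appExecutor) {
        ExecutorService stack =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "coap-stack"));
        try {
            Callable<String> delivery = () -> {
                CompletableFuture<String> ran = new CompletableFuture<>();
                execute(appExecutor, () -> ran.complete(Thread.currentThread().getName()));
                return ran.get(2, TimeUnit.SECONDS);
            };
            return stack.submit(delivery).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            stack.shutdownNow();
        }
    }

    // First entry: no application executor. Second entry: with one.
    static String[] demo() {
        ExecutorService app = Executors.newSingleThreadExecutor(r -> new Thread(r, "app"));
        try {
            return new String[] { handlerThread(null), handlerThread(app) };
        } finally {
            app.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [coap-stack, app]
    }
}
```

In the first case a slow handler would hold up the protocol thread itself, which is why the choice depends on how much work the handler does.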

> There is a background thread that keeps listening for notifications.

Only if you use such a client-application-executor. There is also a timeout scheduler for observe re-registrations, but there should only be one of those.

So, can you check whether it works better for you not to use a client-application-executor?

boaks commented 6 years ago

> To tackle this issue we created a shared reference of the executor to share across all Coap Clients but this model is not suitable to our cause as it does not give a fine grained control over all devices.

This really makes me wonder. What do you expect? Californium offers the possibility to set up the CoAP clients either with their own executors (which comes with the pain of many threads), with a common executor (which comes with the pain of sharing the thread resources), or even by reusing the protocol executor (which also comes with the pain of sharing the thread resources, but enables you to switch efficiently to your very own threading). Given that all these options are offered, I just don't see which option you are missing for your use case. So, what kind of threading do you think you need for processing the notifies? I guess you will need to implement it on your own, using the protocol threads only for the initial delivery of the notifies and then scheduling the processing onto whatever you need to use.
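One possible shape for such "own threading" is sketched below, purely as an assumption about what fine-grained control could mean: the delivering thread only enqueues, and each device is pinned to one of a few single-threaded lanes, so notifications of one device stay in order while the thread count stays fixed. `StripedDispatcher` and its methods are hypothetical names in plain JDK, not part of Californium.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical dispatcher: a fixed number of lanes shared by all devices.
class StripedDispatcher {
    private final ExecutorService[] lanes;

    StripedDispatcher(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Meant to be called from the delivering (protocol) thread: it only enqueues.
    // The same deviceId always maps to the same lane, which preserves its order.
    void dispatch(String deviceId, Runnable work) {
        lanes[Math.floorMod(deviceId.hashCode(), lanes.length)].execute(work);
    }

    // Stops accepting work and waits until all queued work has finished.
    void shutdownAndAwait() {
        try {
            for (ExecutorService lane : lanes) {
                lane.shutdown();
            }
            for (ExecutorService lane : lanes) {
                if (!lane.awaitTermination(10, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("lane still busy");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Simulates interleaved notifications and records the order seen per device.
    static Map<String, List<Integer>> demo(int devices, int notifiesPerDevice, int laneCount) {
        StripedDispatcher dispatcher = new StripedDispatcher(laneCount);
        Map<String, List<Integer>> seen = new ConcurrentHashMap<>();
        for (int seq = 0; seq < notifiesPerDevice; seq++) {
            for (int d = 0; d < devices; d++) {
                final String id = "device-" + d;
                final int n = seq;
                // Each list is only ever touched by the one lane owning that device.
                dispatcher.dispatch(id,
                        () -> seen.computeIfAbsent(id, k -> new ArrayList<>()).add(n));
            }
        }
        dispatcher.shutdownAndAwait();
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> seen = demo(1000, 10, 4);
        System.out.println(seen.size() + " devices handled on 4 threads");
    }
}
```

The trade-off is that a slow device delays the other devices on its lane, so the number of lanes and any per-device limits would still have to be tuned by testing.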

boaks commented 6 years ago

As I wrote above, if you have an alternative idea about the threading, just let us know.

eclipse-californium / californium

Handling observe request when a thread is stuck #700