chombium opened 2 years ago
Solution number 1 kinda makes sense to me in some ways, but I am not sure how in practice it substantially differs from just reducing the size of the buffers. If we could give an overall memory budget to the diodes and the storage and track and prune it accurately, that'd maybe be cool. Long term, I'm still half inclined to replace log-cache, but that's not something I've really got any degree of alignment on.
Solution number 2 seems like it more directly solves the problems you're facing. Are you talking about essentially re-hashing envelopes that fall on unavailable nodes (which maybe makes sense but sounds like non-zero work), or trying to assume there's a separate network path that is available to that node and nodes don't have consistent views? I know there are some feature paths that technically exist for replication, but I'm very unfamiliar with them and it's my understanding they might not work at all.
That said, would it make sense to do the work to better handle unavailable gRPC connections and use that to not store data at all in the diode for that connection until it becomes available again? That, I think, would more surgically solve the issue you're facing, although it wouldn't solve the availability-of-the-data problem.
Thanks for the quick reply and the suggestions @Benjamintf1
For solution 1, maybe as a quick fix we could just make the size of the diodes configurable via an env var, which would also be added to the BOSH job. The one problem we see is how to reliably determine how much memory a diode uses, so that the pruning can be done properly.
The problems we are facing could hit others as well: anyone who runs enough apps with many instances will end up in the same situation we are in. That's why we think solution no. 2 would be the better one. As you've already written, this solution is essentially about re-hashing the envelopes and re-connecting the gRPC connections to the nodes which are available. We think that blocking the diodes is not a good solution, as it would mean that some applications won't get logs and metrics. No logs and metrics for some apps means a bad user experience for the app developers and, what's more dangerous, no metrics for some platform components.

As the Log Cache nodes only work together and don't share common state, we were also thinking of building a new, central component which checks the state of all Log Cache peer nodes and pushes an updated list of the peer nodes to each node, as depicted in the figure below. Each Log Cache node would get two new endpoints (gRPC, REST or something similar): one used for checking the health status of the node, and one where the current list of active Log Cache peer nodes would be sent. The new component would get the status of the Log Cache nodes via the health endpoint and generate the current list of active peer nodes from that. This list can be pushed to the nodes, or the nodes can pull the list in the gRPC StatsHandler if the connection is closed.
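To make the idea a bit more concrete, here is a minimal sketch of such a central component, assuming plain HTTP endpoints; the `peerListUpdater` type and the `/health` and `/peers` paths are hypothetical and only meant to illustrate the poll-and-push loop, not a concrete API proposal:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// peerListUpdater is a hypothetical central component that polls the health
// endpoint of every Log Cache node and pushes the resulting list of active
// peers back to each reachable node. Endpoint paths and payload shape are
// made up for illustration only.
type peerListUpdater struct {
	nodes  []string // addresses of all Log Cache nodes, e.g. "10.0.1.1:8080"
	client *http.Client
}

func (u *peerListUpdater) run(interval time.Duration) {
	for range time.Tick(interval) {
		// 1. Determine which nodes currently answer their health endpoint.
		var active []string
		for _, n := range u.nodes {
			resp, err := u.client.Get("http://" + n + "/health")
			if err == nil && resp.StatusCode == http.StatusOK {
				active = append(active, n)
			}
			if resp != nil {
				resp.Body.Close()
			}
		}

		// 2. Push the current list of active peers to every reachable node.
		payload, _ := json.Marshal(active)
		for _, n := range active {
			resp, err := u.client.Post("http://"+n+"/peers", "application/json", bytes.NewReader(payload))
			if err == nil {
				resp.Body.Close()
			}
		}
	}
}

func main() {
	u := &peerListUpdater{
		nodes:  []string{"10.0.1.1:8080", "10.0.1.2:8080", "10.0.1.3:8080"},
		client: &http.Client{Timeout: 2 * time.Second},
	}
	u.run(30 * time.Second)
}
```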
We think that it would be good to solve this problem properly and possibly avoid or minimize log loss.
If the long-term goal is to replace Log Cache, is there a possibility to speed things up? We would be glad to help ;)
Log-Cache Memory Issue
Problem Description
In case of an incident where one or more Log Cache nodes become unavailable (e.g. an availability zone outage, failure of multiple Log Cache nodes, or a CF update), the memory utilization of the nodes which are still reachable rises, because they cannot forward the received envelopes to the nodes where they should be stored. The still-running nodes get flooded with log envelopes that are waiting to be sent to the unreachable nodes; in the end they run at high memory usage and cannot work properly.
This happens because of Log Cache's source-id affinity. All log envelopes from a particular source-id are stored on a single Log Cache node so that they can be accessed easily and quickly via the Log Cache API. When a Log Cache node receives a log envelope, it checks whether it should store it locally; if not, it sends it to the owning node via gRPC. The BatchedIngressClient creates a diode with a gRPC connection to every other Log Cache node. At the moment there are no connection retries configured on the gRPC connections, and the whole routing is set up only once when the Log Cache job starts. This means that if one or more Log Cache nodes are down, or the Log Cache process is not working, the gRPC connections stay disconnected and the diodes fill up. As the diodes fill, the memory on the Log Cache VM rises, and depending on the number of VMs which are down and the number of deployed application instances, the available memory on the healthy VMs can quickly be exhausted and the VMs become unresponsive.
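For readers unfamiliar with the routing, here is a minimal sketch of the general idea behind the source-id affinity. This is not the actual Log Cache code; the hashing, lookup and diode handling are reduced to the bare mechanism, with crc32 standing in for whatever hashing scheme Log Cache really uses:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// nodeIndexFor picks the node that owns a source-id. The mapping is a pure
// function of the source-id and the node count, so every node routes a given
// source-id to the same peer.
func nodeIndexFor(sourceID string, numNodes int) int {
	return int(crc32.ChecksumIEEE([]byte(sourceID))) % numNodes
}

func main() {
	numNodes := 3
	localIndex := 0 // index of the node that received the envelope

	for _, sourceID := range []string{"app-1", "app-2", "doppler"} {
		owner := nodeIndexFor(sourceID, numNodes)
		if owner == localIndex {
			fmt.Printf("%s: store locally\n", sourceID)
		} else {
			// In Log Cache this is the point where the envelope is written
			// into the diode that feeds the gRPC connection to node `owner`.
			// If that node is down, the envelope sits in the diode and the
			// memory of the forwarding node grows.
			fmt.Printf("%s: forward to node %d\n", sourceID, owner)
		}
	}
}
```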
Currently there is no solution to this problem other than restarting the log-cache process, so that connections are created to the nodes that are working at that moment. There was already one PR, which was rightfully dismissed, that simply added a memory limit for the log-cache job and restarted the process whenever the limit was reached.
Steps to Reproduce the Problem
So far we have found two methods to reproduce this problem.
Method 1: Scale Log Cache up horizontally and simulate an outage
Method 2: Log Cache Cluster IP Address Configuration Manipulation
Possible solutions
1. Limit the available memory which can be used by the diodes
To limit the memory used by the diodes for the gRPC connections to the other Log Cache nodes, the same mechanism can be used that limits the memory usage of the Log Cache store: prune old envelopes to make room for new ones and cap the overall memory use. Practically, this means another StaticMemoryAnalyzer and PruneConsultant which work on the stalled gRPC connections. The diodes are already limited by their capacity (10000 envelopes in this case), but if there are many applications with multiple instances and the nodes where their data is stored are unavailable, the memory of the nodes which should route the envelopes will still be filled. That's why we would also need to limit the overall memory used by the diodes.
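A very rough sketch of the kind of budget-and-prune loop this implies, assuming a hypothetical `diode` interface; it deliberately does not use the real StaticMemoryAnalyzer/PruneConsultant APIs, and the thresholds and "drop 10% of the fullest buffer" policy are placeholders rather than a worked-out design:

```go
package diodeprune

import (
	"runtime"
	"time"
)

// diode is a stand-in for the per-connection buffer; only the operations the
// pruning loop needs are sketched here.
type diode interface {
	Len() int         // number of buffered envelopes
	DropOldest(n int) // discard the n oldest envelopes
}

// pruneDiodes drops the oldest envelopes from the fullest diode whenever the
// process exceeds a configured heap budget.
func pruneDiodes(diodes []diode, heapBudgetBytes uint64, interval time.Duration) {
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		if ms.HeapAlloc <= heapBudgetBytes {
			continue
		}

		// Find the fullest diode and shed a slice of its oldest envelopes.
		fullest := 0
		for i, d := range diodes {
			if d.Len() > diodes[fullest].Len() {
				fullest = i
			}
		}
		diodes[fullest].DropOldest(diodes[fullest].Len() / 10)
	}
}
```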
Pros:
Cons:
2. Detect connection problems and reconnect to the healthy nodes
The idea of this solution is to detect when the gRPC connection is broken and to create a new connection to another Log Cache node without losing data. This enables re-connecting the underlying gRPC connection in the BatchedIngressClient to another Log Cache node without touching the content of the diode. Ideally, mechanisms built into gRPC itself would be used for this.
In gRPC, the mechanisms to keep connections healthy are:
Keepalive. Can be used to define the rate at which pings are sent over the connection and a timeout for the response. If no response is received, the connection will be closed and gRPC will try to reconnect with the already configured parameters (see the sketch after this list).
HealthChecks. Requires an implementation of a health endpoint on the server, and the clients have to query this endpoint to check the status of the server. If we used this method we would have to double the connections between the Log Cache nodes, which might not be a good idea.
Stats Handler. A stats handler can be attached to the gRPC client via the dial options with the WithStatsHandler function. In the handler the state of the connection can be observed and actions can be taken.
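For reference, this is roughly how the keepalive parameters and a stats handler would be wired into the dial options in grpc-go; the address and the numbers are placeholders, not a tuned proposal:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Placeholder address and values; the point is only where the options go.
	conn, err := grpc.Dial("other-log-cache-node:8080",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Keepalive: ping the server every 30s and give up if no ack arrives
		// within 10s, so a dead peer is noticed instead of the diode silently
		// filling up.
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second,
			Timeout:             10 * time.Second,
			PermitWithoutStream: true,
		}),
		// A stats handler (see the connWatcher sketch further below) can
		// observe connection begin/end events and mark the peer unavailable:
		// grpc.WithStatsHandler(&connWatcher{...}),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```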
The best solution would be to configure exponential backoff retries [2], and if the maximum number of retries is reached and the connection is in a disconnected state, to close it, mark the peer node as unreachable and connect to another peer node. There should also be a mechanism which marks unresponsive nodes as active again; that can simply be done after some time has passed, or the connections have to be checked. Everything is depicted in the following sequence diagrams.
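As a sketch of where those knobs live in grpc-go: the connection backoff can be configured via `grpc.WithConnectParams`, and a bounded per-RPC retry policy via the default service config. The service name in the method config and all the numbers below are assumptions for illustration only:

```go
package logcacheretry

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials/insecure"
)

// dialOpts shows the two backoff knobs:
//   - ConnectParams controls how the channel retries the underlying connection.
//   - The service config's retryPolicy bounds per-RPC retries; once maxAttempts
//     is exhausted the RPC fails with UNAVAILABLE, which is the point where the
//     node could be marked unreachable and re-hashing kicks in.
func dialOpts() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  1 * time.Second,
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   30 * time.Second,
			},
			MinConnectTimeout: 5 * time.Second,
		}),
		// The service name below is a placeholder for the ingress service.
		grpc.WithDefaultServiceConfig(`{
		  "methodConfig": [{
		    "name": [{"service": "logcache.v1.Ingress"}],
		    "retryPolicy": {
		      "maxAttempts": 4,
		      "initialBackoff": "0.5s",
		      "maxBackoff": "10s",
		      "backoffMultiplier": 2.0,
		      "retryableStatusCodes": ["UNAVAILABLE"]
		    }
		  }]
		}`),
	}
}
```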
In normal operation the gRPC clients send data to the gRPC server and receive proper responses. When the client sends an RPC and there are connection problems, the gRPC runtime starts retrying until either a send succeeds or the maximum number of retries is reached. Once the maximum number of retries is reached, the runtime returns UNAVAILABLE to the client and the connection status changes. In the connection stats handler the status is checked, and if it is SHUTDOWN (stats.ConnEnd), the connection is disconnected (something similar to this), the Log Cache node is marked as unavailable, and a new connection is created to a node which is available. The envelopes which are assigned to the unavailable node will be re-hashed so they are stored on the newly selected available node.
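A rough sketch of what such a stats handler could look like; the `connWatcher` type and the `markUnavailable` callback are hypothetical, and the re-hashing itself would happen in the routing layer behind that callback:

```go
package connwatch

import (
	"context"

	"google.golang.org/grpc/stats"
)

// connWatcher is a hypothetical stats.Handler that notifies the routing layer
// when the connection to a peer goes away, so the peer can be marked
// unavailable and its source-ids re-hashed onto a live node.
type connWatcher struct {
	peerAddr        string
	markUnavailable func(addr string) // e.g. triggers re-hashing in the router
}

func (w *connWatcher) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (w *connWatcher) HandleConn(_ context.Context, s stats.ConnStats) {
	// ConnEnd is reported when the connection is torn down.
	if _, ok := s.(*stats.ConnEnd); ok {
		w.markUnavailable(w.peerAddr)
	}
}

func (w *connWatcher) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return ctx
}

func (w *connWatcher) HandleRPC(_ context.Context, _ stats.RPCStats) {}
```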
There should also be a mechanism to check the state of the gRPC servers on the Log Cache peer nodes, so that a node which was marked as unavailable is used again once it becomes available. As there is no shared state between the peer nodes, if each node checked on its own whether a connection to another node can be established, every node would have to open extra connections to all other nodes, which would result in too many open TCP connections. To avoid this, a timeout mechanism can be used: every node that is marked as unavailable gets a timestamp, and once a configurable interval has passed since that timestamp, the node is marked as available again.
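A minimal sketch of that cooldown bookkeeping, assuming a hypothetical `peerState` helper whose interval would come from configuration:

```go
package peerstate

import (
	"sync"
	"time"
)

// peerState remembers when each peer was marked unavailable and treats it as
// available again once a configurable cooldown has passed. This avoids extra
// health-check connections between the nodes, at the cost of retrying blindly.
type peerState struct {
	mu        sync.Mutex
	downSince map[string]time.Time
	cooldown  time.Duration
}

func newPeerState(cooldown time.Duration) *peerState {
	return &peerState{downSince: map[string]time.Time{}, cooldown: cooldown}
}

func (p *peerState) markUnavailable(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.downSince[addr] = time.Now()
}

func (p *peerState) available(addr string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	since, down := p.downSince[addr]
	if !down {
		return true
	}
	if time.Since(since) >= p.cooldown {
		// Cooldown expired: optimistically consider the peer available again.
		delete(p.downSince, addr)
		return true
	}
	return false
}
```

The trade-off of this approach is that a node which recovers before the cooldown expires will only be picked up again once the interval has passed.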
Pros:
Cons:
What do you think about this? Is this worth considering as a possible solution for the memory overloading of the Log Cache nodes when some of the nodes become unavailable?