Closed VenuReddy2103 closed 5 years ago
Thanks @VenuReddy2103. Let me know when this is ready for review.
Sorry @venureddy2103 I was busy on other things today. Will try to review tomorrow.
@VenuReddy2103 I tried to review this today, but it's essentially impossible. I can't work out the relationship between this and #742, which I've already reviewed extensively. And the issues mentioned in the previous review of #742 appear not to have been addressed yet.
Please self-review this, explain how you would like me to review it, and explicitly request review (with a written comment) when it is ready for my review. In the meantime I'm removing myself from the reviewers list.
@VenuReddy2103 Still no reply to https://github.com/amino-os/Amino.Run/pull/794#issuecomment-500074934 ?
The data rate calculation was not correct; I have fixed it. I will test further to find and fix issues, and will notify when this is ready for review.
Still working on this PR. Progress so far: [...] `KernelClient.KernelServerInfo`. Yet to do: the current heartbeat frequency is 1 second, and at that interval we send heartbeats and measure latency and data rates to all the available servers. This process needs to be optimized.
The metrics measurement process is independent for each server, and the frequency of measurement also differs per server. The following mechanism is used:
A `metricsTimer` and a `metricPollPeriod` are maintained per kernel server (in `KernelClient.KernelServerInfo`). The period initially starts at `MIN_METRIC_POLL_PERIOD_MS`.
[...] (`metricPollPeriod`) by twice. This continues until `metricPollPeriod` becomes `MAX_METRIC_POLL_PERIOD_MS`. `MAX_METRIC_POLL_PERIOD_MS` ensures that the interval between metrics measurements does not exceed this time.
[...] `metricPollPeriod` to `MIN_METRIC_POLL_PERIOD_MS`.
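The period handling described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name `MetricPollPeriod`, the method names, and both constant values are assumptions, and the conditions that trigger the doubling and the return to the minimum are not recoverable from this thread, so they are left to the caller.

```java
/** Sketch of a per-server poll period that doubles up to a cap (hypothetical class). */
final class MetricPollPeriod {
    // Hypothetical values; the PR's actual constants are not shown in this thread.
    static final long MIN_METRIC_POLL_PERIOD_MS = 1_000L;
    static final long MAX_METRIC_POLL_PERIOD_MS = 64_000L;

    // One instance would be kept per remote kernel server.
    private long metricPollPeriod = MIN_METRIC_POLL_PERIOD_MS;

    /** Returns the period to use when scheduling the next measurement. */
    long current() {
        return metricPollPeriod;
    }

    /** Doubles the period, never going beyond the maximum. */
    long increase() {
        metricPollPeriod = Math.min(metricPollPeriod * 2, MAX_METRIC_POLL_PERIOD_MS);
        return metricPollPeriod;
    }

    /** Goes back to the minimum period. */
    long reset() {
        metricPollPeriod = MIN_METRIC_POLL_PERIOD_MS;
        return metricPollPeriod;
    }
}
```

With these assumed values the period would go 1 s, 2 s, 4 s and so on, then stay at 64 s until `reset()` is called.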
Link speed between the two systems used for testing: 1000 Mbps.
KS1 (192.168.59.2), running on system1 along with the OMS. Time taken for data rate to stabilize from KS1 to KS2: 101 seconds. Final data length used in heartbeats: 65536.
KS2 (192.168.59.4), running on system2. Time taken for data rate to stabilize from KS2 to KS1: 63 seconds. Final data length used in heartbeats: 32768.
Data rate unit is in Bytes/Sec. Latency is in nanoseconds (ns).
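For reading the logs, here is a small sketch of how values in these units relate to each other. It is only an illustration under assumptions: `LinkMetrics` and its methods are hypothetical names, and timing a call with `System.nanoTime()` is one plausible way to get nanosecond latency, not necessarily how this PR measures it.

```java
/** Sketch of unit handling for the reported metrics (hypothetical helper class). */
final class LinkMetrics {
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Elapsed wall time of one call, in nanoseconds. */
    static long measureNanos(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return System.nanoTime() - start;
    }

    /** Data rate in bytes/sec for payloadBytes transferred in elapsedNanos. */
    static double dataRateBytesPerSec(long payloadBytes, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return payloadBytes * (double) NANOS_PER_SECOND / elapsedNanos;
    }

    /** Converts a link speed in Mbps (10^6 bits per second) to bytes/sec. */
    static double mbpsToBytesPerSec(double mbps) {
        return mbps * 1_000_000.0 / 8.0;
    }
}
```

For example, `LinkMetrics.mbpsToBytesPerSec(1000)` gives 125,000,000 bytes/sec, which is the upper bound to compare logged data rates against on the 1000 Mbps test link.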
PFA the test logs below: ks-192.168.59.2.log ks-192.168.59.4.log
Fixed the issue of the app client not exiting. The app client creates a dummy local kernel server, which is meant to route RPC calls to the remote kernel server where the MicroService it interacts with resides. That local kernel server is not registered with the OMS and does not send heartbeats to the OMS, yet we were measuring node metrics from it to all the remaining remote kernel servers. Such app clients do not allow deployment of MicroServices on them, and they do not participate in automatic migration of MicroServices. Hence, they should not measure node metrics.
Fundmover app logs: fundmover-app.log fundmover-ks1.log fundmover-ks2.log fundmover-oms.log
HanksTodo app logs: Have 4 Kernel servers with 2 servers in each region. hankstodo-app.log hankstodo-ks1.log hankstodo-ks2.log hankstodo-ks3.log hankstodo-ks4.log hankstodo-oms.log
KVStore app logs: Have 4 Kernel servers with 2 servers in each region. kvstore-app.log kvstore-ks1.log kvstore-ks2.log kvstore-ks3.log kvstore-ks4.log kvstore-oms.log
OK, to avoid further delays I'm going to merge this PR and make the proposed improvements in follow-up PRs.
This PR measures the latency and data rate from each node to every other available node and reports these metrics to the OMS in the existing heartbeat between the kernel server and the OMS. Handling of the received metrics on the OMS is not part of this PR. This PR is the same as the old PR #742; it has just been raised from a new fork, and the old PR is being closed.
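To make the shape of the reported data concrete, here is a rough sketch of per-remote metrics that a kernel server could attach to its heartbeat. The names `NodeMetric` and `HeartbeatMetrics`, their fields, and the port number in the usage note are assumptions for illustration, not the types used in this PR.

```java
import java.net.InetSocketAddress;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Latency and data rate towards one remote kernel server (hypothetical type). */
final class NodeMetric {
    final long latencyNanos;      // latency in nanoseconds
    final double rateBytesPerSec; // data rate in bytes/sec

    NodeMetric(long latencyNanos, double rateBytesPerSec) {
        this.latencyNanos = latencyNanos;
        this.rateBytesPerSec = rateBytesPerSec;
    }
}

/** Collects the latest metric per remote server for the next heartbeat (hypothetical type). */
final class HeartbeatMetrics {
    private final Map<InetSocketAddress, NodeMetric> byRemote = new HashMap<>();

    /** Stores or overwrites the latest measurement for a remote server. */
    void record(InetSocketAddress remote, long latencyNanos, double rateBytesPerSec) {
        byRemote.put(remote, new NodeMetric(latencyNanos, rateBytesPerSec));
    }

    /** Read-only copy that could be sent along with the heartbeat. */
    Map<InetSocketAddress, NodeMetric> snapshot() {
        return Collections.unmodifiableMap(new HashMap<>(byRemote));
    }
}
```

A caller might record a measurement with `metrics.record(InetSocketAddress.createUnresolved("192.168.59.4", 22346), latency, rate)` and pass `metrics.snapshot()` to whatever builds the heartbeat message.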