amino-os / Amino.Run

Amino Distributed OS - Runtime Manager
Apache License 2.0

[PR-1]Node level metrics[latency and data rate] measurement on Kernel Servers #794

Closed VenuReddy2103 closed 5 years ago

VenuReddy2103 commented 5 years ago

This PR measures the latency and data rate from each node to every other available node and reports these metrics to the OMS in the existing heartbeat between the kernel server and the OMS. Handling of the received metrics on the OMS is not part of this PR. This PR is the same as the old PR #742; it has just been raised from a new fork, and the old PR is being closed.

quinton-hoole commented 5 years ago

Thanks @VenuReddy2103. Let me know when this is ready for review.

quinton-hoole commented 5 years ago

Sorry @venureddy2103, I was busy with other things today. Will try to review tomorrow.

quinton-hoole commented 5 years ago

@VenuReddy2103 I tried to review this today, but it's essentially impossible. I can't work out the relationship between this and #742, which I've already reviewed extensively. And the issues mentioned in the previous review of #742 appear not to have been addressed yet.

Please self-review this, explain how you would like me to review it, and explicitly request review (with a written comment) when it is ready for my review. In the meantime I'm removing myself from the reviewers list.

quinton-hoole commented 5 years ago

@VenuReddy2103 Still no reply to https://github.com/amino-os/Amino.Run/pull/794#issuecomment-500074934 ?

VenuReddy2103 commented 5 years ago

The data rate calculation was not correct; have fixed it. Will test further to find and fix any remaining issues, and will notify when ready for review.

VenuReddy2103 commented 5 years ago

Still working on this PR. Progress so far -

  1. The heartbeat data length to be sent to each server is maintained in the existing KernelClient.KernelServerInfo.
  2. The same random byte array is used to send data to all the servers, but with the data length maintained for each particular server.
  3. The same initial heartbeat data length is set for all the servers. This length is increased by a step size each time data transfer rate < latency, until it reaches a length at which the data transfer rate is significant. Have reviewed and tested this with different link speeds between kernel servers.
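
The payload sizing described in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name, the constants and their values are hypothetical, and the condition "data transfer rate < latency" is interpreted here as the measured transfer time still being smaller than the measured latency.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of per-server heartbeat payload sizing (names and values are assumptions). */
final class HeartbeatPayloadSizer {
    static final int INITIAL_LENGTH = 1024; // assumed initial data length
    static final int STEP = 1024;           // assumed step size
    static final int MAX_LENGTH = 65536;    // assumed upper bound

    // One random buffer shared by all servers; each server sends only a prefix of it.
    private static final byte[] SHARED = new byte[MAX_LENGTH];
    static {
        new Random().nextBytes(SHARED);
    }

    // Data length currently used for this particular server.
    private int length = INITIAL_LENGTH;

    int length() {
        return length;
    }

    /** Grows the payload by one step while the transfer time is still below the latency. */
    void onSample(long latencyNanos, long transferNanos) {
        if (transferNanos < latencyNanos) {
            length = Math.min(length + STEP, MAX_LENGTH);
        }
    }

    /** Returns the prefix of the shared buffer to send to this server. */
    byte[] payload() {
        return Arrays.copyOf(SHARED, length);
    }
}
```

One sizer instance per remote kernel server would sit alongside the per-server state, so that slow and fast links settle on different lengths independently.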

Yet to do: The current heartbeat period is 1 second, and on each heartbeat we measure latency and data rate to all the servers available at that time. This process needs to be optimized.

VenuReddy2103 commented 5 years ago

The metrics measurement process is independent for each server, and the measurement frequency can also differ per server. The following mechanism is used:

  1. metricsTimer and metricPollPeriod are maintained per kernel server (in KernelClient.KernelServerInfo). metricPollPeriod initially starts at MIN_METRIC_POLL_PERIOD_MS.
  2. When the latency and data rate stay consistent for MIN_STABLE_DATA_RATE_TIMES samples, the interval between measurements (metricPollPeriod) is doubled. This continues until metricPollPeriod reaches MAX_METRIC_POLL_PERIOD_MS, which ensures that the interval between measurements does not exceed this time.
  3. When data transfer rate < latency is observed, metricPollPeriod is decreased back to MIN_METRIC_POLL_PERIOD_MS.
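
The three steps above amount to a per-server backoff on the measurement interval. The sketch below is illustrative only: the constant names come from the description, but their values, the class name, and the choice to restart the stable-sample count on an inconsistent sample are assumptions.

```java
/** Hypothetical sketch of the per-server metric poll period adjustment. */
final class MetricPollBackoff {
    static final long MIN_METRIC_POLL_PERIOD_MS = 1_000L;  // assumed value
    static final long MAX_METRIC_POLL_PERIOD_MS = 64_000L; // assumed value
    static final int MIN_STABLE_DATA_RATE_TIMES = 3;       // assumed value

    private long metricPollPeriod = MIN_METRIC_POLL_PERIOD_MS;
    private int stableSamples = 0;

    long metricPollPeriod() {
        return metricPollPeriod;
    }

    /**
     * @param consistent true if latency and data rate match the previous samples
     * @param transferBelowLatency true if "data transfer rate < latency" was observed
     */
    void onSample(boolean consistent, boolean transferBelowLatency) {
        if (transferBelowLatency) {
            // Step 3: fall back to the shortest interval.
            metricPollPeriod = MIN_METRIC_POLL_PERIOD_MS;
            stableSamples = 0;
            return;
        }
        if (!consistent) {
            // Assumption: an inconsistent sample restarts the stable count.
            stableSamples = 0;
            return;
        }
        stableSamples++;
        if (stableSamples >= MIN_STABLE_DATA_RATE_TIMES) {
            // Step 2: double the interval, capped at the maximum.
            metricPollPeriod = Math.min(metricPollPeriod * 2, MAX_METRIC_POLL_PERIOD_MS);
            stableSamples = 0;
        }
    }
}
```

After each measurement, the per-server timer would be rescheduled with the value returned by metricPollPeriod().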

Vishwa4jeet commented 5 years ago

Link speed between the two systems used for testing: 1000 Mbps.

KS1 (192.168.59.2) running on system1 along with the OMS. Time taken for data rate to stabilize from KS1 to KS2: 101 seconds. Final data length used in heartbeats: 65536.

KS2 (192.168.59.4) running on system2. Time taken for data rate to stabilize from KS2 to KS1: 63 seconds. Final data length used in heartbeats: 32768.

Data rate unit is in Bytes/Sec. Latency is in nanoseconds (ns).
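
For reading the logs, the unit conversions involved can be sketched as below. This is only an illustration of the arithmetic, assuming the data length is in bytes and ignoring protocol overhead; the class and method names are hypothetical and this is not necessarily how the PR computes the data rate.

```java
/** Hypothetical helper showing the unit conversions (bytes, nanoseconds, bits per second). */
final class LinkMath {
    /** Data rate in bytes per second for a payload sent in the given number of nanoseconds. */
    static double bytesPerSecond(long bytes, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return bytes * 1_000_000_000.0 / elapsedNanos;
    }

    /** Ideal time in nanoseconds to push the payload over a link of the given speed. */
    static long idealTransferNanos(long bytes, long linkBitsPerSecond) {
        return bytes * 8L * 1_000_000_000L / linkBitsPerSecond;
    }
}
```

Under these assumptions, LinkMath.idealTransferNanos(65536, 1_000_000_000L) gives 524288 ns, so a 65536-byte payload takes roughly half a millisecond on an otherwise idle 1000 Mbps link.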

PFA the test logs below: ks-192.168.59.2.log ks-192.168.59.4.log

VenuReddy2103 commented 5 years ago

Have fixed the issue of the app client not exiting. The app client creates a dummy local kernel server, which is meant to route RPC calls to the remote kernel server where the MicroService it interacts with resides. That local kernel server is not registered with the OMS and does not send heartbeats to the OMS, but we were measuring node metrics from it to all the remote kernel servers. Such app clients do not allow deployment of MicroServices on them (and also do not participate in automatic migration of MicroServices). Hence, they shouldn't measure node metrics.

Fundmover app logs: fundmover-app.log fundmover-ks1.log fundmover-ks2.log fundmover-oms.log

HanksTodo app logs: Have 4 Kernel servers with 2 servers in each region. hankstodo-app.log hankstodo-ks1.log hankstodo-ks2.log hankstodo-ks3.log hankstodo-ks4.log hankstodo-oms.log

KVStore app logs: Have 4 Kernel servers with 2 servers in each region. kvstore-app.log kvstore-ks1.log kvstore-ks2.log kvstore-ks3.log kvstore-ks4.log kvstore-oms.log

quinton-hoole commented 5 years ago

OK, to avoid further delays I'm going to merge this PR, and make the proposed improvements in followup PRs.