Closed by i1i1 2 years ago
This is an interesting hack, but I don't think it is worth it; we could just as easily run multiple instances with no code changes.
Let's try that and see if it helps. We deployed 2 local shards and that didn't help much (apart from memory issues). If this change doesn't help, we can just revert it.
Currently, the shard spawns roughly 2 tasks for each ws connection, while the aggregator has only 1 task. So I believe the work is not fairly scheduled and that is why we leak memory, but that might not be the real cause.
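Roughly the pattern I mean, as a minimal sketch (not the actual telemetry code; the task counts, names, and channel type here are made up for illustration): every connection gets its own task, and they all feed one aggregator task, so if that single task falls behind, its queue just keeps growing.

```rust
// Sketch only: many per-connection producer tasks vs. one aggregator task.
use std::time::Duration;
use tokio::sync::mpsc;

// Placeholder message type: (connection id, payload).
struct NodeMessage(u64, String);

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<NodeMessage>();

    // One task per ws connection (the shard reportedly spawns ~2 per connection).
    for conn_id in 0..10_000u64 {
        let tx = tx.clone();
        tokio::spawn(async move {
            loop {
                // Stand-in for reading a frame from the ws connection.
                let payload = format!("update from node {conn_id}");
                if tx.send(NodeMessage(conn_id, payload)).is_err() {
                    break; // aggregator task is gone
                }
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        });
    }
    drop(tx);

    // The single aggregator task competing with thousands of producers:
    // if this loop is scheduled too rarely, the unbounded queue grows.
    while let Some(NodeMessage(id, payload)) = rx.recv().await {
        // Stand-in for the real aggregation work.
        let _ = (id, payload);
    }
}
```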
> We deployed 2 local shards and that didn't help much (apart from memory issues).
Didn't help with what? I thought that memory usage was the last issue we had. There were some Nginx errors that I tweaked by replacing `localhost` with `127.0.0.1`.
> Didn't help with what? I thought that memory usage was the last issue we had. There were some Nginx errors that I tweaked by replacing `localhost` with `127.0.0.1`.
I just thought that the node count is actually larger than 15k, so I assumed the issue was with the shard deployment.
I have not seen any errors sending data to telemetry on my side, and there are no errors in the logs, so I assume that is not the case. We have 3 shards right now, so if there were an issue, it should have shown up in the logs somewhere, unless it happens before the data reaches a shard, in which case this PR would make no difference either.
Okay, so can I close #38 then?
Well, I think we can close it and open an upstream issue to remove the bottleneck there in whichever way they prefer. You can also link this PR as an example of what can be done.
Long-term we need a completely different telemetry implementation anyway.
The only thing that can be a processing bottleneck in a shard is sending data via the shard's message aggregator. This PR wraps several ws connections (which are now `AggregatorInternal`s) in an `Aggregator` structure.
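For illustration, a rough sketch of that shape, assuming hypothetical names, a plain mpsc channel per internal aggregator, and modulo routing (the actual types and routing in the PR differ): `Aggregator` owns several `AggregatorInternal` workers, each with its own task, and picks one per connection instead of funnelling everything through a single task.

```rust
// Sketch only: an Aggregator fanning out over several AggregatorInternal workers.
use tokio::sync::mpsc;

struct AggregatorInternal {
    tx: mpsc::UnboundedSender<String>, // placeholder message type
}

impl AggregatorInternal {
    fn spawn() -> Self {
        let (tx, mut rx) = mpsc::unbounded_channel::<String>();
        tokio::spawn(async move {
            while let Some(msg) = rx.recv().await {
                // Stand-in for the per-worker aggregation / forwarding work.
                let _ = msg;
            }
        });
        Self { tx }
    }
}

struct Aggregator {
    internals: Vec<AggregatorInternal>,
}

impl Aggregator {
    fn new(n: usize) -> Self {
        Self {
            internals: (0..n).map(|_| AggregatorInternal::spawn()).collect(),
        }
    }

    /// Route a message from a given ws connection to one of the internal
    /// aggregators (simple modulo here; the PR may choose differently).
    fn send(&self, conn_id: u64, msg: String) {
        let idx = (conn_id as usize) % self.internals.len();
        let _ = self.internals[idx].tx.send(msg);
    }
}

#[tokio::main]
async fn main() {
    let agg = Aggregator::new(4);
    agg.send(42, "node update".to_string());
}
```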