autonomys / substation

Polkadot Telemetry service
GNU General Public License v3.0

Add possibility to have several connections to core from shard #42

Closed i1i1 closed 1 year ago

i1i1 commented 1 year ago

The only thing that can be a processing bottleneck in a shard is sending data through the shard's message aggregator. This PR wraps several ws connections (each of which is now an `AggregatorInternal`) in an `Aggregator` structure.
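As a rough illustration of the idea (not the code in this PR), the wrapper can be sketched with plain channels standing in for the ws connections. Only the names `Aggregator` and `AggregatorInternal` come from the description above; the `conn_id` routing key, the `new`/`send` signatures, and the choice to pin each node connection to one link so its messages stay ordered are all assumptions made for this sketch.

```rust
use std::sync::mpsc::{channel, Receiver, SendError, Sender};

/// Hypothetical stand-in for one shard-to-core link.
/// A real shard would own a WebSocket here; this sketch uses a channel.
pub struct AggregatorInternal<T> {
    tx: Sender<T>,
}

impl<T> AggregatorInternal<T> {
    pub fn send(&self, msg: T) -> Result<(), SendError<T>> {
        self.tx.send(msg)
    }
}

/// Wraps several links and routes each message to one of them.
pub struct Aggregator<T> {
    conns: Vec<AggregatorInternal<T>>,
}

impl<T> Aggregator<T> {
    /// Creates `n` links and returns the receiving ends,
    /// which play the role of the core in this sketch.
    pub fn new(n: usize) -> (Self, Vec<Receiver<T>>) {
        assert!(n > 0, "need at least one connection");
        let mut conns = Vec::with_capacity(n);
        let mut receivers = Vec::with_capacity(n);
        for _ in 0..n {
            let (tx, rx) = channel();
            conns.push(AggregatorInternal { tx });
            receivers.push(rx);
        }
        (Self { conns }, receivers)
    }

    /// Routes by `conn_id` (assumed: an id of the node connection),
    /// so all messages from one node travel over the same link in order.
    pub fn send(&self, conn_id: u64, msg: T) -> Result<(), SendError<T>> {
        let idx = (conn_id % self.conns.len() as u64) as usize;
        self.conns[idx].send(msg)
    }
}
```

With two links, messages for `conn_id` 0 and 2 land on the first link and `conn_id` 1 on the second, so the per-link work is split while per-node ordering is kept.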

i1i1 commented 1 year ago

> This is an interesting hack, but I don't think it is worth it; we could just as well run multiple instances with no code changes.

Let's try that and see if it would help. We deployed 2 local shards and that didn't help much (apart from memory issues). If this change doesn't help, we can just revert it.

Currently, the shard spawns about 2 tasks for each ws connection, while the aggregator has only 1 task. So my guess is that the aggregator task is not scheduled fairly and that is why we leak memory, but that might not be the real cause.
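To make the hypothesis concrete: if many producer tasks feed one slow consumer through an unbounded queue, the backlog (and memory) can grow without limit. A bounded queue makes that pressure visible instead. The sketch below only demonstrates the mechanism with the standard library; `bounded_queue` and `offer_all` are hypothetical helpers, and nothing here is taken from the shard's actual code.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Creates a queue that holds at most `capacity` pending messages.
pub fn bounded_queue(capacity: usize) -> (SyncSender<u32>, Receiver<u32>) {
    sync_channel(capacity)
}

/// Offers `count` messages without blocking and returns how many
/// were rejected because the consumer had not drained the queue.
pub fn offer_all(tx: &SyncSender<u32>, count: u32) -> u32 {
    let mut rejected = 0;
    for i in 0..count {
        if let Err(TrySendError::Full(_)) = tx.try_send(i) {
            rejected += 1;
        }
    }
    rejected
}
```

A non-zero rejection count under load would point to the single aggregator task falling behind its producers, which is the scenario suspected above.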

nazar-pc commented 1 year ago

> We deployed 2 local shards and that didn't help much (apart from memory issues).

Didn't help with what? I thought that memory usage was the last issue we had. There were some Nginx errors that I tweaked by replacing localhost with 127.0.0.1.

i1i1 commented 1 year ago

> Didn't help with what? I thought that memory usage was the last issue we had. There were some Nginx errors that I tweaked by replacing localhost with 127.0.0.1.

I assumed the actual node count was larger than 15k, so I thought the issue was with the shard deployment.

nazar-pc commented 1 year ago

I have not seen any errors sending data to telemetry on my side and no errors in the logs, so I assume that is not the case. We have 3 shards right now, so if there were an issue, I think it would show up in the logs somewhere, unless it occurs before reaching the shard, in which case this PR will make no difference either.

i1i1 commented 1 year ago

Okay, so can I close #38 then?

nazar-pc commented 1 year ago

Well, I think we can close it and open an upstream issue to remove the bottleneck there in whichever way they prefer. You can also link this PR as one example of what can be done.

nazar-pc commented 1 year ago

Long-term we need a completely different telemetry implementation anyway.