yquansah closed this issue 3 months ago
Hey @yquansah, thanks for raising this
I can't think of anything off the top of my head, but I'll play around with this later today to try and reproduce
I will try and dig into the gossip protocol tomorrow
I'd be surprised if it was caused by gossip, which only gossips information about which upstreams are connected, but won't (or at least shouldn't) close connections
@andydunstall Yeah nice. I am also going to play around with it later this afternoon. Thanks!
Hey @yquansah, thanks for the above example, I'm able to reproduce
So when you run with multiple nodes, if you connect to node N1 but the upstream is connected to N2, then N1 forwards all traffic to N2
Since Piko forward connects to the server using WebSockets (to wrap the underlying TCP connection and work with an HTTP load balancer), if N1 receives a WebSocket connection that it needs to forward to N2, it falls back to using the HTTP proxy to forward the WebSocket to N2
However, by default the HTTP proxy has a timeout of 30 seconds (--proxy.timeout), so N1 closes the connection to N2 after 30 seconds and the end-to-end connection is dropped. The quick fix is to set --proxy.timeout 0 to disable the timeout, and I'll think about a better way to handle this tomorrow (the timeout should only apply to HTTP requests, not hijacked WebSocket connections, as the same issue applies when using the HTTP proxy as normal with a WebSocket client)
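For reference, a sketch of how that could look if you're using a server.yaml config file rather than passing the flag directly (assuming the flag maps to a proxy.timeout key, so treat this as illustrative only):

# Illustrative server.yaml snippet, equivalent to passing --proxy.timeout 0
# on the command line (the proxy.timeout key name is assumed from the flag).
proxy:
  # 0 disables the proxy timeout, so connections forwarded between
  # nodes are no longer closed after the default 30 seconds.
  timeout: 0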
@andydunstall Really great information here, I appreciate the sleuthing, and it definitely makes sense. We'll go with the proxy timeout for now as you suggested for a quick fix.
hey, we gave this a whirl and are getting this error when the container starts:
config: proxy: missing timeout
here is the yaml for the deployment:
template:
  metadata:
    annotations:
      kubectl.kubernetes.io/restartedAt: "2024-07-30T13:05:09-07:00"
    creationTimestamp: null
    labels:
      app: piko
  spec:
    containers:
    - args:
      - server
      - --config.path
      - /config/server.yaml
      - --config.expand-env
      - --proxy.timeout
      - "0"
perhaps I'm doing something wrong here?
config: proxy: missing timeout
Ah sorry, I forgot it's required, I'll add a patch now
A quick workaround is to just set a very large timeout (such as --proxy.timeout=24h)
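For the deployment args above, that would look something like this until the patch lands (illustrative, based on the manifest posted earlier):

containers:
- args:
  - server
  - --config.path
  - /config/server.yaml
  - --config.expand-env
  - --proxy.timeout
  - "24h"  # effectively no timeout; avoids the "missing timeout" validation error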
(Sorry I haven't had much time to fix the underlying issue, I'll try and get to it soon)
I've merged https://github.com/andydunstall/piko/pull/151 to allow a timeout of 0, so if you re-build main it should be ok
@andydunstall Quick one. Thank you!
awesome! could I trouble you for a tagged image @andydunstall? right now we are pulling piko:v0.6.0, but would normally not want to pull :latest in our architecture.
yep sure, will tag v0.6.1
@condaatje That's done: ghcr.io/andydunstall/piko:v0.6.1
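For the deployment above, that just means pinning the image tag rather than :latest, e.g. (illustrative; the container name is a placeholder, as it isn't shown in the manifest earlier):

containers:
- name: piko  # placeholder name
  image: ghcr.io/andydunstall/piko:v0.6.1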
awesome thanks!
by the way - if you'd like to join our alpha (launching today) we'd be honored to have you! especially because we've absolutely loved working with piko so far: https://x.com/hyperbolic_labs/status/1823779096650015026
Thanks! I don't have anything I could use a GPU for, but the product looks great!
I've merged https://github.com/andydunstall/piko/pull/155, which only applies --proxy.timeout to non-WebSocket (and non-TCP) connections, so you shouldn't need to set --proxy.timeout=0 anymore (sorry it took a while for me to get to it)
So I'll close this for now. Let me know if there's anything else I can help with!
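Once you're running a build that includes #155 (for example a tag cut after it), the override can be dropped from the server args again, something like (illustrative):

containers:
- args:
  - server
  - --config.path
  - /config/server.yaml
  - --config.expand-env
  # --proxy.timeout no longer needs to be overridden here; the timeout
  # now only applies to plain HTTP requests, not WebSocket/TCP forwarding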
appreciated!
and this is great - could I trouble you for a v0.6.2 tag?
@condaatje sure tagged
thanks!
It seems as though when I run piko as a StatefulSet using tcp, I am running into random connection closures (from the agent?). These are the logs I see from the agent:

One interesting thing I noticed is that the connection closures seem to happen at a regular interval from the time a connection is opened (30s). Interestingly enough, this does not happen when I just run 1 replica of the server, only when I run more than 1 using the gossip protocol.
Here is a repro config, running the workload on kubernetes (minikube cluster)...
I will try and dig into the gossip protocol tomorrow, but just wanted to raise this issue in case there were any quick hints from your end @andydunstall.

Note: The image names would need to be changed from the above config. The images there were just from my local Docker build.