criyle / go-judge

Sandbox Server in REST / gRPC API. Based on Linux container technologies.
MIT License

Req: Documentation request for scaling in production #95

Open maradanasai opened 9 months ago

maradanasai commented 9 months ago

Hi, this is amazing and I like it a lot. Can someone provide an architecture overview or documentation on how to use this in production at scale?

undefined-moe commented 9 months ago

As those sandbox runners interact with the local filesystem (the file cache is not shared across multiple instances), I would suggest running a controller alongside a single sandbox daemon on each machine, with those controllers then connecting to a master that handles task distribution.

maradanasai commented 9 months ago

Hi @undefined-moe, thanks for getting back. Can you please elaborate on a couple of options at a lower level, with more detail?

If I run multiple go-judge instances as runners across multiple VMs in the cloud and distribute the incoming submission traffic (using a message queue or load balancer), how do I deal with response events?

I would like to provide real-time updates about test-case execution to the user who submitted the program. Can you please share your thoughts on this in detail?

undefined-moe commented 9 months ago

There are two ways of splitting tasks:

  1. by judge task
  2. by testcase

For the first way, a single judge task is pinned to one machine, so there is less cost from transferring compiled binaries across VMs. (For the second way, individual test cases can be dispatched to any machine, but the compiled binary then has to be moved or shared across VMs, which is more expensive.) Pseudocode for the first way below:

// Each machine keeps one WebSocket to the master and streams results back.
const ws = new WebSocket(masterAddr);
ws.onmessage = async (msg) => {
  const task = parseTask(msg.data);
  const compileResult = await compile(task);   // compile once on this machine
  ws.send(JSON.stringify(compileResult));      // report the compile result first
  await Promise.all(task.testcases.map(async (testcase) => {
    const result = await runProgram(testcase); // test cases run in parallel
    ws.send(JSON.stringify(result));           // stream each result as it finishes
  }));
};

You can also check the detailed implementation here:

P.S. You have to run a controller client on each machine, responsible only for controlling the go-judge daemon on that machine (e.g. managing task state, downloading test data, etc.).

criyle commented 9 months ago

It was not designed to be used behind a load balancer, since its local cache makes it stateful. Because transmitting files is a rather expensive operation, it is recommended to deploy this as a sidecar alongside your controller application, which splits a request into multiple subsequent sandbox calls.
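
A rough sketch of such a controller in Go: the request and response shapes follow the REST API examples in the README, but the compiler path, limits, source code, and test inputs are illustrative assumptions, and a real controller would receive tasks and push per-testcase results over its own channel (e.g. a WebSocket) rather than printing them.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

const sandbox = "http://localhost:5050" // go-judge's default HTTP address

type result struct {
	Status  string            `json:"status"`
	Files   map[string]string `json:"files"`
	FileIDs map[string]string `json:"fileIds"`
}

// run posts a single-cmd request to POST /run and returns the first result.
func run(cmd map[string]any) (result, error) {
	body, _ := json.Marshal(map[string]any{"cmd": []any{cmd}})
	resp, err := http.Post(sandbox+"/run", "application/json", bytes.NewReader(body))
	if err != nil {
		return result{}, err
	}
	defer resp.Body.Close()
	var rs []result
	if err := json.NewDecoder(resp.Body).Decode(&rs); err != nil || len(rs) == 0 {
		return result{}, fmt.Errorf("bad response: %v", err)
	}
	return rs[0], nil
}

// stdio builds the files section: stdin content plus collected stdout/stderr.
func stdio(stdin string) []any {
	return []any{
		map[string]any{"content": stdin},
		map[string]any{"name": "stdout", "max": 10240},
		map[string]any{"name": "stderr", "max": 10240},
	}
}

func main() {
	env := []string{"PATH=/usr/bin:/bin"}

	// 1. Compile once; copyOutCached keeps the binary in this instance's
	//    local file cache, so later runs need no re-upload.
	cres, err := run(map[string]any{
		"args": []string{"/usr/bin/g++", "a.cc", "-o", "a"},
		"env":  env, "files": stdio(""),
		"cpuLimit": int64(10_000_000_000), "memoryLimit": 256 << 20, "procLimit": 50,
		"copyIn":        map[string]any{"a.cc": map[string]any{"content": "#include<iostream>\nint main(){int a,b;std::cin>>a>>b;std::cout<<a+b;}"}},
		"copyOut":       []string{"stdout", "stderr"},
		"copyOutCached": []string{"a"},
	})
	if err != nil || cres.Status != "Accepted" {
		fmt.Println("compile failed:", err, cres.Files["stderr"])
		return
	}
	binID := cres.FileIDs["a"]

	// 2. One sandbox call per test case, reusing the cached binary by fileId.
	for _, input := range []string{"1 1", "2 3"} {
		rres, err := run(map[string]any{
			"args": []string{"a"}, "env": env, "files": stdio(input),
			"cpuLimit": int64(10_000_000_000), "memoryLimit": 256 << 20, "procLimit": 50,
			"copyIn":  map[string]any{"a": map[string]any{"fileId": binID}},
			"copyOut": []string{"stdout", "stderr"},
		})
		fmt.Println(input, "->", rres.Status, rres.Files["stdout"], err)
	}

	// 3. Release the cached binary once the task is done.
	req, _ := http.NewRequest(http.MethodDelete, sandbox+"/file/"+binID, nil)
	http.DefaultClient.Do(req)
}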

If you insist on a load balancer and can bear the cost of transmitting files over the network, I would recommend mounting a shared file system (e.g. NFS) on all of your hosts and using -dir to point the cache directory at your mount point, in order to share state across multiple hosts.

Alternatively, you may implement the FileStore interface with your own scalable backend (e.g. S3), but keep in mind that managing separate infrastructure or using cloud services comes at a cost.
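
For illustration, a sketch of what an S3-backed store might look like. The interface shape here is an assumption (check the filestore package for the real signatures before implementing one), and S3Client is a hypothetical thin wrapper around whatever SDK you use, not part of go-judge:

package filestore

import (
	"crypto/rand"
	"encoding/hex"
)

// FileStore is an assumed shape of go-judge's cache abstraction; the real
// interface in the filestore package may differ.
type FileStore interface {
	Add(name, path string) (id string, err error) // store a local file under a new id
	Remove(id string) bool
	List() map[string]string // id -> original name
}

// S3Client is a hypothetical wrapper around your S3 SDK of choice; Keys
// could recover names from object metadata kept by Upload.
type S3Client interface {
	Upload(key, path string) error
	Delete(key string) error
	Keys() map[string]string
}

// s3Store backs the file cache with a shared S3 bucket, so any host behind
// the load balancer can resolve a fileId produced by another host.
type s3Store struct{ c S3Client }

func NewS3Store(c S3Client) FileStore { return &s3Store{c} }

func (s *s3Store) Add(name, path string) (string, error) {
	buf := make([]byte, 16)
	rand.Read(buf)
	id := hex.EncodeToString(buf) // random id; the real key scheme is up to you
	return id, s.c.Upload(id, path)
}

func (s *s3Store) Remove(id string) bool { return s.c.Delete(id) == nil }

func (s *s3Store) List() map[string]string { return s.c.Keys() }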

maradanasai commented 9 months ago

Hi @criyle, thanks for sharing. Do you have a controller implemented for this? Can you please help by providing low-level details and the data flow for using this in production at scale?

criyle commented 9 months ago

You may check out the demo implementation, which shows how a judger is deployed with the sandbox: it receives the OJ task and issues compile and run calls to the sandbox. In production environments like Kubernetes, you can describe this combination as a pod and scale at the pod level.
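
A minimal sketch of such a pod as a Deployment, assuming the criyle/go-judge image from the README; my-judger and SANDBOX_ADDR are placeholders for your own controller image and configuration, and resource tuning is omitted:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: judger
spec:
  replicas: 3                  # scale the judger + sandbox pair at the pod level
  selector:
    matchLabels:
      app: judger
  template:
    metadata:
      labels:
        app: judger
    spec:
      containers:
        - name: judger                 # your controller: takes OJ tasks, calls the sandbox
          image: my-judger:latest      # placeholder for your controller image
          env:
            - name: SANDBOX_ADDR       # hypothetical variable your controller reads
              value: http://localhost:5050
        - name: go-judge               # sandbox sidecar, shares localhost with the judger
          image: criyle/go-judge
          securityContext:
            privileged: true           # the sandbox needs privileges to create containers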