Vlad-Shcherbina / icfpc2018-tbd

Cluster management #30

Open Vlad-Shcherbina opened 6 years ago

Vlad-Shcherbina commented 6 years ago

We had a natural language interface backed by Yorick and Yegor. You just say in the #ops channel what you want to run and how many instances, and it gets run. There were about 5 high-end machines in the cluster.

It's unlikely this approach would work for, say, 10 or more machines.

Really no idea what should be done here.

One possibility is to set up some tool that allows any team member to automatically launch, monitor and cancel jobs. Sounds extremely complicated. And it still requires somebody to babysit the machines and resolve unexpected problems. The tool could be either an existing one (hard to master, full of defects, and with tons of unneeded functionality) or a custom-made one (hard to master, full of defects, and lacking essential functionality).
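To make the custom-made option a bit more concrete, here is a rough sketch of the bare minimum such a tool would have to do: launch, monitor and cancel. It is only an illustration, not something we had. The `JobRunner` class and its method names are made up, and it only handles processes on the local machine; reaching other machines, sharing state between team members, logs and failure handling are all left out, and those are the hard parts.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class JobRunner:
    """Hypothetical single-machine job registry: launch, monitor, cancel."""

    jobs: dict = field(default_factory=dict)  # job id -> subprocess.Popen
    next_id: int = 1

    def launch(self, argv):
        """Start a command (list of arguments) and return its job id."""
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = subprocess.Popen(argv)
        return job_id

    def status(self, job_id):
        """Return "running" or "exited(<return code>)" for a known job."""
        code = self.jobs[job_id].poll()
        return "running" if code is None else f"exited({code})"

    def cancel(self, job_id):
        """Ask the job to terminate and wait until it is gone."""
        proc = self.jobs[job_id]
        proc.terminate()
        proc.wait(timeout=10)


if __name__ == "__main__":
    runner = JobRunner()
    # A stand-in for a long-running solver: sleep for a while.
    job = runner.launch([sys.executable, "-c", "import time; time.sleep(30)"])
    print(job, runner.status(job))
    runner.cancel(job)
    print(job, runner.status(job))
```

Even this toy version already raises the questions from above: who restarts a job when a machine dies, and who notices that it died.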

What do existing solutions look like?

Another alternative is to have a natural language interface backed by an operator who uses real cluster management software instead of lots of terminals opened into each machine. The problem here is that the whole thing would depend on a person with very specialized skills being available for the duration of the contest.


When considering possible approaches, keep in mind that it's likely, but not a given, that we'll need lots of computing power next year. The nature of the problem is always different. So if your solution requires costly preparations, allow for the chance that they will be in vain.

earthdok commented 6 years ago

There was an #ops channel?

Vlad-Shcherbina commented 6 years ago

Oops. It did not occur to me that it was not easily discoverable because I was invited into it from the start. We really need to sort out this mess.

earthdok commented 6 years ago

Had there been a dedicated channel for announcing new features, like fj suggests, it could have been announced there.

Now to make sure everyone is invited to THAT channel...

earthdok commented 6 years ago

So wait, that time I was frantically running pillar_solver on my macbook 10 minutes before the lightning round deadline. Was it before or after we got a cluster?

Vlad-Shcherbina commented 6 years ago

I don't think the machines were online for the end of the lightning round; they were rented only for Sunday and Monday.

But in general, yeah, it would be prudent to have some machines on standby a few hours before the lightning deadline, when the pieces begin to come together for the first time.