Solution DB and solver worker

Vlad-Shcherbina commented 6 years ago

Maybe I'm biased, but I think overall it performed pretty well. Shared database architecture for the win!

By some amazing serendipity it was exactly the thing I was thinking of and prototyping before the contest. It still took a bit longer to get it working than expected.

The task distribution mechanism was simple and robust, but it allowed occasional collisions that resulted in duplicate work. Initially I thought this would be completely insignificant, but it turns out that if some tasks are larger than others, they collide much more often. I don't think too many resources were wasted on collisions (low tens of CPU-hours max), but it would be neater to have some task reservation mechanism. That adds complications though.

Another limitation: there was only distribution over problems, but not over solvers. It would be nice if one could run solver_worker <solver1> <solver2> ...

solver_runner (the tool for locally testing implementations of the solver interface) was a late and awkward addition. Perhaps it would be better to provide it first, even before the rest of the DB infrastructure is set up.

It seems the database was nowhere near its performance limits, even though I was running the smallest possible instance (0.2 vCPU, 0.60 GB RAM, 10 GB HDD, not even SSD). When we ran into the open connections limit (20), it was easy to mitigate by upgrading the instance. Currently all tables occupy ~1 GB.

The dashboard was a bit slow. That's because we had a lot of traces (18K for full problems), all displayed on one page. I don't think client-side rendering is the answer. I think hardcoded pagination is the answer: replace the "full problems" page with three pages "FD", "FA", "FR"; if that's not enough, replace "FD" with "FD (R < 100)" and "FD (R >= 100)"; if that's still not enough, start thinking about the real solution.
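A rough sketch of the hardcoded split, to make the idea concrete. The function names are hypothetical, and it assumes that a problem name starts with its two-letter type prefix (e.g. "FD042") and that each trace carries the problem's R as an integer; neither is taken from the actual dashboard code.

```python
def page_for(problem_name, r, split_fd_at=None):
    """Return the title of the dashboard page a trace belongs to."""
    prefix = problem_name[:2]
    if prefix not in ("FD", "FA", "FR"):
        raise ValueError(f"not a full problem: {problem_name!r}")
    # Second-level split, applied only when the plain "FD" page is too big.
    if prefix == "FD" and split_fd_at is not None:
        if r < split_fd_at:
            return f"FD (R < {split_fd_at})"
        return f"FD (R >= {split_fd_at})"
    return prefix


def group_by_page(traces, split_fd_at=None):
    """Group (problem_name, r) pairs into a dict keyed by page title."""
    pages = {}
    for problem_name, r in traces:
        title = page_for(problem_name, r, split_fd_at)
        pages.setdefault(title, []).append(problem_name)
    return pages
```

Calling `group_by_page(traces)` gives the three-page layout, and `group_by_page(traces, split_fd_at=100)` gives the four-page one.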

It would be nice to have database backups (and a restore procedure tested in advance). Unfortunately, I didn't bother with any.

Vlad-Shcherbina commented 6 years ago

Another missing feature: after fixing a bug, rerunning the solver only on the problem instances where it failed before.

No idea how to implement it in a principled way.

fj128 commented 6 years ago

With a command line flag to the runner?

Vlad-Shcherbina commented 6 years ago

I mean, does this feature even make sense? If you fixed a bug, the behavior on other problems could change, so shouldn't you just rerun on all problems? If you want to make sure that the failure does not recur in a specific case, maybe it's better to add a way to specify concrete problem ids in the solver_runner.
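Something along these lines, perhaps. This is only a sketch: the `--problems` flag, the positional `solver` argument, and both helper functions are hypothetical and do not reflect the actual solver_runner command line.

```python
import argparse


def parse_args(argv):
    # Hypothetical CLI: solver_runner <solver> [--problems ID [ID ...]]
    parser = argparse.ArgumentParser(prog="solver_runner")
    parser.add_argument("solver", help="name of the solver to run")
    parser.add_argument(
        "--problems", nargs="+", default=None, metavar="ID",
        help="run only these problem ids (default: all problems)",
    )
    return parser.parse_args(argv)


def select_problems(all_problems, requested):
    """Return the problems to run, keeping the order of all_problems."""
    if requested is None:
        return list(all_problems)
    wanted = set(requested)
    unknown = sorted(wanted - set(all_problems))
    if unknown:
        raise ValueError(f"unknown problem ids: {unknown}")
    return [p for p in all_problems if p in wanted]
```

With this, checking that a fixed failure stays fixed is a matter of passing the ids that failed before, while omitting the flag still reruns everything.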

kevroletin commented 6 years ago

One more issue I want to mention in the context of the worker.

When I started working on the solver, I used the pyjs emulator to run the first 50 (small) problems and quickly see execution statuses (e.g. Pass/Fail). Later I started using the runner script, which did almost the same thing but executed big problems first. It would be nice to have the ability to choose arbitrary problems or to ask the runner to run problems starting from the smallest one.
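The ordering part could be as small as the sketch below. It assumes the runner can get each problem as a `(name, r)` pair, where `r` is the problem size; the function name and that representation are made up for the example.

```python
def smallest_first(problems, limit=None):
    """Order (name, r) pairs by size, then by name; optionally keep the first few."""
    ordered = sorted(problems, key=lambda p: (p[1], p[0]))
    # A slice with limit=None keeps the whole list.
    return ordered[:limit]
```

For example, `smallest_first(problems, limit=50)` would give a quick Pass/Fail check on the 50 smallest problems.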

Vlad-Shcherbina commented 6 years ago

Yes, that's the problem. I was primarily thinking about how to run many mostly finished solvers in a distributed way, and not how to make the development process convenient.

Another QoL feature would be to run the work-in-progress solver locally, without polluting the DB, but at the same time somehow compare these results to the previous recorded runs of this solver.
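A sketch of what the comparison itself might look like once both result sets are in memory. It assumes each run is summarized as a dict from problem name to a numeric score where lower is better, with `None` meaning the solver failed; that representation and the function name are assumptions, not the DB schema.

```python
def compare_runs(local, recorded):
    """Compare local results to recorded ones; return {problem: verdict}."""
    report = {}
    for problem, new in sorted(local.items()):
        if problem not in recorded:
            report[problem] = "no previous run"
            continue
        old = recorded[problem]
        if new is None:
            # None means the solver failed on this problem.
            report[problem] = "still failing" if old is None else "regressed: now fails"
        elif old is None:
            report[problem] = "fixed"
        elif new < old:
            report[problem] = "improved"
        elif new > old:
            report[problem] = "worse"
        else:
            report[problem] = "unchanged"
    return report
```

Since the recorded side is only read, nothing from the local run has to be written back to the DB.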
