OPT Speed Up Seeding - Githubissues

PIG208 commented 2 years ago

Every time we run the tests for the API server, we re-populate the database using api/anubis/rpc/seed.py. This takes around 100s on my machine (Intel i7-10750H), which is kind of slow. Writing a lot of data into the db is apparently an I/O-intensive task. The benchmark below shows that we spend most of the time initializing the submissions. This is probably tolerable when we only want to make a local deployment, but it will get quite annoying if we expand the test cases. (Another motive is to make tests that rely on a fresh database viable.)

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
4560    0.189    0.000   46.381    0.010 submissions.py:308(init_submission)
...
5       0.078    0.016   92.659   18.532 seed.py:202(init_submissions)
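For anyone who wants to reproduce a table like the one above: the columns match the output of Python's standard cProfile/pstats modules. A minimal sketch follows; `profile_call` is a hypothetical helper name, and how the seed function is actually invoked in the project is not shown here.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return the stats report as a string.

    The report is sorted by cumulative time, so the slowest call chains
    (e.g. a seeding function and everything it calls) appear first.
    """
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Passing the seeding entry point as `func` and printing the returned string should give a report with the same `ncalls / tottime / cumtime` columns.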

Unless we can make seeding optional (or partially optional) and only reinitialize the part of it that is necessary for the test, we need to think of a way to speed up this script (namely the init_submissions function). We can:

Skip sqlalchemy and populate the data with some clever lines of SQL;
Or generate a dump and reset the db whenever it is needed, re-generating it if any changes are made to the model or the test setup (of course it will be gitignored, but we will need to figure out a way to detect these changes);
Split the seeding process into multiple stages;
Reduce the number of test users (pretty straightforward and effective).
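One possible way to detect changes for the dump option is to hash the files that define the models and the seed, and store that hash next to the dump. This is only a sketch under assumptions: `fingerprint`, `dump_is_stale`, `record_dump` and the stamp file are hypothetical names, and which files count as "the model or the test setup" is left to the caller.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return a SHA-256 hex digest over the names and contents of the files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def dump_is_stale(source_paths, stamp_file):
    """True if no stamp exists yet or the sources changed since the last dump."""
    stamp = Path(stamp_file)
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != fingerprint(source_paths)


def record_dump(source_paths, stamp_file):
    """Store the current fingerprint after a fresh dump has been generated."""
    Path(stamp_file).write_text(fingerprint(source_paths))
```

The test setup would call `dump_is_stale(...)` first: if it returns True, run the slow seed, write a new dump, and call `record_dump(...)`; otherwise just restore the existing dump.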

wabscale commented 2 years ago

That is interesting that it takes so long for you. It only takes my machine about 5-7 seconds to run (though I have an overclocked desktop processor).

I think we should not completely abandon writing the seed with sqlalchemy, as it is much easier to update the current seed functions if we change something in the schema. What you describe, where we run a seed and then generate a sql dump, sounds very good to me. We can then commit the seed sql dump to the repo so new people do not need to generate it. I'm thinking we could even gzip the sql dumps to make them smaller. The seed endpoints can then just run the sql dump.
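A rough sketch of the gzip idea, to show the shape of it. It uses Python's built-in sqlite3 purely as a stand-in so the example runs on its own; the project's real database would need its own dump and restore tooling, and `write_seed_dump` / `load_seed_dump` are hypothetical names.

```python
import gzip
import sqlite3


def write_seed_dump(conn, dump_path):
    """Write the SQL needed to rebuild the database to a gzip-compressed file."""
    with gzip.open(dump_path, "wt", encoding="utf-8") as f:
        # iterdump() yields the schema and data as SQL statements.
        for statement in conn.iterdump():
            f.write(statement + "\n")


def load_seed_dump(conn, dump_path):
    """Replay a gzip-compressed SQL dump into an empty database."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        conn.executescript(f.read())
```

The slow sqlalchemy seed would then only run when the dump has to be regenerated; every other test run restores from the compressed file.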

Do you want to take this on, or should I?

PIG208 commented 2 years ago

I can work on this this weekend.

AnubisLMS / Anubis

OPT Speed Up Seeding #286