fractal-analytics-platform / fractal-web

Web client for Fractal
https://fractal-analytics-platform.github.io/fractal-web/
BSD 3-Clause "New" or "Revised" License

Fix CI failing on node16 #558

Closed tcompa closed 2 months ago

tcompa commented 2 months ago

End-to-end tests are failing on node16 due to some timeout (e.g. https://github.com/fractal-analytics-platform/fractal-web/actions/runs/10883375884/job/30198888345). Is it just that node16 is slower (in which case we could maybe increase the allowed runtimes), or is something actually going wrong?
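For reference, if node16 merely turns out to be slower, the allowed runtimes could be raised in the Playwright config. A minimal sketch (the values are illustrative, not the repo's actual settings):

```javascript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 60_000,            // per-test timeout (Playwright default: 30s)
  expect: { timeout: 10_000 } // per-assertion timeout (default: 5s)
});
```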

zonia3000 commented 2 months ago

I'm writing down some notes about the things we discussed, just for reference.

A general note about Playwright: most of the time when it reports a "timeout error" it is not really a timeout error. By default Playwright waits for a condition to match what the test expects (e.g. 3 buttons are present in the page), and if the condition is still not met after a while it throws errors like `Timed out 5000ms waiting for expect(locator).toHaveCount(expected)`.
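For example, a minimal Playwright test that hits such an auto-retrying assertion (the URL and selector here are assumptions for illustration):

```javascript
import { test, expect } from '@playwright/test';

test('page shows 3 buttons', async ({ page }) => {
  await page.goto('http://localhost:5173'); // assumed dev-server URL
  // Playwright polls the page until exactly 3 buttons exist; if that
  // never happens within the assertion timeout, the failure reads
  // "Timed out 5000ms waiting for expect(locator).toHaveCount(expected)"
  await expect(page.locator('button')).toHaveCount(3);
});
```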

We have always had some flaky tests. Currently in the Playwright config we have `retries: 3`, meaning that a failed test can be retried up to 3 times before being considered failed. A test is marked as "flaky" if it fails but a subsequent attempt succeeds. Under some odd conditions, a test that was often flaky started failing consistently on node 16 this Monday.
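For reference, the relevant setting looks like this (a sketch; only `retries` is the point here):

```javascript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // A failed test is re-run up to 3 more times; if any later attempt
  // passes, the report marks the test as "flaky" instead of "failed".
  retries: 3
});
```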

That particular test was failing because it relies on the fake job server, a simple server started during the tests to simulate the execution of a job through a task running the `fake-task.sh` script (which calls the fake job server to check whether it has to fail or not). The server makes it possible to trigger job success or failure at specific moments (Playwright performs a PUT call to the server to change the job status when needed). In certain tests, especially when you have to test jobs in the submitted state, this approach provides finer control than using a `generic_task`, which terminates after a given sleep time.

This test server was developed when we didn't run tests in parallel, so it wasn't written to handle parallel jobs. I changed that in #560, adding support for concurrent jobs. This should reduce the chance of flaky tests when using the fake server.
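A minimal sketch of how such a fake job server could handle concurrent jobs by keying its state on a job id (the routes, port, and payload shape are assumptions, not the real test server's API):

```javascript
import http from 'node:http';

// Per-job status map: keying on the job id is what lets concurrently
// running tests drive their own jobs independently.
const jobs = new Map(); // jobId -> 'submitted' | 'done' | 'failed'

http.createServer((req, res) => {
  const jobId = req.url.split('/').filter(Boolean).pop();
  if (req.method === 'PUT') {
    // The Playwright test PUTs the desired status at the right moment.
    let body = '';
    req.on('data', (chunk) => (body += chunk));
    req.on('end', () => {
      jobs.set(jobId, JSON.parse(body).status);
      res.end('ok');
    });
  } else {
    // fake-task.sh polls here to learn whether it should fail or succeed.
    res.end(jobs.get(jobId) ?? 'submitted');
  }
}).listen(8080);
```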

In other cases, flaky tests seem to be caused by backend errors in the `local_experimental_executor`, which fails to terminate some jobs, leaving them in a submitted state. See https://github.com/fractal-analytics-platform/fractal-server/issues/1772.

Another weird thing that I've mitigated with #560 is something that happens sporadically when running `v2/run_mock_tasks.spec.js`, mostly on Chrome, which is faster.

That test waits on the workflow page for 2 submitted tasks to complete (it detects the 2 green checkboxes), then it clicks the "List jobs" button and checks the content of the jobs table. Sometimes the jobs table contains the last job still in the submitted status, even though on the previous workflow page it was marked as completed. I added a check for this condition that waits for the job to finish also on the job list page before performing the next steps. When this condition occurs the test prints the following warning message: `WARNING: waiting for the completion of a job that should have already been completed!`.
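The added check looks roughly like this (a sketch; the selectors and status strings are assumptions about the real spec, not verbatim code):

```javascript
// Locate the status cell of the last job in the jobs table (assumed markup).
const lastJobStatus = page.locator('table tbody tr').last().locator('.status');
if ((await lastJobStatus.innerText()) === 'submitted') {
  console.warn(
    'WARNING: waiting for the completion of a job that should have already been completed!'
  );
  // Auto-retrying assertion: poll until the status flips to "done".
  await expect(lastJobStatus).toHaveText('done');
}
```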

I suspect that the cause is a misalignment between the responses provided by the following endpoints:

I assume that something is not being updated atomically, or that we are encountering some cache issue (I'm not talking about the browser cache, but maybe something in SQLAlchemy; I'm speculating). @tcompa What do you think?