Kitware / CDash

An open source, web-based software testing server
http://www.cdash.org/
Other
215 stars 76 forks source link

Upgrade from 3.5.1 to 3.6.0 never finishes migration steps #2545

Open martin5233 opened 2 days ago

martin5233 commented 2 days ago

Bug report

Expected Behavior

Upgrading from 3.5.1 to 3.6.0 should run necessary database migrations and then start normal work

Actual Behavior

The migration steps never seem to finish. Specifically the DB query

            DELETE FROM testmeasurement
            WHERE
                testid IS NULL OR
                testid NOT IN (
                    SELECT id from build2test

takes more than 20 hours, after which I aborted the process. Just for reference: Our testmeasurement table contains about 3 million rows and the build2test table contains about 2 million rows. We run about 20,000 tests on 3 platforms per night. It seems, that this query is extremely slow, Postgres continually consumes 100% CPU with no I/O.

CDash Version

3.5.1 -> 3.6.0

Additional Information

Logging does not show any helpful information.

williamjallen commented 2 days ago

Whenever a query is slow, it's helpful to take a look at the query plan. Please post the output of the following:

EXPLAIN DELETE FROM testmeasurement WHERE testid IS NULL OR testid NOT IN (SELECT id from build2test)
martin5233 commented 2 days ago

@williamjallen: Thanks for the quick reply. I tried to do exactly that, but it fails, because testid is not known. From my understanding this is caused by the migration script 2024_07_09_025240_find_test_measurements_by_testid.php consists of multiple steps, where the second (which is the problematic one) builds upon the first, which adds this column.

williamjallen commented 2 days ago

Ah, good point. Just a guess, but you might try adding ->index() to the command where the testid column is added to see if that helps Postgres figure out which things need to be deleted more efficiently.

martin5233 commented 1 day ago

It's somewhat difficult to do this inside the docker container, as the migration starts as soon as I start up the container. I would need to rebuild the docker image with changed source code for the migration step. This is something, that I don't want to do. We decided to start from scratch instead of upgrading from 3.5.1, so it's not an issue for me anymore. But anyway you might want to look into it, as I'm probably not the only, where this upgrade may take ages.

bilke commented 1 day ago

I have the same issue. @williamjallen I have added ->index() like so:

diff --git a/database/migrations/2024_07_09_025240_find_test_measurements_by_testid.php b/database/migrations/2024_07_09_025240_find_test_measurements_by_testid.php
index 0d24bf246..0ef5ec291 100644
--- a/database/migrations/2024_07_09_025240_find_test_measurements_by_testid.php
+++ b/database/migrations/2024_07_09_025240_find_test_measurements_by_testid.php
@@ -32,7 +32,7 @@ return new class extends Migration {
                 build2test.id
             FROM testmeasurement
             JOIN build2test ON testmeasurement.outputid = build2test.outputid
-        ');
+        ')->index();

         // Delete any entries which will fail the FK constraint (including the old values which now have a null testid)
         DB::delete('

But this does work as DB::insert() returns a bool:

cdash  |    INFO  Running migrations.
cdash  |
cdash  |   2024_07_09_025240_find_test_measurements_by_testid    INFO  No scheduled commands are ready to run.
cdash  |
cdash  | ........... 42,851ms FAIL
cdash  |
cdash  | In 2024_07_09_025240_find_test_measurements_by_testid.php line 35:
cdash  |
cdash  |   Call to a member function index() on bool

I guess you meant adding ->index() to something else?

williamjallen commented 1 day ago

@bilke The ->index() should be added to line 16 of that migration where the column is added.

bilke commented 1 day ago

Thanks, new error:

cdash  | Running migrations...
cdash  |
cdash  |    INFO  Running migrations.
cdash  |
cdash  |   2024_07_09_025240_find_test_measurements_by_testid ............... 14ms FAIL
cdash  |
cdash  | In Connection.php line 829:
cdash  |
cdash  |   SQLSTATE[42P07]: Duplicate table: 7 ERROR:  relation "testmeasurement_testi
cdash  |   d_index" already exists (Connection: pgsql, SQL: create index "testmeasurem
cdash  |   ent_testid_index" on "testmeasurement" ("testid"))
cdash  |
cdash  |
cdash  | In Connection.php line 587:
cdash  |
cdash  |   SQLSTATE[42P07]: Duplicate table: 7 ERROR:  relation "testmeasurement_testi
cdash  |   d_index" already exists
cdash  |
stefankaufmann commented 20 hours ago

We ran into the same issue. We have about 1.5 million entries in build2test and the migration did not end after several hours. According to https://stackoverflow.com/a/19364694/12490545 the performance of "NOT IN" queries is pretty bad.

I opened a pull request with a fix which worked for us #2552