arenadata / gpdb

Arenadata DB
https://docs.arenadata.io/en/ADB/current/introduction/intro.html
Apache License 2.0
40 stars 22 forks source link

ADBDEV-5310: Align fault injector behavior when target table provided #1086

Closed mos65o2 closed 1 day ago

mos65o2 commented 2 weeks ago

Align fault injector behavior when target table provided

Fixed injector behavior thanks to commit 9e73db3 (cherry-picked) to better handle tableName argument. The PR aligns the definition of insertbmlov*_freeze fault injectors between 6X and 7X. Fixes the replacement with SIMPLE_FAULT_INJECTOR in 1502bef.


Commits in the PR shouldn't be squashed to preserve authorship.

BenderArenadata commented 2 weeks ago

Allure report https://allure.adsw.io/launch/83046

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83260

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83503

RekGRpth commented 1 week ago

If the last commit is a cherry pick commit, then where are most of the tests?!

RekGRpth commented 1 week ago

Why doesn't revert revert everything?!

mos65o2 commented 1 week ago

Why doesn't revert revert everything?!

Because the test and the test in 7 and 6 are different. The test from 7 will not work in 6 without changes. If I revert everything, where will I get the changes that are in 6? Do I have to edit it manually?

RekGRpth commented 1 week ago

Why doesn't revert revert everything?!

Because the test and the test in 7 and 6 are different. The test from 7 will not work in 6 without changes. If I revert everything, where will I get the changes that are in 6? Do I have to edit it manually?

You should first completely revert commit 1502bef, and then cherry-pick commit 8accc02 again, making the necessary changes (as I did for commit 1502bef, only you will have fewer changes).

mos65o2 commented 1 week ago

You should first completely revert commit 1502bef, and then cherry-pick commit 8accc02 again, making the necessary changes (as I did for commit 1502bef, only you will have fewer changes).

I need to replace every "pg_switch_wal()" and others. What is the reason to re-edit the test manually if you have already made these changes?

diff 8accc02 1502bef

diff --git a/src/test/isolation2/sql/frozen_insert_crash.sql b/src/test/isolation2/sql/frozen_insert_crash.sql
index 97e6728e8e..89cf7fb2e7 100644
--- a/src/test/isolation2/sql/frozen_insert_crash.sql
+++ b/src/test/isolation2/sql/frozen_insert_crash.sql
@@ -15,18 +15,21 @@
 1: create table tab_fi(a int) with (appendoptimized=true) distributed replicated;

 -- switch WAL on seg0 to reduce flakiness
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+
+2: insert into tab_fi select generate_series(1, 1000);
+2: update tab_fi set a = 1 where a > 10;

 -- suspend right after the insert into gp_fastsequence during an AO table insert,
 -- but before freezing the tuple
 1: select gp_inject_fault('insert_fastsequence_before_freeze', 'suspend', ''/*DDL*/, ''/*DB*/, 'gp_fastsequence'/*table*/, 1/*start occur*/, 1/*end occur*/, 0/*extra_arg*/, dbid) from gp_segment_configuration where role = 'p' and content = 0;

-2>: insert into tab_fi values(1);
+2>: vacuum tab_fi;

 1: select gp_wait_until_triggered_fault('insert_fastsequence_before_freeze', 1, dbid) from gp_segment_configuration where role = 'p' and content = 0;

 -- switch WAL on seg0, so the new row gets flushed (including its index)
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;

 -- inject a panic, and resume the insert. The WAL for the freeze operation is not
 -- going to be made to disk (we just flushed WALs), so we won't replay it during restart later.
@@ -58,23 +61,26 @@
 1: create table tab_fi(a int) with (appendoptimized=true) distributed replicated;

 -- switch WAL on seg0 to reduce flakiness
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+
+2: insert into tab_fi select generate_series(1, 1000);
+2: update tab_fi set a = 1 where a > 10;

 -- suspend right after freezing the tuple
 1: select gp_inject_fault('insert_fastsequence_after_freeze', 'suspend', ''/*DDL*/, ''/*DB*/, 'gp_fastsequence'/*table*/, 1/*start occur*/, 1/*end occur*/, 0/*extra_arg*/, dbid) from gp_segment_configuration where role = 'p' and content = 0;

-2>: insert into tab_fi values(1);
+2>: vacuum tab_fi;

 1: select gp_wait_until_triggered_fault('insert_fastsequence_after_freeze', 1, dbid) from gp_segment_configuration where role = 'p' and content = 0;

 -- switch WAL on seg0, so the freeze record gets flushed
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;

 -- While we are on it, check the wal record for the freeze operation.
 -- One of the purposes is to guard against unexpected addition to the xl_heap_freeze_tuple struct in future.
 -- If this test failed due to WAL size, please check to see if the xl_heap_freeze_tuple struct
 -- has changed, and if we should initialize any new field in heap_freeze_tuple_no_cutoff().
-! seg0_datadir=$(psql -At -c "select datadir from gp_segment_configuration where content = 0 and role = 'p'" postgres) && seg0_last_wal_file=$(psql -At -c "SELECT pg_walfile_name(pg_current_wal_lsn()) from gp_dist_random('gp_id') where gp_segment_id = 0" postgres) && pg_waldump ${seg0_last_wal_file} -p ${seg0_datadir}/pg_wal | grep FREEZE_PAGE;
+! seg0_datadir=$(psql -At -c "select datadir from gp_segment_configuration where content = 0 and role = 'p'" postgres) && seg0_last_wal_file=$(psql -At -c "SELECT pg_xlogfile_name(pg_current_xlog_location()) from gp_dist_random('gp_id') where gp_segment_id = 0" postgres) && pg_xlogdump ${seg0_last_wal_file} -p ${seg0_datadir}/pg_xlog | grep freeze_page;

 -- inject a panic and resume in same way as Case 1. But this time we will be able to replay the frozen insert.
 -- skip FTS probe to prevent unexpected mirror promotion
@@ -126,7 +132,7 @@ $$ LANGUAGE plpgsql;
 1: create index tab_fi_idx on tab_fi using bitmap(b);
 1: insert into tab_fi values(1, 1);
 -- switch WAL on seg0 to reduce flakiness
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;

 -- case 1: suspend and flush WAL before freezing the tuple

@@ -136,7 +142,7 @@ $$ LANGUAGE plpgsql;
 2>: insert into tab_fi values(2, 2);
 1: select gp_wait_until_triggered_fault('insert_bmlov_before_freeze', 1, dbid) from gp_segment_configuration where role = 'p' and content = 0;
 -- switch WAL on seg0, so the new row gets flushed (including its index)
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
 -- inject a panic, and resume the insert. The WAL for the freeze operation is not
 -- going to be made to disk (we just flushed WALs), so we won't replay it during restart later.
 -- skip FTS probe to prevent unexpected mirror promotion
@@ -165,15 +171,15 @@ $$ LANGUAGE plpgsql;
 1: create index tab_fi_idx on tab_fi using bitmap(b);
 1: insert into tab_fi values(1, 1);
 -- switch WAL on seg0 to reduce flakiness
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
 -- suspend right after freezing the tuple
 1: select gp_inject_fault('insert_bmlov_after_freeze', 'suspend', dbid) from gp_segment_configuration where role = 'p' and content = 0;
 2>: insert into tab_fi values(2, 2);
 1: select gp_wait_until_triggered_fault('insert_bmlov_after_freeze', 1, dbid) from gp_segment_configuration where role = 'p' and content = 0;
 -- switch WAL on seg0, so the freeze record gets flushed
-1: select gp_segment_id, pg_switch_wal() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
+1: select gp_segment_id, pg_switch_xlog() is not null from gp_dist_random('gp_id') where gp_segment_id = 0;
 -- While we are on it, check the wal record for the freeze operation.
-! seg0_datadir=$(psql -At -c "select datadir from gp_segment_configuration where content = 0 and role = 'p'" postgres) && seg0_last_wal_file=$(psql -At -c "SELECT pg_walfile_name(pg_current_wal_lsn()) from gp_dist_random('gp_id') where gp_segment_id = 0" postgres) && pg_waldump ${seg0_last_wal_file} -p ${seg0_datadir}/pg_wal | grep FREEZE_PAGE;
+! seg0_datadir=$(psql -At -c "select datadir from gp_segment_configuration where content = 0 and role = 'p'" postgres) && seg0_last_wal_file=$(psql -At -c "SELECT pg_xlogfile_name(pg_current_xlog_location()) from gp_dist_random('gp_id') where gp_segment_id = 0" postgres) && pg_xlogdump ${seg0_last_wal_file} -p ${seg0_datadir}/pg_xlog | grep freeze_page;
 -- inject a panic and resume in same way as Case 1. But this time we will be able to replay the frozen insert.
 -- skip FTS probe to prevent unexpected mirror promotion
 1: select gp_inject_fault_infinite('fts_probe', 'skip', dbid) from gp_segment_configuration where role='p' and content=-1;
RekGRpth commented 1 week ago

You don't need to do anything manually, especially if it has already been done, just use the ready-made solution. Now you have a problem only in the design, not in the solution.

mos65o2 commented 1 week ago

You don't need to do anything manually, especially if it has already been done, just use the ready-made solution. Now you have a problem only in the design, not in the solution.

Should this be present after the revert?

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83595

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83602

RekGRpth commented 1 week ago

change

Commit 1502bef introduced a new flaking test. The test expects a panic to occur

to

Commit 09742e9 introduced a new flaking test. The test expects a panic to occur

mos65o2 commented 1 week ago

change

Commit 1502bef introduced a new flaking test. The test expects a panic to occur

to

Commit 09742e9 introduced a new flaking test. The test expects a panic to occur

Done

RekGRpth commented 1 week ago

I think that it was necessary to first revert the commit 3d4cc61 before reverting the commit 1502bef.

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83634

mos65o2 commented 1 week ago

I think that it was necessary to first revert the commit 3d4cc61 before reverting the commit 1502bef.

No need. Reverting commit 1502bef undoes 3d4cc61's changes

RekGRpth commented 1 week ago

I think that it was necessary to first revert the commit 3d4cc61 before reverting the commit 1502bef.

No need. Reverting commit 1502bef undoes 3d4cc61's changes

The reversion must be conflict-free!

mos65o2 commented 1 week ago

The reversion must be conflict-free!

Revert 1502bef clearly describes the reversal of the relevant test cases involving commit 3d4cc61. Revert 3d4cc61 clutters the patch. Revert 1502bef has no major conflicts.

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83674

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83699

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83714

RekGRpth commented 1 week ago

Cherry picked from 8accc02

I'm not sure these words are appropriate here.

mos65o2 commented 1 week ago

Cherry picked from 8accc02

I'm not sure these words are appropriate here.

Why? It's a cherry-pick

RekGRpth commented 6 days ago

Cherry picked from 8accc02

I'm not sure these words are appropriate here.

Why? It's a cherry-pick

Cherry-picл is not formatted like this, it doesn't look like this, and most of the changes from the original commit are missing.

mos65o2 commented 6 days ago

Cherry picked from 8accc02

I'm not sure these words are appropriate here.

Why? It's a cherry-pick

Cherry-picл is not formatted like this, it doesn't look like this, and most of the changes from the original commit are missing.

This is just a reference to the original commit. Is this a serious omission? Misleading?

BenderArenadata commented 5 days ago

Allure report https://allure.adsw.io/launch/83989

BenderArenadata commented 5 days ago

Allure report https://allure.adsw.io/launch/84001

BenderArenadata commented 1 day ago

Allure report https://allure.adsw.io/launch/84144