apache / cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
https://cloudberry.apache.org
Apache License 2.0

Including complete snapshot message into SharedSnapshot #613

Closed weinan003 closed 2 months ago

weinan003 commented 2 months ago

In an MPP query, the snapshot is synchronized through SharedSnapshot between the QD and entrydb, and between the QE writer gang and reader gangs. The SharedSnapshot slot array is a region of shared memory allocated when the cluster instance starts. Each session holds a slot until the session ends. Each snapshot includes distributed transaction, local transaction, and subtransaction information. If all of this information were stored in shared memory, the array would exceed 2 GiB. To save memory, the subtransaction information was left out. Unfortunately, this design introduced a data loss bug.

In session a:
1. Start a transaction block (suppose its txnid is 100) and use a savepoint
   to enter a subtransaction (suppose its txnid is 101).
2. Create a new table.

In session b:
3. Run a command that uses entrydb to scan pg_class (suppose its txnid is 102),
   e.g. `create table xxx select pg_class full join pg_class`.

4. Drop the table created in step 3 and run step 3 again (suppose the txnids are 103 and 104).

In session a:
5. Commit the transaction.

The transaction commits successfully, but the new table created in
the transaction is lost.

The reason is as follows. In step 4 in session b, the entrydb's snapshot is fetched from the SharedSnapshot slot, which does not include subtransaction information. When heapam fetches tuples from pg_class, it checks each tuple's MVCC visibility against the snapshot. The new table's record has xmin 101, and the snapshot's xmin/xmax is 100/104. Since txnid 101 is found in neither the snapshot's xip nor its subxip, and the clog has no commit record for it, the MVCC check concludes that transaction 101 aborted. HEAP_XMIN_INVALID, a tuple-level MVCC status bit, is then set on the pg_class tuple.

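After session b's scan, the tuple is effectively sentenced to death: with HEAP_XMIN_INVALID set, it is treated as inserted by an aborted transaction even after session a commits.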
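The visibility decision described above can be sketched in a few lines of C. This is a toy model, not the Cloudberry or PostgreSQL source: `ToySnapshot`, `classify_xmin`, and the `clog_committed` flag are hypothetical names, with fields loosely modeled on the xmin/xmax/xip/subxip terms used in this description. It only illustrates why leaving subxip out of the shared snapshot makes txnid 101 look aborted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Toy snapshot (hypothetical type, simplified from the real structure). */
typedef struct {
    Xid xmin;           /* xids below this are finished */
    Xid xmax;           /* xids at or above this are treated as in progress */
    const Xid *xip;     /* in-progress top-level xids */
    size_t xcnt;
    const Xid *subxip;  /* in-progress subtransaction xids */
    size_t subxcnt;
} ToySnapshot;

typedef enum { XMIN_IN_PROGRESS, XMIN_COMMITTED, XMIN_ABORTED } XminStatus;

static bool xid_in(const Xid *arr, size_t n, Xid xid)
{
    for (size_t i = 0; i < n; i++)
        if (arr[i] == xid)
            return true;
    return false;
}

/*
 * Classify a tuple's inserting xid against a snapshot.
 * clog_committed stands in for a commit-log lookup (assumption).
 */
static XminStatus classify_xmin(const ToySnapshot *snap, Xid xmin,
                                bool clog_committed)
{
    assert(snap != NULL);

    if (xmin >= snap->xmax)
        return XMIN_IN_PROGRESS;
    if (xid_in(snap->xip, snap->xcnt, xmin) ||
        xid_in(snap->subxip, snap->subxcnt, xmin))
        return XMIN_IN_PROGRESS;
    if (clog_committed)
        return XMIN_COMMITTED;

    /* Not in the snapshot and no commit record: judged aborted.
     * This is the point where the real code would set HEAP_XMIN_INVALID. */
    return XMIN_ABORTED;
}
```

With a snapshot of xmin/xmax 100/104 whose xip holds 100 but whose subxip is empty, `classify_xmin(&snap, 101, false)` yields `XMIN_ABORTED`; once 101 is present in subxip, the same call yields `XMIN_IN_PROGRESS` and the tuple is left alone.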

To fix the bug, SharedSnapshot needs to include subtransaction information again. To save shared memory, each SharedSnapshot slot keeps its snapshot in DSM (dynamic shared memory).

Each SharedSnapshot slot holds a dsm_handle as the access point to the snapshot's DSM segment. On first use of a slot, the DSM segment is created with the size of the serialized snapshot. The segment is pinned with dsm_pin_segment and dsm_pin_mapping, so its lifetime matches that of the postmaster process. The segment is destroyed and recreated only if a new snapshot is larger than the existing segment.
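The create-on-first-use, grow-only policy can be illustrated with a small self-contained model. This is only a sketch under assumptions: `ToySlot`, `slot_store`, and `slot_release` are hypothetical names, and `malloc`/`free` stand in for creating, pinning, and destroying a DSM segment, so none of this is the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy slot: a heap buffer stands in for the pinned DSM segment that the
 * real slot would reference through a dsm_handle. */
typedef struct {
    void *seg;          /* NULL until the slot is first used */
    size_t seg_size;    /* capacity of the current segment */
    size_t snap_size;   /* bytes of the snapshot currently stored */
    unsigned creations; /* how many times a segment was (re)created */
} ToySlot;

/*
 * Store a serialized snapshot in the slot. The segment is created on
 * first use and recreated only when the new snapshot does not fit.
 * Returns false if allocation fails, leaving the slot unchanged.
 */
static bool slot_store(ToySlot *slot, const void *serialized, size_t size)
{
    assert(slot != NULL && serialized != NULL && size > 0);

    if (slot->seg == NULL || size > slot->seg_size) {
        void *fresh = malloc(size);  /* stand-in for create + pin */
        if (fresh == NULL)
            return false;
        free(slot->seg);             /* stand-in for destroy; NULL is fine */
        slot->seg = fresh;
        slot->seg_size = size;
        slot->creations++;
    }
    memcpy(slot->seg, serialized, size);
    slot->snap_size = size;
    return true;
}

/* Release the segment and reset the slot to its unused state. */
static void slot_release(ToySlot *slot)
{
    assert(slot != NULL);
    free(slot->seg);
    memset(slot, 0, sizeof *slot);
}
```

Storing a 16-byte snapshot, then a 64-byte one, then a 16-byte one again creates a segment twice in total: the third store reuses the 64-byte segment. The trade-off is that a slot never shrinks, so its footprint tracks the largest snapshot it has seen.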

CLAassistant commented 2 months ago

CLA assistant check
All committers have signed the CLA.

avamingli commented 2 months ago

Nice Catch!