Include complete snapshot information in SharedSnapshot

weinan003 commented 2 months ago

In an MPP query, the snapshot is synchronized through SharedSnapshot between the QD and entrydb, and between the QE writer gang and reader gang. The SharedSnapshot slot array is a region of shared memory allocated when the cluster instance starts. Each session holds a slot until the session ends. A snapshot includes distributed transaction, local transaction, and subtransaction information. If all of this information were stored in shared memory, the array would exceed 2 GiB, so to save memory the subtransaction information was left out. Unfortunately, this design introduces a data loss bug.

In session a:
1. Start a transaction block (suppose its txnid is 100) and use a savepoint
   to enter a subtransaction (suppose its txnid is 101).
2. Create a new table.

In session b:
3. Run a command that uses entrydb to scan pg_class (suppose its txnid is 102),
   e.g. `create table xxx select pg_class full join pg_class`.
4. Drop the table and execute step 3 again (suppose the txnids are 103 and 104).

In session a:
5. Commit the transaction.

The transaction commits successfully, but the new table that was
created in it is lost.

The reason is as follows. In step 4 of session b, entrydb's snapshot is fetched from the SharedSnapshot slot, which does not include subtransaction information. When heapam fetches tuples from pg_class, it checks each tuple's MVCC visibility against the snapshot. The new table's record has xmin 101, and the snapshot's xmin/xmax is 100/104. Since txnid 101 is found in neither the snapshot's xip nor its subxip, and the clog has no commit record for it, the MVCC check concludes that transaction 101 aborted. HEAP_XMIN_INVALID, a heap-tuple-level MVCC status bit, is then set on the pg_class tuple.

After session b's scan, the tuple is effectively sentenced to death: because the bit is set, it stays invisible even after session a commits.
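The visibility decision described above can be illustrated with a toy model. This is a minimal sketch, not the heapam code: `ToySnapshot`, `toy_judge_xmin`, `toy_clog_did_commit` and `toy_scenario` are hypothetical names invented for illustration, the commit log is hard-coded to the xids from the scenario, and details such as subtransaction overflow handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;            /* hypothetical stand-in for a transaction id */

/* Toy snapshot holding only the fields the explanation mentions. */
typedef struct ToySnapshot {
    ToyXid        xmin;             /* xids below this are no longer running */
    ToyXid        xmax;             /* xids at or above this are still running */
    const ToyXid *xip;              /* in-progress top-level xids */
    size_t        xcnt;
    const ToyXid *subxip;           /* in-progress subtransaction xids */
    size_t        subxcnt;
} ToySnapshot;

typedef enum ToyVerdict { TOY_IN_PROGRESS, TOY_COMMITTED, TOY_ABORTED } ToyVerdict;

bool toy_xid_in_array(const ToyXid *xids, size_t n, ToyXid xid)
{
    for (size_t i = 0; i < n; i++)
        if (xids[i] == xid)
            return true;
    return false;
}

/* Toy commit log for the scenario: only 102 and 103 have committed so far. */
bool toy_clog_did_commit(ToyXid xid)
{
    return xid == 102 || xid == 103;
}

/* Decide the fate of a tuple's xmin as seen through the snapshot. */
ToyVerdict toy_judge_xmin(const ToySnapshot *snap, ToyXid xid)
{
    assert(snap != NULL);
    if (xid >= snap->xmax)
        return TOY_IN_PROGRESS;
    if (xid >= snap->xmin &&
        (toy_xid_in_array(snap->xip, snap->xcnt, xid) ||
         toy_xid_in_array(snap->subxip, snap->subxcnt, xid)))
        return TOY_IN_PROGRESS;
    if (toy_clog_did_commit(xid))
        return TOY_COMMITTED;
    /* Not running per the snapshot and not committed: treated as aborted,
     * which is the point where the invalid-xmin bit would be set. */
    return TOY_ABORTED;
}

/* Judge xmin 101 with snapshot xmin/xmax 100/104, with or without subxip. */
ToyVerdict toy_scenario(bool with_subxip)
{
    static const ToyXid xip[] = {100};
    static const ToyXid subxip[] = {101};
    ToySnapshot snap = {
        .xmin = 100, .xmax = 104,
        .xip = xip, .xcnt = 1,
        .subxip = with_subxip ? subxip : NULL,
        .subxcnt = with_subxip ? 1 : 0,
    };
    return toy_judge_xmin(&snap, 101);
}
```

In this model, `toy_scenario(false)` yields `TOY_ABORTED` (the buggy case, where the slot carries no subtransaction xids), while `toy_scenario(true)` yields `TOY_IN_PROGRESS`, so the tuple would be left alone until session a commits.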

To fix the bug, SharedSnapshot needs to include subtransaction information again. To save shared memory, each SharedSnapshot slot keeps its snapshot in DSM (dynamic shared memory).

Each SharedSnapshot slot holds a dsm_handle as the access point to its snapshot DSM segment. On first use of a slot, the DSM segment is created with the size of the serialized snapshot. dsm_pin_segment and dsm_pin_mapping are called on the segment, so its lifetime matches that of the postmaster process. The segment is destroyed and recreated only if a new snapshot is larger than the existing segment.
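The grow-only reuse policy for a slot's segment can be sketched as below. This is an illustration under stated assumptions, not the patch itself: `ToySlot`, `toy_slot_store` and `toy_slot_release` are hypothetical names, and `malloc`/`free` stand in for creating and destroying a pinned DSM segment so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy slot: the real slot would hold a dsm_handle instead of a pointer. */
typedef struct ToySlot {
    void    *segment;        /* stand-in for the mapped DSM segment */
    size_t   segment_size;   /* capacity of the current segment */
    size_t   snapshot_size;  /* bytes used by the current serialized snapshot */
    unsigned creations;      /* how many times a segment has been created */
} ToySlot;

/* Store a serialized snapshot, recreating the segment only when it is
 * missing (first use) or too small for the new snapshot. */
bool toy_slot_store(ToySlot *slot, const void *serialized, size_t size)
{
    assert(slot != NULL && serialized != NULL && size > 0);

    if (slot->segment == NULL || size > slot->segment_size) {
        void *fresh = malloc(size);      /* stand-in for creating and pinning */
        if (fresh == NULL)
            return false;
        free(slot->segment);             /* stand-in for destroying the old one */
        slot->segment = fresh;
        slot->segment_size = size;
        slot->creations++;
    }
    memcpy(slot->segment, serialized, size);
    slot->snapshot_size = size;
    return true;
}

/* Release the segment and reset the slot to its unused state. */
void toy_slot_release(ToySlot *slot)
{
    assert(slot != NULL);
    free(slot->segment);
    memset(slot, 0, sizeof *slot);
}
```

Starting from a zero-initialized `ToySlot`, storing a 16-byte snapshot creates the segment, storing a smaller one afterwards reuses it, and only a larger snapshot triggers a recreate. The trade-off is that a segment never shrinks, so a slot keeps the footprint of the largest snapshot it has held.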

fix #ISSUE_Number

Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it breaks backward or forward compatibility, please clarify.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

[ ] Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
[ ] Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
[ ] Learn the coding contribution guide, including our code conventions, workflow and more.
[ ] List your communication in the GitHub Issues or Discussions (if any or if needed).
[ ] Document changes.
[ ] Add tests for the change
[ ] Pass make installcheck
[ ] Pass make -C src/test installcheck-cbdb-parallel
[ ] Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

CLAassistant commented 2 months ago

All committers have signed the CLA.

avamingli commented 2 months ago

Nice Catch!
