citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.65k stars 670 forks source link

[Bug Fix] Query on distributed tables with window partition may cause segfault #7705 #7718

Closed colm-mchugh closed 1 week ago

colm-mchugh commented 3 weeks ago

This PR is a proposed fix for issue 7705. The following is the background and rationale for the fix (please refer to 7705 for context);

The varnullingrelsfield was introduced to the Var node struct definition in Postgres 16. Its purpose is to associate a variable with the set of outer join relations that can cause the variable to be NULL. The varnullingrels for the variable "gianluca_camp_test"."start_timestamp" in the problem query is 3, because the variable "gianluca_camp_test"."start_timestamp" is coming from the inner (nullable) side of an outer join and 3 is the RT index (aka relid) of that outer join. The problem occurs when the Postgres planner attempts to plan the combine query. The format of a combine query is:

SELECT <targets> 
FROM   pg_catalog.citus_extradata_container();

There is only one relation in a combine query, so no outer joins are present, but the non-empty varnullingrelsfield causes the Postgres planner to access structures for a non-existent relation. The source of the problem is that, when creating the target list for the combine query, function MasterAggregateMutator() uses copyObject() to construct a Var node before setting the master table ID, and this copies over the non-empty varnullingrels field in the case of the "gianluca_camp_test"."start_timestamp" var. The proposed solution is to have MasterAggregateMutator() use makeVar() instead of copyObject(), and only set the fields that make sense for the combine query; var type, collation and type modifier. The varnullingrelsfield can be left empty because there is only one relation in the combine query.

A new regress test issue_7705.sql is added to exercise the fix. The issue is not specific to window functions, any target expression that cannot be pushed down and contains at least one column from the inner side of a left outer join (so has a non-empty varnullingrels field) can cause the same issue.

More about Citus combine queries here. More about Postgres varnullingrels here.

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.69%. Comparing base (f695971) to head (c52f360). Report is 2 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #7718 +/- ## ======================================= Coverage 89.69% 89.69% ======================================= Files 283 283 Lines 60510 60510 Branches 7541 7541 ======================================= Hits 54276 54276 Misses 4079 4079 Partials 2155 2155 ```
agedemenli commented 3 weeks ago

https://github.com/citusdata/citus/actions/runs/11615097946/job/32345026099?pr=7718

The test seems to be flaky. Please check the link. The flaky check runs the edited/added test file many times to make sure its output is the same in every run. It is usually because either some clause like 'ORDER BY` is missing; or the behavior itself is flaky.

colm-mchugh commented 3 weeks ago

The test seems to be flaky... It is usually because either some clause like 'ORDER BY` is missing; or the behavior itself is flaky.

Thanks @agedemenli , a couple of queries in the regress test need to be modified to ensure stable output.

In addition to the flaky tests, there are 4 failures in check-arbitrary-configs tests, all with an error like the following; this appears to be environmental? Should re-submitting the PR be ok to address these?

The self-hosted runner: c6608c3fc000002 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
agedemenli commented 3 weeks ago

Arbitrary configs check seems to be ok. For the failing style check, please refer to our coding conventions If you have any issues, feel free to reach out to me

thanodnl commented 2 weeks ago

please consider if this needs to be backported, and if we can cut a patch release sometime soon (tm) given that it is a cx reported segfault.