Tencent / TBase

TBase is an enterprise-level distributed HTAP database. It provides highly consistent distributed database services and high-performance data warehouse services through a single database cluster, forming an integrated enterprise-level solution.

Getting random `global xid is corrupted given len 51 gxid len8` errors at random queries. #103

Status: Closed (yazun closed this 2 years ago)

yazun commented 3 years ago

With v2.2.0 we started to see worrying errors. The query always works fine when issued without COPY:

# psql .. -c "copy ( qry ) to stdout with csv header " | wc -l
ERROR:  node:datanode11, backend_pid:47391, nodename:datanode1,backend_pid:6630,message:global xid is corrupted given len 51 gxid len8

(repeated 2-9 times) with the same error, then eventually it works:

# psql .. -c "copy ( qry ) to stdout with csv header " | wc -l
272657

Any idea what could be happening? What other info could we provide? Thanks.

yazun commented 3 years ago

It actually also happens for regular queries.

select sum(length) from ts;
ERROR:  node:datanode12, backend_pid:24061, nodename:datanode5,backend_pid:34335,message:global xid is corrupted given len 48 gxid len9
Time: 52.768 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sum(length) from ts;
ERROR:  node:datanode8, backend_pid:13942, nodename:datanode2,backend_pid:37623,message:global xid is corrupted given len 48 gxid len9
Time: 36.913 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sum(length) from ts;
ERROR:  node:datanode8, backend_pid:13942, nodename:datanode3,backend_pid:27589,message:global xid is corrupted given len 48 gxid len9
Time: 37.058 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sum(length) from ts;
ERROR:  node:datanode8, backend_pid:13942, nodename:datanode3,backend_pid:27589,message:global xid is corrupted given len 48 gxid len9
Time: 36.571 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sum(length) from ts;
ERROR:  node:datanode8, backend_pid:13942, nodename:datanode3,backend_pid:27589,message:global xid is corrupted given len 48 gxid len9
Time: 35.325 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sum(length) from ts;
ERROR:  node:datanode8, backend_pid:13942, nodename:datanode3,backend_pid:27589,message:global xid is corrupted given len 48 gxid len9
Time: 37.105 ms

Connecting to a different coordinator then helps. It looks like a v2.2.0 regression...
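
Since the failures above come from only a few datanode backends, it may help to tally which node pairs appear in the errors. The following is a minimal sketch, assuming the psql output has been saved or piped in; the function name `tally_gxid_errors` is hypothetical and not part of TBase.

```shell
# Hypothetical helper (not part of TBase): count "global xid is corrupted"
# errors per (node -> nodename) pair. Reads psql output on stdin.
tally_gxid_errors() {
    grep 'global xid is corrupted' \
        | sed -E 's/^ERROR: +node:([^,]+),.*nodename:([^,]+),.*$/\1 -> \2/' \
        | sort | uniq -c | sort -rn
}
```

A pair that dominates the tally would point at specific backends rather than at the query itself.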

yazun commented 3 years ago

Unfortunately this problem occurs more and more often, even for quite innocent queries. Any ideas? For example:

(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode6, backend_pid:9908, nodename:datanode4,backend_pid:7535,message:global xid is corrupted given len 53 gxid len9
Time: 32.294 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 27.912 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 26.062 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 26.927 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 25.664 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode3, backend_pid:18974, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 24.320 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 27.258 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 27.624 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 27.683 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 26.231 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode3, backend_pid:18974, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 24.936 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len9
Time: 26.833 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode3,backend_pid:18974,message:global xid is corrupted given len 53 gxid len9
Time: 25.697 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode3,backend_pid:18974,message:global xid is corrupted given len 53 gxid len9
Time: 25.583 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode3,backend_pid:18974,message:global xid is corrupted given len 53 gxid len9
Time: 26.417 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode5, backend_pid:34913, nodename:datanode3,backend_pid:18974,message:global xid is corrupted given len 53 gxid len9
Time: 24.796 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode3,backend_pid:18974,message:global xid is corrupted given len 53 gxid len10
Time: 25.981 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > select sourceid,classification_types,sostypes from dr3_ops_cs36_mv.final_dr3_export_helper where inClassification and inSosAgn and not('{AGN}' <@ classification_types);
ERROR:  node:datanode4, backend_pid:7535, nodename:datanode2,backend_pid:9042,message:global xid is corrupted given len 53 gxid len10

Changing the coordinator helps.

Dontpushme commented 3 years ago

It seems to be a problem with parallel workers. Try:

set max_parallel_workers_per_gather to 0;

yazun commented 3 years ago

We have it set to 10, so this will impact performance quite a lot. Is it understood why this happens?
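
If only some queries hit the error, one way to limit the performance cost might be to disable parallel workers for a single transaction rather than for the whole session. This is a sketch under the assumption that TBase honors `SET LOCAL` the way stock PostgreSQL does; the helper name `wrap_without_parallel` is hypothetical.

```shell
# Hypothetical helper: wrap one query in a transaction that disables parallel
# workers only for that transaction. SET LOCAL reverts at COMMIT or ROLLBACK.
# Assumption: TBase treats SET LOCAL like stock PostgreSQL.
wrap_without_parallel() {
    printf 'BEGIN;\nSET LOCAL max_parallel_workers_per_gather = 0;\n%s;\nCOMMIT;\n' "$1"
}

# Usage (connection options elided as elsewhere in this thread):
#   wrap_without_parallel 'select sum(length) from ts' | psql ..
```

Other queries in the same session would keep the configured value of 10.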

yazun commented 3 years ago

And I can confirm that switching off parallel workers stops the crashing (at the cost of speed, of course).

Dontpushme commented 3 years ago

> And I can confirm that switching off parallel workers stops the crashing (at the cost of speed, of course).

I'll stay on this issue. Please try different values of the GUC to see what happens.
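
Trying several values by hand is tedious, so a small generator can produce a psql script that runs the same query under each setting. This is only a sketch: the helper name `gen_guc_sweep` and the list of values are assumptions, and the query is whichever one reproduces the error.

```shell
# Hypothetical helper: emit a psql script that runs one query under several
# values of max_parallel_workers_per_gather, echoing the value before each run.
# Usage: gen_guc_sweep <query> <value>...
gen_guc_sweep() {
    query=$1
    shift
    for n in "$@"; do
        printf '\\echo max_parallel_workers_per_gather = %s\n' "$n"
        printf 'SET max_parallel_workers_per_gather = %s;\n' "$n"
        printf '%s;\n' "$query"
    done
}

# Usage (connection options elided as elsewhere in this thread):
#   gen_guc_sweep 'select sum(length) from ts' 0 1 2 4 10 | psql ..
```

By default psql continues after an error, so each value is attempted even when an earlier run fails.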

Dontpushme commented 3 years ago

Oh, and set enable_distri_debug_print to on, then send us the datanode (DN) log around the error message.
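
To pull the relevant part out of a datanode log, something like the following could be used. It is a minimal sketch: the helper name `gxid_log_context` is hypothetical, the log file location depends on the installation and is passed in as an argument, and the default of 20 context lines is an arbitrary choice.

```shell
# Hypothetical helper: print numbered context lines around every occurrence of
# the error in a datanode log file.
# Usage: gxid_log_context <logfile> [context_lines]   (default: 20)
gxid_log_context() {
    grep -n -C "${2:-20}" 'global xid is corrupted' "$1"
}
```

In the output, matching lines are marked with `:` after the line number and context lines with `-`.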

Dontpushme commented 2 years ago

It should be fixed now.

yazun commented 2 years ago

Yes, I can confirm we no longer reproduce it. Thanks a lot.