Open yazun opened 2 years ago
Confirmed:
It only happens when used in insert into ... select ...
context.
Thank you for your feedback, we will assign a developer to deal with this problem, and we will solve it for you as soon as possible
Much appreciated!
@yazun ,hello,enum NodeType 113 is mapped to T_GatherMergeState, maybe you used parallel mode, Can you see if max_parallel_workers_per_gather is 0?
Hello @LinChuangwei,
max_parallel_workers_per_gather
is globally set to 20 indeed, but the plan does not show parallel execution, so not sure it should be a problem here?
It may be that the plan changes at some time,you can try to set max_parallel_workers_per_gather to 0. In addition, can you help set log_error_verbosity to verbose and see the specific print location from the pg_log when an error occurs? thank you very much
Yes, indeed you were right, setting max_parallel_workers_per_gather to 0 helped.
explain analyze
insert into dr3_ops_cs36_mv.final_dr3_animation_variables_exported_1
select sourceid,
sostypes,classification_types,
...
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Remote Subquery Scan on all (datanode1,datanode10,datanode11,datanode12,datanode2,datanode3,datanode4,datanode5,datanode6,datanode7,datanode8,datanode9) (cost=20479222.53..3368986147837810.00 rows=48044735794174 width=283) (actual time=18931.348..18931.348 rows=0 loops=1)
-> Insert on final_dr3_animation_variables_exported_1 (cost=20479222.53..3368986147837810.00 rows=48044735794174 width=283)
DN (actual startup time=18428.574..18484.696 total time=18428.574..18484.696 rows=0..0 loops=1..1)
-> Remote Subquery Scan on all (datanode11) (cost=20479222.53..3368986147837810.00 rows=48044735794174 width=283)
DN (actual startup time=1136.135..16936.424 total time=18413.332..18460.620 rows=197..231 loops=1..1)
Distribute results by S: sourceid
-> Nested Loop (cost=20479122.53..3285820710177995.00 rows=48044735794174 width=283)
DN (actual startup time=327.351..327.351 total time=17723.463..17723.463 rows=2582..2582 loops=1..1)
-> Remote Subquery Scan on all (datanode1,datanode10,datanode11,datanode12,datanode2,datanode3,datanode4,datanode5,datanode6,datanode7,datanode8,datanode9) (cost=20479020.78..117892842332.74 rows=35884647727 width=3281)
DN (actual startup time=12.791..12.791 total time=24.792..24.792 rows=2582..2582 loops=1..1)
-> Hash Join (cost=20478920.78..47659097.27 rows=35884647727 width=3281)
DN (actual startup time=7.841..7.841 total time=8.152..8.152 rows=231..231 loops=1..1)
Hash Cond: (final_dr3_export_helper.sourceid = final_dr3_run90002_flat_environment_10.sourceid)
-> Index Scan using uq_final_dr3_export_helper on final_dr3_export_helper (cost=0.11..2090176.13 rows=209 width=128)
DN (actual startup time=0.122..0.122 total time=0.291..0.291 rows=231..231 loops=1..1)
Index Cond: ((sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint))
-> Hash (cost=20468920.67..20468920.67 rows=1 width=3206)
DN (actual startup time=7.653..7.653 total time=7.653..7.653 rows=231..231 loops=1..1)
Buckets: 1024 Batches: 1 Memory Usage: 501kB
-> Nested Loop (cost=5.88..20468920.67 rows=1 width=3206)
DN (actual startup time=0.154..0.154 total time=6.971..6.971 rows=231..231 loops=1..1)
-> Nested Loop (cost=4.75..20458085.00 rows=1 width=1970)
DN (actual startup time=0.147..0.147 total time=5.985..5.985 rows=231..231 loops=1..1)
-> Nested Loop (cost=3.63..20447249.33 rows=1 width=993)
DN (actual startup time=0.122..0.122 total time=4.558..4.558 rows=231..231 loops=1..1)
-> Nested Loop (cost=2.51..20436413.88 rows=1 width=969)
DN (actual startup time=0.102..0.102 total time=3.310..3.310 rows=231..231 loops=1..1)
-> Merge Join (cost=1.37..20425578.19 rows=1 width=678)
DN (actual startup time=0.091..0.091 total time=2.125..2.125 rows=231..231 loops=1..1)
Merge Cond: (tsr_rp.sourceid = final_dr3_run90002_flat_environment_10.sourceid)
-> Merge Append (cost=0.12..9861340.98 rows=985 width=307)
DN (actual startup time=0.029..0.029 total time=0.369..0.369 rows=231..231 loops=1..1)
Sort Key: tsr_rp.sourceid
-> Index Scan using timeseriesresult_1_1_pkey on timeseriesresult_1_1 tsr_rp (cost=0.12..9861336.54 rows=985 width=307)
DN (actual startup time=0.025..0.025 total time=0.345..0.345 rows=231..231 loops=1..1)
Index Cond: ((runid = 3852) AND (catalogid = '2'::smallint) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = 336928))
-> Materialize (cost=1.25..10554234.28 rows=973 width=371)
DN (actual startup time=0.054..0.054 total time=1.594..1.594 rows=231..231 loops=1..1)
-> Nested Loop (cost=1.25..10554232.82 rows=973 width=371)
DN (actual startup time=0.045..0.045 total time=1.504..1.504 rows=231..231 loops=1..1)
-> Merge Append (cost=0.12..9741333.85 rows=973 width=307)
DN (actual startup time=0.005..0.005 total time=0.413..0.413 rows=231..231 loops=1..1)
Sort Key: tsr_bp.sourceid
-> Index Scan using timeseriesresult_1_1_pkey on timeseriesresult_1_1 tsr_bp (cost=0.12..9741329.46 rows=973 width=307)
DN (actual startup time=0.005..0.005 total time=0.385..0.385 rows=231..231 loops=1..1)
Index Cond: ((runid = 3852) AND (catalogid = '2'::smallint) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = 336911))
-> Bitmap Heap Scan on final_dr3_run90002_flat_environment_10 (cost=1.12..835.46 rows=1 width=64)
DN (actual startup time=0.003..0.003 total time=0.003..0.003 rows=1..1 loops=231..231)
Recheck Cond: ((sourceid = tsr_bp.sourceid) AND (sourceid >= '13677065656154881'::bigint))
-> Bitmap Index Scan on pk_final_dr3_run90002_flat_environment_10 (cost=0.00..1.12 rows=1 width=0)
DN (actual startup time=0.002..0.002 total time=0.002..0.002 rows=1..1 loops=231..231)
Index Cond: ((sourceid = tsr_bp.sourceid) AND (sourceid >= '13677065656154881'::bigint))
-> Append (cost=1.14..835.68 rows=1 width=307)
DN (actual startup time=0.005..0.005 total time=0.005..0.005 rows=1..1 loops=231..231)
-> Bitmap Heap Scan on timeseriesresult_1_1 tsr_g (cost=1.14..835.68 rows=1 width=307)
DN (actual startup time=0.003..0.003 total time=0.004..0.004 rows=1..1 loops=231..231)
Recheck Cond: ((runid = 3852) AND (catalogid = '2'::smallint) AND (sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid <= '20800599629677184'::bigint) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = 336935))
-> Bitmap Index Scan on timeseriesresult_1_1_pkey (cost=0.00..1.14 rows=1 width=0)
DN (actual startup time=0.002..0.002 total time=0.002..0.002 rows=1..1 loops=231..231)
Index Cond: ((runid = 3852) AND (catalogid = '2'::smallint) AND (sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid <= '20800599629677184'::bigint) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = 336935))
-> Append (cost=1.12..835.45 rows=1 width=26)
DN (actual startup time=0.005..0.005 total time=0.005..0.005 rows=1..1 loops=231..231)
-> Bitmap Heap Scan on source_1 s (cost=1.12..835.45 rows=1 width=26)
DN (actual startup time=0.004..0.004 total time=0.004..0.004 rows=1..1 loops=231..231)
Recheck Cond: ((sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (catalogid = '2'::smallint))
-> Bitmap Index Scan on source_1_pkey (cost=0.00..1.12 rows=1 width=0)
DN (actual startup time=0.002..0.002 total time=0.002..0.002 rows=1..1 loops=231..231)
Index Cond: ((sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint))
-> Append (cost=1.12..835.67 rows=1 width=979)
DN (actual startup time=0.006..0.006 total time=0.006..0.006 rows=1..1 loops=231..231)
-> Bitmap Heap Scan on ts_1 ts_bp (cost=1.12..835.67 rows=1 width=979)
DN (actual startup time=0.005..0.005 total time=0.005..0.005 rows=1..1 loops=231..231)
Recheck Cond: ((sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = '2'::smallint) AND (catalogid = '2'::smallint))
-> Bitmap Index Scan on ts_fix_1_pkey (cost=0.00..1.12 rows=1 width=0)
DN (actual startup time=0.003..0.003 total time=0.003..0.003 rows=1..1 loops=231..231)
Index Cond: ((sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = '2'::smallint))
-> Append (cost=1.12..835.67 rows=1 width=1238)
DN (actual startup time=0.004..0.004 total time=0.004..0.004 rows=1..1 loops=231..231)
-> Bitmap Heap Scan on ts_1 ts_g (cost=1.12..835.67 rows=1 width=1238)
DN (actual startup time=0.003..0.003 total time=0.003..0.003 rows=1..1 loops=231..231)
Recheck Cond: ((sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = '1'::smallint) AND (catalogid = '2'::smallint))
-> Bitmap Index Scan on ts_fix_1_pkey (cost=0.00..1.12 rows=1 width=0)
DN (actual startup time=0.001..0.001 total time=0.001..0.001 rows=1..1 loops=231..231)
Index Cond: ((sourceid = final_dr3_run90002_flat_environment_10.sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = '1'::smallint))
-> Materialize (cost=101.75..80938.33 rows=1 width=74)
DN (actual startup time=6.495..6.495 total time=6.588..6.588 rows=1..1 loops=2582..2582)
-> Nested Loop (cost=101.75..80938.33 rows=1 width=74)
DN (actual startup time=6.495..6.495 total time=6.587..6.587 rows=1..1 loops=2582..2582)
-> Result (cost=0.00..10000.53 rows=1 width=32)
DN (actual startup time=0.396..0.396 total time=0.397..0.397 rows=1..1 loops=2582..2582)
-> Materialize (cost=101.75..50937.81 rows=1 width=74)
DN (actual startup time=6.097..6.097 total time=6.189..6.189 rows=1..1 loops=2582..2582)
-> Nested Loop (cost=101.75..50937.80 rows=1 width=74)
DN (actual startup time=6.097..6.097 total time=6.189..6.189 rows=1..1 loops=2582..2582)
-> Remote Subquery Scan on all (datanode1,datanode10,datanode11,datanode12,datanode2,datanode3,datanode4,datanode5,datanode6,datanode7,datanode8,datanode9) (cost=101.12..936.65 rows=1 width=979)
DN (actual startup time=0.957..0.957 total time=1.048..1.048 rows=1..1 loops=2582..2582)
-> Append (cost=1.12..835.67 rows=1 width=979)
DN (actual startup time=33.615..33.615 total time=33.615..33.615 rows=1..1 loops=1..1)
-> Bitmap Heap Scan on ts_1 ts_rp (cost=1.12..835.67 rows=1 width=979)
DN (actual startup time=33.615..33.615 total time=33.615..33.615 rows=1..1 loops=1..1)
Recheck Cond: ((sourceid = sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = '3'::smallint) AND (catalogid = '2'::smallint))
-> Bitmap Index Scan on ts_fix_1_pkey (cost=0.00..1.12 rows=1 width=0)
DN (actual startup time=2.111..2.111 total time=2.111..2.111 rows=1..1 loops=1..1)
Index Cond: ((sourceid = sourceid) AND (sourceid >= '13677065656154881'::bigint) AND (sourceid <= '20800599629677184'::bigint) AND (ftimeseriestype = '3'::smallint))
-> Nested Loop (cost=0.63..40001.15 rows=1 width=64)
DN (actual startup time=5.132..5.132 total time=5.133..5.133 rows=1..1 loops=2582..2582)
-> Result (cost=0.00..10000.53 rows=1 width=32)
DN (actual startup time=0.358..0.358 total time=0.358..0.358 rows=1..1 loops=2582..2582)
-> Function Scan on remove_filtered bp_rp (cost=0.63..10000.63 rows=1 width=64)
DN (actual startup time=4.772..4.772 total time=4.773..4.773 rows=1..1 loops=2582..2582)
Planning time: 475.338 ms
Execution time: 19104.729 ms
(112 rows)
Can you provide a complete SQL? I'll try to reproduce this error,thanks
@LinChuangwei sql is quite convoluted as uses few functions and unfortunately I cannot make the dump with the data, but here goes the query:
set prefer_olap=true;
set work_mem='8GB';
set cpu_tuple_cost=10000;
set pgxl_remote_fetch_size=25000;
set max_parallel_workers_per_gather to 20;
explain analyze
insert into dr3_ops_cs36_mv.final_dr3_animation_variables_exported_1
select sourceid,
sostypes,classification_types,
(g).obstimes g_obstime,(g).val g_vals, cardinality((g).val) glength,
(bp_rp).obstimes bprp_obstime,(bp_rp).val bprp_vals, cardinality((bp_rp).val) bprp_length,
alpha,delta,
varpierror_mas,
fmedian_g,fmedian_bp,fmedian_rp,fmean_g,fmean_bp,fmean_rp
from (
select
sourceid,
s.alpha,s.delta,
varpierror_mas,
fmedian_g,fmedian_bp,fmedian_rp,fmean_g,fmean_bp,fmean_rp,
sostypes,classification_types,
bp_rp
,
(
cleanTimeseriesFluxToMag(
ts_g.obstimes
,ts_g.transitids
,ts_g.vals
,ts_g.valserr
,ts_g.flags
,(tsr_g.derivedtschanges->(gettstypeid('Gaia', 'GAIA_PHOT_G')::text))::int[]
,getzeropoint('Gaia', 'GAIA_PHOT_G')
)
) g
from
dr3_ops_cs36_mv.v_final_dr3_export_helper h
join dr3_ops_cs36_mv.final_dr3_run90002_flat_environment v using(Sourceid)
join timeseriesresult tsr_g using (sourceid)
join timeseriesresult tsr_bp using(runid,catalogid,sourceid)
join timeseriesresult tsr_rp using(runid,catalogid,sourceid)
join source s using (catalogid,sourceid)
join ts as ts_bp using (catalogid,sourceid)
join ts as ts_g using (catalogid,sourceid)
join ts as ts_rp using (catalogid,sourceid)
join lateral (
select gaiafluxtomagnitudecomposite ('Gaia', gettstypename(ts_bp.ftimeseriestype),ts_bp.obstimes::float8[],ts_bp.vals::float8[],ts_bp.valserr::float8[],false ) V
) bp on true
join lateral (
select gaiafluxtomagnitudecomposite ('Gaia', gettstypename(ts_rp.ftimeseriestype),ts_rp.obstimes::float8[],ts_rp.vals::float8[],ts_rp.valserr::float8[],false ) v
) rp on true
join lateral (select
array_cat(array_cat(
(tsr_g.derivedtschanges->((getInputTsType(tsr_g.ftimeseriestype)::text)))::int[],
(tsr_bp.derivedtschanges->((getInputTsType(tsr_bp.ftimeseriestype)::text)))::int[]) ,
(tsr_rp.derivedtschanges->((getInputTsType(tsr_rp.ftimeseriestype)::text)))::int[]) changes) changes on true
join lateral
remove_filtered(
((bp.v).obstimes::float8[],(bp.v).vals::float8[],(bp.v).valserr::float8[],ts_bp.flags::int[])::ts_bucket_p,
((rp.v).obstimes::float8[],(rp.v).vals::float8[],(rp.v).valserr::float8[],ts_rp.flags::int[])::ts_bucket_p,
changes
) bp_rp on true
where
catalogid = getmaincatalog()
and runid = 3852
and tsr_g.ftimeseriestype = gettsresulttypeid(3852,'ExtremeErrorCleaningMagnitudeDependent_FOV_G')
and tsr_bp.ftimeseriestype = gettsresulttypeid(3852,'ExtremeErrorCleaningMagnitudeDependent_FOV_BP')
and tsr_rp.ftimeseriestype = gettsresulttypeid(3852,'ExtremeErrorCleaningMagnitudeDependent_FOV_RP')
and ts_g.ftimeseriestype = gettstypeid('Gaia','GAIA_PHOT_G')
and ts_bp.ftimeseriestype = gettstypeid('Gaia','GAIA_PHOT_BP')
and ts_rp.ftimeseriestype = gettstypeid('Gaia','GAIA_PHOT_RP')
and h.sourceid >= 13677065656154881 AND h.sourceid <= 20800599629677184
and v.sourceid >= 13677065656154881 AND tsr_g.sourceid <= 20800599629677184
and tsr_g.sourceid >= 13677065656154881 AND tsr_g.sourceid <= 20800599629677184
and tsr_bp.sourceid >= 13677065656154881 AND tsr_bp.sourceid <= 20800599629677184
and tsr_rp.sourceid >= 13677065656154881 AND tsr_rp.sourceid <= 20800599629677184
and s.sourceid >= 13677065656154881 AND s.sourceid <= 20800599629677184
and ts_g.sourceid >= 13677065656154881 AND ts_g.sourceid <= 20800599629677184
and ts_bp.sourceid >= 13677065656154881 AND ts_bp.sourceid <= 20800599629677184
and ts_rp.sourceid >= 13677065656154881 AND ts_rp.sourceid <= 20800599629677184
) fuller
gives
ERROR: node:datanode12, backend_pid:11466, nodename:datanode12,backend_pid:11466,message:node:datanode7, backend_pid:1468, nodename:datanode7,backend_pid:1468,message:Failed to read from SQueue, CONSUMER_ERROR set, err_msg "unrecognized node type: 113",,
source, ts, timeseriesresult tables are partitioned and sharded by sourceid.
The last range part of the query is just to make sure we deal with single partitions all along as this query is executed in sourceid range batches on a relatively large dataset.
All functions are marked parallel safe and immutable, all are sql ones except cleanTimeseriesFluxToMag
which is pljava.
can you help set log_error_verbosity to verbose and see the specific print location from the pg_log when an error occurs?
xz-log-insert_113_failure.csv Kibana log from around the event attached. The failing sql is from one of the pgsql functions.
2022-01-12 13:32:38 CET [1427,coord(0.0)]:xid[0-532/575] [3-1] user=,db=,client=,query=,FATAL: 57P01: terminating connection due to administrator command#0122022-01-12 13:32:38 CET [1427,coord(0.0)]:xid[0-532/575] [4-1] user=,db=,client=,query=,LOCATION: ProcessInterrupts, postgres.c:3956#0122022-01-12 13:32:38 CET [1427,coord(0.0)]:xid[0-532/575] [5-1] user=,db=,client=,query=,STATEMENT: #012#011with recursive parents(tstype_parent) as#012#011(#012#011 select tstype_parent from tstyperesult_tstyperesult where tstype_me = intstype#012#011 union all#012#011 select ti.tstype_parent from tstyperesult_tstyperesult ti,parents p where ti.tstype_me = p.tstype_parent#012#011#012#011 )#012#011select tstype_input from parents join tstype_tstyperesult on (tstype_parent = tstype_output)#012#011 join timeseriestype tst on (tstype_input = tstype)#012#011 join timeseriesresulttype tsrt on (tsrt.bandpass = tst.fbandpass and tsrt.tstype = intstype)#012#011;
So maybe the failure is related to CTE as there have been changes recently:
REATE OR REPLACE FUNCTION dr3_ops_cs36.getinputtstype(intstype integer)
RETURNS integer
LANGUAGE sql
IMMUTABLE
AS $function$
with recursive parents(tstype_parent) as
(
select tstype_parent from tstyperesult_tstyperesult where tstype_me = intstype
union all
select ti.tstype_parent from tstyperesult_tstyperesult ti,parents p where ti.tstype_me = p.tstype_parent
)
select tstype_input from parents join tstype_tstyperesult on (tstype_parent = tstype_output)
join timeseriestype tst on (tstype_input = tstype)
join timeseriesresulttype tsrt on (tsrt.bandpass = tst.fbandpass and tsrt.tstype = intstype)
;
$function$
(plus I am noticing now that it is not parallel safe strangely)
but making it parallel safe yields the same error.
Thank you for your reply, we will continue to analyze the problem
@runewrz Was the error identified? Could you point to the commit fixing it?
@runewrz Was the error identified? Could you point to the commit fixing it?
Thanks for your reply, we will continue to follow up on the issue.
Looks like a new error we get (on V2.2) quite often, but without a clear repeatability pattern but happening very often:
enum NodeType 113 is mapped to T_ProjectSetPath it seems:
Example plan that fails:
When sometimes the query works (which is
insert into .. select ..
) we getting analyze plan like this:Normally such query could be pushed down,but as you see it does not happen and the error is a quasi-blocker for us. Let us know please if you could think of a workaround - otherwise we probably go back to pre v2.2 code.