apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Bug] [dataquality] Data quality - null value detection - execution error #16435

Open wuchunfu opened 1 month ago


Search before asking

What happened

When I use PostgreSQL as the metadata database for DolphinScheduler and run a data quality null-value detection task, the task fails with an error indicating that the "dolphinschedulers" schema does not exist.
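For anyone hitting the same error: the stack trace below shows the failure comes from Spark's `JdbcUtils.createTable` while the `JdbcWriter` tries to write the detection result back to PostgreSQL, so the `CREATE TABLE` it issues is qualified with a schema name that does not exist. A quick diagnostic sketch (assuming `psql` access to the metadata database; the pattern is just an illustration) is to list the schemas Postgres actually has:

```
-- Diagnostic sketch: list existing schemas to confirm that no
-- "dolphinschedulers" schema is present in the metadata database.
SELECT nspname FROM pg_namespace WHERE nspname LIKE 'dolphin%';
```

If only `public` (or `dolphinscheduler` without the trailing "s") comes back, the schema name the writer generates does not match anything in the database.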

What you expected to happen

[INFO] 2024-08-09 14:54:29.629 +0800 -  -> 
    24/08/09 14:54:29 INFO Client: Application report for application_1722308400032_0053 (state: RUNNING)
[INFO] 2024-08-09 14:54:30.630 +0800 -  -> 
    24/08/09 14:54:30 INFO Client: Application report for application_1722308400032_0053 (state: FINISHED)
    24/08/09 14:54:30 INFO Client: 
         client token: N/A
         diagnostics: User class threw exception: org.postgresql.util.PSQLException: ERROR: schema "dolphinschedulers" does not exist
      Position: 14
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:329)
        at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:315)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:291)
        at org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:265)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
        at org.apache.dolphinscheduler.data.quality.flow.batch.writer.JdbcWriter.write(JdbcWriter.java:87)
        at org.apache.dolphinscheduler.data.quality.execution.SparkBatchExecution.executeWriter(SparkBatchExecution.java:132)
        at org.apache.dolphinscheduler.data.quality.execution.SparkBatchExecution.execute(SparkBatchExecution.java:58)
        at org.apache.dolphinscheduler.data.quality.context.DataQualityContext.execute(DataQualityContext.java:62)
        at org.apache.dolphinscheduler.data.quality.DataQualityApplication.main(DataQualityApplication.java:78)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)

         ApplicationMaster host: 10.10.4.230
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1723186518789
         final status: FAILED
         tracking URL: http://node02:8088/proxy/application_1722308400032_0053/
         user: default
    Exception in thread "main" org.apache.spark.SparkException: Application application_1722308400032_0053 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1269)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1627)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    24/08/09 14:54:30 INFO ShutdownHookManager: Shutdown hook called
    24/08/09 14:54:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-71c505a9-2358-4455-b8dd-3838611055c9
    24/08/09 14:54:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-33420b51-2b30-4a48-9b4f-502a4b7976a0
[INFO] 2024-08-09 14:54:30.632 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/default/14382384652384/14559278303968_3/39/46, processId:1301503 ,exitStatusCode:1 ,processWaitForStatus:true ,processExitValue:1
[INFO] 2024-08-09 14:54:30.633 +0800 - Start finding appId in /opt/dolphinscheduler/worker-server/logs/20240809/14559278303968/3/39/46.log, fetch way: log 
[INFO] 2024-08-09 14:54:30.639 +0800 - Find appId: application_1722308400032_0053 from /opt/dolphinscheduler/worker-server/logs/20240809/14559278303968/3/39/46.log
[INFO] 2024-08-09 14:54:30.640 +0800 - ***********************************************************************************************
[INFO] 2024-08-09 14:54:30.640 +0800 - *********************************  Finalize task instance  ************************************
[INFO] 2024-08-09 14:54:30.640 +0800 - ***********************************************************************************************
[INFO] 2024-08-09 14:54:30.641 +0800 - Upload output files: [] successfully
[INFO] 2024-08-09 14:54:30.657 +0800 - Send task execute status: FAILURE to master : 10.10.4.251:1234
[INFO] 2024-08-09 14:54:30.658 +0800 - Remove the current task execute context from worker cache
[INFO] 2024-08-09 14:54:30.658 +0800 - The current execute mode isn't develop mode, will clear the task execute file: /tmp/dolphinscheduler/exec/process/default/14382384652384/14559278303968_3/39/46
[INFO] 2024-08-09 14:54:30.697 +0800 - Success clear the task execute file: /tmp/dolphinscheduler/exec/process/default/14382384652384/14559278303968_3/39/46
[INFO] 2024-08-09 14:54:30.699 +0800 - FINALIZE_SESSION

How to reproduce

Use PostgreSQL as the metadata database for DolphinScheduler, then run a data quality null-value detection task; the error reproduces.
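As a possible workaround until the root cause is fixed (this is an assumption on my part, not a confirmed fix): since the writer's `CREATE TABLE` fails only because the target schema is missing, creating that schema manually in the metadata database may unblock the task:

```
-- Hypothetical workaround: pre-create the schema the JDBC writer expects,
-- so its CREATE TABLE for the result table can succeed.
CREATE SCHEMA IF NOT EXISTS dolphinschedulers;
```

The real fix is presumably for the data quality writer to use the correct schema name, so this only papers over the bug.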

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

Code of Conduct