apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0
8.06k stars 1.83k forks source link

[Bug] [Seatunnel] Error when conditional column is not used in SELECT clause #7974

Closed bilst-visa closed 2 weeks ago

bilst-visa commented 2 weeks ago

Search before asking

What happened

An error occurs when a conditional column is not included in the SELECT clause while using JDBC-Hive. For example select name, age from xyz where emp_id=100 this fails with error Invalid table alias or column reference 'emp_id': (possible column names are: name,age)

SeaTunnel Version

2.3.8

SeaTunnel Config

Default conf

Running Command

query: "select name, age from xyz"
"where_condition"="where emp_id=100"

Error Exception

org.apache.seatunnel.connectors.seatunnel.jdbc.exception.JdbcConnectorException: ErrorCode:[JDBC-04], ErrorDescription:[Connector database failed] - open() failed.Unknown column 'emp_id' in 'where clause'
    at org.apache.seatunnel.connectors.seatunnel.jdbc.internal.JdbcInputFormat.open(JdbcInputFormat.java:104)
    at org.apache.seatunnel.connectors.seatunnel.jdbc.source.JdbcSourceReader.pollNext(JdbcSourceReader.java:67)
    at org.apache.seatunnel.engine.server.task.flow.SourceFlowLifeCycle.collect(SourceFlowLifeCycle.java:159)
    at org.apache.seatunnel.engine.server.task.SourceSeaTunnelTask.collect(SourceSeaTunnelTask.java:127)
    at org.apache.seatunnel.engine.server.task.SeaTunnelTask.stateProcess(SeaTunnelTask.java:168)
    at org.apache.seatunnel.engine.server.task.SourceSeaTunnelTask.call(SourceSeaTunnelTask.java:132)
    at org.apache.seatunnel.engine.server.TaskExecutionService$BlockingWorker.run(TaskExecutionService.java:693)
    at org.apache.seatunnel.engine.server.TaskExecutionService$NamedTaskWrapper.run(TaskExecutionService.java:1018)
    at org.apache.seatunnel.api.tracing.MDCRunnable.run(MDCRunnable.java:39)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.sql.SQLSyntaxErrorException: Unknown column 'emp_id' in 'where clause'
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
    at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953)
    at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java:1009)
    at org.apache.seatunnel.connectors.seatunnel.jdbc.internal.JdbcInputFormat.open(JdbcInputFormat.java:99)
    ... 13 more

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

Are you willing to submit PR?

Code of Conduct

bilst-visa commented 2 weeks ago

Here's my analysis:

I debugged the code and found that currently when we give where clause separately, the statement which gets executed is like below "SELECT * FROM (SELECT 'name', 'age' from 'xyz') tmp where emp_id=100" which is not the correct behavior. I think it should be "SELECT 'name', 'age' from 'xyz' where emp_id=100"

bilst-visa commented 2 weeks ago

cc @Hisoka-X @arshadmohammad

arshadmohammad commented 2 weeks ago

it makes sense to query data with a filter. Fetching the records first and then filtering is inefficient and costly. I'm not sure why current behaviour is this way, but it doesn't seem intentional. It appears to be a bug

Hisoka-X commented 2 weeks ago

cc @hailin0

Hisoka-X commented 2 weeks ago

The core problem is that the filter field in the where_condition must appear in the result of the subquery. If the where condition is directly concatenated to the subquery, incompatibility will occur. For example:

query: """ SELECT product, SUM(quantity) AS total_quantity
FROM sales
WHERE quantity > 10
GROUP BY product"""
"where_condition"="where total_quantity = 100"

It will become

SELECT product, SUM(quantity) AS total_quantity
FROM sales
WHERE quantity > 10
GROUP BY product where total_quantity = 100

This is not a right SQL. But it works when use temp table

SELECT * 
FROM (
    SELECT product, SUM(quantity) AS total_quantity
    FROM sales
    WHERE quantity > 10
    GROUP BY product
) tmp WHERE total_quantity = 100;
bilst-visa commented 2 weeks ago

@Hisoka-X I got what you are saying but I find below two cases are a problem with the current behavior.

  1. If you want to use a where clause with a main query, not the subquery
  2. How do we achieve this using SeaTunnelTask-web? As it always appends it to a subquery
Hisoka-X commented 2 weeks ago

If you want to use a where clause with a main query, not the subquery

With engine only, we can directly use

query: "select name, age from xyz where emp_id=100"

How do we achieve this using SeaTunnelTask-web? As it always appends it to a subquery

In seatunnel web, we can not use this for now. But I believe most database will do same optimization, so subquery not a big problem with function.

Case by case, we can change query like this:

query: "select name, age, emp_id from xyz"
"where_condition"="where emp_id=100"

It should be worked, then we can use transform to drop emp_id field. WDYT about this way? @arshadmohammad @bilst-visa

arshadmohammad commented 2 weeks ago

It should be worked, then we can use transform to drop emp_id field. WDYT about this way? @arshadmohammad @bilst-visa

yes, it will work but i think it will be inefficient and costly.

arshadmohammad commented 2 weeks ago

My understanding is that in SeaTunnel queries, we have two WHERE clauses: an inner WHERE clause and an outer WHERE clause. When the job configuration file is created manually, users can decide what to include in the inner and outer WHERE clauses without any issues.

However, in SeaTunnel-Web, there is currently no option to generate an inner WHERE clause; only the outer WHERE clause is supported. I suggest that SeaTunnel-Web internally appends the WHERE clause to the main query, ensuring that only the inner WHERE clause is used. Any thoughts on this?

Hisoka-X commented 2 weeks ago

I suggest that SeaTunnel-Web internally appends the WHERE clause to the main query, ensuring that only the inner WHERE clause is used. Any thoughts on this?

+111. This is a pretty good idea.

arshadmohammad commented 2 weeks ago

Thanks @bilst-visa for raising the PR https://github.com/apache/seatunnel-web/pull/238 PR is merged.