apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[Bug] [LocalFile] LocalFile source with Spark local mode multiplies the data #6868

Closed AdkinsHan closed 2 months ago

AdkinsHan commented 4 months ago

Search before asking

What happened

When I used Spark local mode to read a local CSV file into a Hive table, the row count was multiplied 3N times; this did not happen when I used Spark yarn cluster mode. I used SeaTunnel 1.5 before and the migration jobs ran in local mode, but when I tested version 2.3.5 the data was duplicated. Summary:

- --master local --deploy-mode client: 3 times the rows
- --master yarn --deploy-mode client: 3 times the rows
- --master yarn --deploy-mode cluster: correct

My CSV file has 2076 rows, but select count(1) from xx shows 3*2076.
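The mismatch can be checked before blaming any one component by comparing the CSV's data-row count against the Hive count. A minimal sketch of that check (the sample file, path, and contents below are hypothetical, not the reporter's actual data):

```shell
# Create a small sample CSV with one header row and three data rows (illustrative only).
printf 'sku,sku_group\nA,1\nB,2\nC,3\n' > /tmp/sample_sku.csv

# Count data rows: total lines minus the header row that skip_header_row_number=1 drops.
csv_rows=$(($(wc -l < /tmp/sample_sku.csv) - 1))
echo "data rows: ${csv_rows}"   # → 3

# Then compare against the sink, e.g. (not run here):
#   hive -e "select count(1) from ghydata.ods_file_pjp_old_new_sku_yy"
# A tripled sink count with a correct csv_rows points at the read/write path, not the file.
```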

SeaTunnel Version

2.3.5

SeaTunnel Config

env {
  # seatunnel defined streaming batch duration in seconds
  execution.parallelism = 4
  job.mode = "BATCH"
  spark.executor.instances = 4
  spark.executor.cores = 4
  spark.executor.memory = "4g"
  spark.sql.catalogImplementation = "hive"
  spark.hadoop.hive.exec.dynamic.partition = "true"
  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
}

source {
    LocalFile {
    schema {
            fields {
                  sku = string
                  sku_group = string
                  pb = string
                  series = string
                  pn = string
                  mater_n = string
                }
    }
      path = "/data/ghyworkbase/uploadfile/h019-ods_file_pjp_old_new_sku_yy.csv"
      file_format_type = "csv"
      skip_header_row_number=1
      result_table_name="ods_file_pjp_old_new_sku_yy_source"
    }
}

transform {
  Sql {
    source_table_name="ods_file_pjp_old_new_sku_yy_source"
    query = "select sku,sku_group,pb,series,pn,mater_n,TO_CHAR(CURRENT_DATE(),'yyyy') as dt_year from ods_file_pjp_old_new_sku_yy_source "
    result_table_name="ods_file_pjp_old_new_sku_yy"

  }
}

sink {

#   Console {
#      source_table_name = "ods_file_pjp_old_new_sku_yy"
#    }

   Hive {
     source_table_name="ods_file_pjp_old_new_sku_yy"
     table_name = "ghydata.ods_file_pjp_old_new_sku_yy"
     metastore_uri = "thrift://"
   }

}

Running Command

sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
  --master local \
  --deploy-mode client \
  --queue ghydl \
  --executor-instances 4 \
  --executor-cores 4 \
  --executor-memory 4g \
  --name "h019-ods_file_pjp_old_new_sku_yy" \
  --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf
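For comparison, the yarn-cluster submission that the reporter says produced the correct count would look like this, a sketch reusing the same script and flags with only --master and --deploy-mode changed:

```shell
sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
  --master yarn \
  --deploy-mode cluster \
  --queue ghydl \
  --executor-instances 4 \
  --executor-cores 4 \
  --executor-memory 4g \
  --name "h019-ods_file_pjp_old_new_sku_yy" \
  --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf
```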

Error Exception

No exception is thrown; the only symptom is that the data is tripled (3*).

Zeta or Flink or Spark Version

No response

Java or Scala Version

/usr/local/jdk/jdk1.8.0_341

Screenshots

No response

Are you willing to submit PR?

Code of Conduct

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

github-actions[bot] commented 2 months ago

This issue has been closed because it has not received a response for a long time. You could reopen it if you encounter similar problems in the future.