apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[Bug] [seatunnel-connectors-v2] [connector-elasticsearch] Incorrect Encoding When Writing to StarRocks Resulting in Garbled Text #7545

Open Hachiman566 opened 2 months ago

Hachiman566 commented 2 months ago

What happened

Text read from Elasticsearch arrives garbled in StarRocks. The root cause is on the read side: the Content-Type header of the Elasticsearch response does not include a charset, and when no charset is specified SeaTunnel falls back to ISO-8859-1 to decode the response body. The mis-decoded text is then written to StarRocks, which only supports UTF-8, and shows up as garbled text.

The encoding handling needs to be adjusted to ensure compatibility and data integrity. I am open to discussion on potential solutions and improvements, and I am considering submitting a pull request to address this issue.
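The effect described above can be reproduced with the JDK alone. The following is a minimal sketch, not SeaTunnel's actual code: the class `CharsetSketch` and the method `decodeBody` are hypothetical names introduced for illustration. It assumes the response body is UTF-8 and shows what happens when those bytes are decoded as ISO-8859-1, plus one possible fix, which is to fall back to UTF-8 when the response declares no charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of SeaTunnel.
class CharsetSketch {

    // Decode a response body with the declared charset, or fall back to
    // UTF-8 (an assumed default) when the response did not declare one.
    static String decodeBody(byte[] body, Charset declared) {
        Charset charset = (declared != null) ? declared : StandardCharsets.UTF_8;
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Two CJK characters, written as Unicode escapes so the source file
        // encoding does not matter.
        String original = "\u4e2d\u6587";

        // Simulate a UTF-8 response body with no charset in Content-Type.
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        // Decoding the UTF-8 bytes as ISO-8859-1 turns every byte into its
        // own character, which is the garbling described in this issue.
        String garbled = new String(body, StandardCharsets.ISO_8859_1);

        // Falling back to UTF-8 when no charset is declared keeps the text intact.
        String fixed = decodeBody(body, null);

        System.out.println("garbled equals original: " + garbled.equals(original));
        System.out.println("fixed equals original:   " + fixed.equals(original));
    }
}
```

If the connector reads the body through Apache HttpClient's `EntityUtils.toString`, the overload that takes a default charset would be one place to apply the same fallback. That is an assumption about the connector's internals and would need to be checked against the source.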

SeaTunnel Version

2.3.3

SeaTunnel Config

env {
  parallelism = 2
  job.mode = "BATCH"
  checkpoint.interval = 10000
}

source {
    Elasticsearch {
        hosts = ["http://127.0.0.1:10014"]
        index = "sec_evt_info"
        username = "elastic"
        password = ""
        result_table_name = "src_es"
        schema = {
         fields {
            test_data = string
            }
        }
        query = {"range":{"recordTime":{"gte":"2024-07-01 00:00:00"}}}

        tls_verify_certificate = false
    }
}

sink {
  StarRocks {
    source_table_name = "src_es"
    nodeUrls = ["127.0.0.1:9030"]
    base-url = "jdbc:mysql://127.0.0.1:8030/"
    username = root
    password = "password"
    database = "db"
    table = "table_name"
    batch_max_rows = 1000
    starrocks.config = {
      format = "CSV"
    }
  }
}

Running Command

./bin/seatunnel.sh --config job/es2starrocks.config

Error Exception

None

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.