apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0
8.06k stars 1.83k forks

[Feature][Connectors] LocalFile Support reading gz #8025

Open zhdech opened 1 week ago

zhdech commented 1 week ago

Purpose of this pull request

Resolves https://github.com/apache/seatunnel/issues/8019

Does this PR introduce any user-facing change?

no

How was this patch tested?

Check list

zhdech commented 1 week ago

Thanks @zhdech ! Could you add a test case for this feature?

OK. May I ask how to resolve the following build errors? What do you need me to do?

Hisoka-X commented 1 week ago

May I ask how to resolve the following build errors? What do you need me to do?

Try re-triggering the failed CI run; it is unstable. cc @zhangshenghang

zhdech commented 2 days ago

@Hisoka-X Sir, please help me check it.

corgy-w commented 1 day ago

I forgot to add: although .xlsx files cannot be read after being compressed with gz, .xls files can. A test for that can be added later. cc @Hisoka-X @zhdech

zhdech commented 1 day ago

I forgot to add: although .xlsx files cannot be read after being compressed with gz, .xls files can. A test for that can be added later. cc @Hisoka-X @zhdech

When testing .xls locally, it reports that the format is not supported.

The configuration is as follows:

```
env {
  parallelism = 1
  job.mode = "BATCH"
  spark.app.name = "SeaTunnel"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.master = local
  job.mode = "BATCH"
}

source {
  LocalFile {
    path = "/seatunnel/read/gz/excel/single/e2e-xls-gz.xls.gz"
    result_table_name = "fake"
    file_format_type = excel
    archive_compress_codec = "gz"
    field_delimiter = ;
    skip_header_row_number = 1
    schema = {
      fields {
        c_map = "map<string, string>"
        c_array = "array"
        c_string = string
        c_boolean = boolean
        c_tinyint = tinyint
        c_smallint = smallint
        c_int = int
        c_bigint = bigint
        c_float = float
        c_double = double
        c_bytes = bytes
        c_date = date
        c_decimal = "decimal(38, 18)"
        c_timestamp = timestamp
        c_row = {
          c_map = "map<string, string>"
          c_array = "array"
          c_string = string
          c_boolean = boolean
          c_tinyint = tinyint
          c_smallint = smallint
          c_int = int
          c_bigint = bigint
          c_float = float
          c_double = double
          c_bytes = bytes
          c_date = date
          c_decimal = "decimal(38, 18)"
          c_timestamp = timestamp
        }
      }
    }
  }
}

sink {
  Assert {
    rules {
      row_rules = [
        {
          rule_type = MAX_ROW
          rule_value = 5
        },
        {
          rule_type = MIN_ROW
          rule_value = 5
        }
      ],
      field_rules = [
        {
          field_name = c_string
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = c_boolean
          field_type = boolean
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = c_double
          field_type = double
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        }
      ]
    }
  }
}
```

corgy-w commented 22 hours ago

When testing .xls locally, it reports that the format is not supported.

Got it. I will check it out when I have time. Thanks @zhdech
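
For anyone reproducing the .xls.gz report locally, it may help to first confirm what the archive actually contains. The following is a minimal standalone sketch, not SeaTunnel code: the class `GzExcelSniffer` and its methods are hypothetical names. It decompresses the stream with the JDK's `GZIPInputStream` and checks the leading bytes against the OLE2 signature used by legacy .xls files and the ZIP signature used by .xlsx containers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for local diagnosis; not part of SeaTunnel.
public class GzExcelSniffer {
    // OLE2 compound-document signature (legacy .xls workbooks).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature (.xlsx is a ZIP container).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Decompresses the gz stream and classifies the payload by its first bytes.
    public static String sniff(InputStream gzStream) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(gzStream)) {
            byte[] head = new byte[8];
            int filled = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            if (startsWith(head, filled, OLE2_MAGIC)) {
                return "xls";
            }
            if (startsWith(head, filled, ZIP_MAGIC)) {
                return "xlsx";
            }
            return "unknown";
        }
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Test helper: gz-compresses a byte array in memory.
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Synthetic payload that only carries the ZIP signature.
        byte[] fakeXlsx = {0x50, 0x4B, 0x03, 0x04, 0, 0, 0, 0};
        System.out.println(sniff(new ByteArrayInputStream(gzip(fakeXlsx))));
    }
}
```

To check a real file, pass a `FileInputStream` opened on the .gz path to `sniff`. A result of `xls` would suggest the archive holds a genuine legacy workbook, so the "not supported" message is more likely to come from the reader than from the test file.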