[Feature][Connectors] LocalFile Support reading gz

zhdech commented 1 week ago

Purpose of this pull request

solve https://github.com/apache/seatunnel/issues/8019

Does this PR introduce any user-facing change?

no

How was this patch tested?

Check list

[ ] If your PR adds any new Jar binary package, please add a License Notice according to the New License Guide
[ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
[ ] If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config
[ ] Update the release-note.

zhdech commented 1 week ago

Thanks @zhdech ! Could you add a test case for this feature?

OK. May I ask how to resolve the following build errors? What do you need me to do?

Hisoka-X commented 1 week ago

May I ask how to resolve the following build errors? What do you need me to do?

Try retriggering the failed CI. It is unstable. cc @zhangshenghang

zhdech commented 2 days ago

@Hisoka-X Sir, please help me check it.

corgy-w commented 1 day ago

Forgot to add, although .xlsx files do not support reading after being compressed by gz, .xls does. Can be added later for testing. cc @Hisoka-X @zhdech

zhdech commented 1 day ago

Forgot to add, although .xlsx files do not support reading after being compressed by gz, .xls does. Can be added later for testing. cc @Hisoka-X @zhdech

When testing .xls locally, it reports that it is not supported.

The configuration is as follows:

```
env {
  parallelism = 1
  job.mode = "BATCH"
  spark.app.name = "SeaTunnel"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.master = local
  job.mode = "BATCH"
}

source {
  LocalFile {
    path = "/seatunnel/read/gz/excel/single/e2e-xls-gz.xls.gz"
    result_table_name = "fake"
    file_format_type = excel
    archive_compress_codec = "gz"
    field_delimiter = ;
    skip_header_row_number = 1
    schema = {
      fields {
        c_map = "map<string, string>"
        c_array = "array"
        c_string = string
        c_boolean = boolean
        c_tinyint = tinyint
        c_smallint = smallint
        c_int = int
        c_bigint = bigint
        c_float = float
        c_double = double
        c_bytes = bytes
        c_date = date
        c_decimal = "decimal(38, 18)"
        c_timestamp = timestamp
        c_row = {
          c_map = "map<string, string>"
          c_array = "array"
          c_string = string
          c_boolean = boolean
          c_tinyint = tinyint
          c_smallint = smallint
          c_int = int
          c_bigint = bigint
          c_float = float
          c_double = double
          c_bytes = bytes
          c_date = date
          c_decimal = "decimal(38, 18)"
          c_timestamp = timestamp
        }
      }
    }
  }
}

sink {
  Assert {
    rules {
      row_rules = [
        { rule_type = MAX_ROW rule_value = 5 },
        { rule_type = MIN_ROW rule_value = 5 }
      ],
      field_rules = [
        {
          field_name = c_string
          field_type = string
          field_value = [ { rule_type = NOT_NULL } ]
        },
        {
          field_name = c_boolean
          field_type = boolean
          field_value = [ { rule_type = NOT_NULL } ]
        },
        {
          field_name = c_double
          field_type = double
          field_value = [ { rule_type = NOT_NULL } ]
        }
      ]
    }
  }
}
```
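For reference, reading a gz-wrapped file generally means decompressing the byte stream first and then handing the result to the format reader. The sketch below shows that step using only `java.util.zip` from the JDK. It is not the connector's code: the class name `GzReadSketch` and its helper methods are hypothetical, and the sample payload stands in for a real file such as the `.xls.gz` above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only (not part of SeaTunnel).
public class GzReadSketch {

    // Compress raw bytes with gzip; stands in for an existing *.gz file.
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        }
        return buffer.toByteArray();
    }

    // Decompress a gzip stream fully into memory so that a downstream
    // format reader can consume the original, uncompressed bytes.
    public static byte[] gunzip(InputStream compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "c_string;c_int\nhello;1\n".getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(original);
        byte[] unpacked = gunzip(new ByteArrayInputStream(packed));
        // Print the decompressed content to show the round trip.
        System.out.print(new String(unpacked, StandardCharsets.UTF_8));
    }
}
```

Whether a given format works after this step depends on what its reader needs from the decompressed input, which may be why the behaviour differs between file types here.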

corgy-w commented 22 hours ago

When testing .xls locally, it reports that it is not supported.

Got it. I will check it out when I have time. Thanks @zhdech
