crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

[BUG] Several errors when reading Excel files using Java #837

Open yumble opened 4 months ago

yumble commented 4 months ago

Is there an existing issue for this?

Current Behavior

Several problems occur when I read Excel files with Spark from Java.

I'm currently building a service in which the schema of the file cannot be specified in advance: users upload their own files, so the file layout is arbitrary.

Screenshot of the uploaded Excel file: [image]

Screenshot of how the values should be displayed: [image]

Screenshot of the current, incorrect output: [image]

The problems with the current Excel file are as follows:

  1. The first column does not pick up the date format (it should be displayed as in the second picture, but it is not recognized). I want it displayed in the "yyyy-MM-dd'T'HH:mm:ss.SSSSZ" format.

  2. Although the second and third columns use the same cell format, the second column comes out as a string prefixed with "₩ ", and the third column comes out in scientific notation. I want both expressed as plain numbers, e.g. ₩ 100,000 -> 100000.
     In the second column the cells hold numeric values whose cell format adds a currency symbol and a space (3000000 is displayed as ₩ 3,000,000); in the third column the cell value is computed by a formula such as =C3+(C3*0.35).

  3. Columns without a header should also be displayed when they contain data, but such columns are currently ignored.

This is the code I use to read the file:
String workSheet = String.format("'%s'!A1", excel.getWorkSheet());

Dataset<Row> df = sparkSession.read()
        .format("com.crealytics.spark.excel")
        .option("dataAddress", workSheet)
        .option("header", excel.isDefaultHeader())
        .option("maxColumns", 1000) // todo GUARDRAILS
        .option("columnNameOfCorruptRecord", "true")
        .option("columnNameOfRowNumber", "true")
        .option("inferSchema", "false")
        .option("enforceSchema", "false")
        .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSZ")
        .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSZ")
        .load(paths.left);
df.show();
df.schema();

I've tried changing the values of the options shown above, but the result stays the same.

Please let me know if you know a solution to any of these problems.
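For reference, this is the kind of conversion I'm after, expressed as plain Spark post-processing. It is only a minimal sketch under my own assumptions: every cell is read as a string, and the column names "date", "price" and "total" are placeholders for my real headers.

// needed imports (top of the class):
// import static org.apache.spark.sql.functions.col;
// import static org.apache.spark.sql.functions.regexp_replace;
// import static org.apache.spark.sql.functions.to_timestamp;

Dataset<Row> raw = sparkSession.read()
        .format("com.crealytics.spark.excel")
        .option("dataAddress", workSheet)
        .option("header", excel.isDefaultHeader())
        .option("inferSchema", "false") // keep every cell as a plain string
        .load(paths.left);

Dataset<Row> cleaned = raw
        // "₩ 100,000" -> 100000: drop everything that is not a digit, sign or dot, then cast
        .withColumn("price", regexp_replace(col("price"), "[^0-9.\\-]", "").cast("long"))
        // "1.95E5" -> 195000.0: casting a string to double also parses scientific notation
        .withColumn("total", col("total").cast("double"))
        // text date -> timestamp in the format I actually want
        .withColumn("date", to_timestamp(col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSSSZ"));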

Expected Behavior

  1. I want the date column displayed in the "yyyy-MM-dd'T'HH:mm:ss.SSSSZ" format.

  2. I want plain numbers, without the cell formats (styles) and without scientific notation.

  3. Columns without a header should also be displayed when they contain data, but such columns are currently ignored (see the sketch below).
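For the headerless columns, my assumption (untested) is that reading the sheet with header set to false would keep every column, and the known ones could then be renamed afterwards. The generated _c0/_c1 names below are also my assumption, based on how the CSV reader names columns:

Dataset<Row> all = sparkSession.read()
        .format("com.crealytics.spark.excel")
        .option("dataAddress", workSheet)
        .option("header", "false") // treat the header row as data so no column is dropped
        .option("inferSchema", "false")
        .load(paths.left);

// Assumed generated names _c0, _c1, ...; rename the columns whose headers I do know.
Dataset<Row> named = all
        .withColumnRenamed("_c0", "date")
        .withColumnRenamed("_c1", "price");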

Steps To Reproduce

error.xlsx

Environment

- Spark version:

implementation group: 'org.apache.spark', name: 'spark-yarn_2.12', version: '3.5.0'
implementation group: 'org.apache.spark', name: 'spark-core_2.12', version: '3.5.0'
implementation group: 'org.apache.spark', name: 'spark-sql_2.12', version: '3.5.0'

- Spark-Excel version:

implementation group: 'com.crealytics', name: 'spark-excel_2.12', version: '0.14.0'

- OS: Spring Boot 2.7.6

- Cluster environment

Anything else?

No response

github-actions[bot] commented 4 months ago

Please check these potential duplicates:

nightscape commented 4 months ago

Please always use the newest version when reporting bugs. Some things might already have been fixed in the meantime.
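For example, with Spark 3.5 the dependency and read call would look roughly like this (a sketch only; the version number is an example, so pick the newest spark-excel release matching your Spark version from Maven Central):

// build.gradle: example only, use the latest release for Spark 3.5.x
implementation group: 'com.crealytics', name: 'spark-excel_2.12', version: '3.5.0_0.20.3'

Dataset<Row> df = sparkSession.read()
        .format("excel") // short name of the V2 data source in recent releases
        .option("dataAddress", workSheet)
        .option("header", excel.isDefaultHeader())
        .load(paths.left);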