elastacloud / spark-excel

A Spark data source for reading Microsoft Excel files
https://www.elastacloud.com
Apache License 2.0

Integer values are converting into float while reading with inferschema #26

Closed Manasa81 closed 1 year ago

Manasa81 commented 1 year ago

Original data: check the contractID column in the source file. [screenshot of original data] After reading with inferSchema, a ".0" has been added to the IDs. [screenshot after reading with inferSchema]

Also, unlike CSV, badRecordsPath is not working here when reading with a schema. Has anyone tried using a schema with badRecordsPath?

dazfuller commented 1 year ago

Hi

The addition of the ".0" is down to how the value is being displayed. It happens because the value is a double, and that, in turn, is because Excel stores all worksheet numbers as doubles (whatever format options you pick in Excel, they end up as doubles). They can easily be cast to integers after extraction, though.
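As a minimal plain-Python sketch of why that cast is safe for whole-valued doubles (in Spark itself this is just `col("contractID").cast("int")` on the extracted column; the helper name below is made up for illustration):

```python
# Excel stores every worksheet number as an IEEE-754 double, so an ID like
# 12345 comes back as 12345.0 and is displayed with a trailing ".0".
def to_int_if_whole(value: float):
    """Cast a whole-valued double back to int; leave anything else alone."""
    return int(value) if value.is_integer() else value

print(to_int_if_whole(12345.0))  # 12345
print(to_int_if_whole(12345.5))  # 12345.5
```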

The badRecordsPath is an interesting one. For formats like CSV and JSON, the string of the record read so far is written out, but Excel doesn't have a string representation of a record, so it would need to be converted to a different format before being written out. It could be something that we look at as an option, though.

Manasa81 commented 1 year ago

Hi

Thank you for the response

I am reading the dataframe with the .schema option, expecting all bad records to go into the badRecordsPath given in the options, but it keeps null values in place of the bad records (just like the cast option). So I thought of doing a subtract operation between the dataframe read without the schema option and the dataframe read with the schema option (or the casted dataframe) to get the bad records. This logic works for me for Delta sources, as Delta also doesn't support the badRecordsPath option. But since the dataframe read without a schema has all the double values in it, the subtract/left anti join doesn't work. So, is there any option to get bad records separately while reading a folder of Excel files with a schema?
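The mismatch described above can be reproduced in plain Python (the `normalise` helper is hypothetical, just to illustrate the idea): the anti-join between the two reads only lines up once the inferred doubles are mapped back to integers. In Spark the equivalent fix would be casting the inferred column, e.g. `df.withColumn("contractID", col("contractID").cast("int"))`, before the subtract/left anti join.

```python
# "raw_ids" mimics values read without a schema (Excel numbers arrive as
# doubles, so they render as "100.0"); "typed_ids" mimics the same column
# read with an explicit int schema, keeping only values that parsed cleanly.
raw_ids = ["100.0", "200.0", "not-a-number"]
typed_ids = {100, 200}

def normalise(s):
    """Hypothetical helper: map '100.0'-style strings back to ints."""
    try:
        f = float(s)
        return int(f) if f.is_integer() else f
    except ValueError:
        return s  # leave non-numeric junk as-is

# The anti-join only works once both sides share one representation.
bad = [s for s in raw_ids if normalise(s) not in typed_ids]
print(bad)  # ['not-a-number']
```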

dazfuller commented 1 year ago

Bad records tend to be those which do not comply with the document format, so invalid JSON in a multi-line JSON file, or CSV files where a line doesn't have enough fields.

So reading this, are you looking for records which do not conform to the supplied schema?

Manasa81 commented 1 year ago

Yes. In our case, bad records are the records which do not conform to the schema.

dazfuller commented 1 year ago

So one of the things that we could do is implement a failure safe parser which would write the data to a "corrupt" column, which you would need to specify in the schema you're passing in. The issue would be working out how to represent the Excel data this way, and there are some who advise against using this, but it would mean that in PERMISSIVE mode you would have a way of getting to that data.
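A toy sketch of what such a failure-safe parser does (plain Python, not this library's actual implementation; the `_corrupt_record` column name is borrowed from Spark's built-in CSV/JSON convention, and the stringified-dict fallback is just one possible way to represent an Excel row, as discussed above):

```python
# Try to coerce each cell to its schema type; on any failure, null the cell
# and keep the raw row content in a "_corrupt_record" column, the way
# Spark's PERMISSIVE mode does for CSV and JSON sources.
def parse_row(raw, schema):
    out, corrupt = {}, None
    for col, typ in schema.items():
        try:
            out[col] = typ(raw[col])
        except (ValueError, TypeError, KeyError):
            out[col] = None
            # Excel has no native string form for a record, so a
            # stringified dict stands in as the "corrupt" representation.
            corrupt = str(raw)
    out["_corrupt_record"] = corrupt
    return out

schema = {"contractID": int}
print(parse_row({"contractID": "123"}, schema))
print(parse_row({"contractID": "12x"}, schema))
```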

Another option would be to simply check for null values and filter out those rows, especially in instances where you know the column should not contain nulls.
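In PySpark that null check is just `df.filter(col("contractID").isNull())` against a column the schema says must not be null; a plain-Python equivalent of the split (the row data here is invented for illustration):

```python
# After a schema-based read, cells that failed the cast come back as null,
# so splitting on null in a required column separates good and bad rows.
rows = [
    {"contractID": 100, "name": "Acme"},
    {"contractID": None, "name": "Bad Row"},  # failed the int cast -> null
]

bad_rows = [r for r in rows if r["contractID"] is None]
good_rows = [r for r in rows if r["contractID"] is not None]

print(bad_rows)   # [{'contractID': None, 'name': 'Bad Row'}]
print(good_rows)  # [{'contractID': 100, 'name': 'Acme'}]
```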

The problem with Excel is you could end up with a lot of these as it's pretty (very) rubbish as a data interchange format, so blank rows, partial rows, etc... would very likely fail a schema check.

dazfuller commented 1 year ago

Have a look at PR #27 and let me know if it's the kind of thing you're looking for.

Manasa81 commented 1 year ago

Hi Darren,

That was very quick! Will check and get back to you.

Thank you!

dazfuller commented 1 year ago

I'm trying to avoid the list of house chores I need to do 😆

Manasa81 commented 1 year ago

Lol!

Manasa81 commented 1 year ago

Hi Darren

Yes, PR #27 is what I am looking for, but now I'm getting another issue!

[screenshot of the new error]

dazfuller commented 1 year ago

Weird, that looks like a problem reading the Excel source file. Is there a chance you're able to share it?

Manasa81 commented 1 year ago

Yes, this is the file I was trying to read: SampleData.xlsx. I placed it inside a folder and gave the folder path in the Spark read command.

Manasa81 commented 1 year ago

Hi Darren, Any update on this issue?

dazfuller commented 1 year ago

I've loaded up the sample data and can't re-create the error that you're seeing. I'm using a version of the library compiled for Spark 3.3.2 against Databricks Runtime 12.2 LTS

[screenshot of the sample data loading successfully]

Manasa81 commented 1 year ago

I am using DBR 11.3 LTS and downloaded the spark-excel-3.3.0_0.1.10.jar file. Can you downgrade your DBR and try again if possible? Which JAR file are you using for 12.2 LTS?

EDIT: I re-downloaded the jar file and ran it again, and it is working fine now. Small question from my end: if two files in a folder have two different schemas, how do I merge the schemas of the two files? (For CSVs, we can use the mergeSchema option.) Right now it is only taking the first file's schema!

dazfuller commented 1 year ago

So that is by design, as Excel files tend to get put together by type and so have the same schema. If you know the schema in advance then providing it is always the better option, even when using CSV files, as it will be more performant.
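For what it's worth, one workaround when the schemas genuinely differ is to read each file separately and union the results by column name; in PySpark that is `df_a.unionByName(df_b, allowMissingColumns=True)`. A plain-Python sketch of that merge-by-name idea (rows and column names invented for illustration):

```python
# Two "files" whose rows have different columns.
file_a = [{"contractID": 1, "name": "Acme"}]
file_b = [{"contractID": 2, "region": "EMEA"}]

# The merged "schema" is the union of column names, in first-seen order.
columns = []
for row in file_a + file_b:
    for c in row:
        if c not in columns:
            columns.append(c)

# Rows missing a column get null (None) for it, as unionByName does
# with allowMissingColumns=True.
merged = [{c: row.get(c) for c in columns} for row in file_a + file_b]
print(merged)
```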