Hi
The addition of the ".0" is how the value is being displayed. It does this because the value is a double, and this, in turn, is because Excel stores all worksheet numbers as doubles (whatever number formats you choose in Excel, they end up as doubles). They can easily be cast to integers after extraction, though.
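For illustration, a minimal PySpark sketch of the cast-after-extraction approach. The "excel" format name, the path, and applying it to contractID are assumptions for the example, not confirmed details from this thread:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Excel stores worksheet numbers as doubles, so IDs arrive as e.g. 1001.0;
# format name and path here are illustrative
df = spark.read.format("excel").load("/data/SampleData.xlsx")

# Cast the double-valued ID column back to an integral type after extraction
df = df.withColumn("contractID", col("contractID").cast("long"))
```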
The badRecordsPath option is an interesting one. For formats like CSV and JSON the string of the record read so far is written out, but Excel doesn't have a string representation of a record, so it would need to be converted to a different format before being written out. It could be something we look at as an option, though.
Hi
Thank you for the response
I am reading the dataframe with the .schema option, expecting all bad records to go into the badRecordsPath given in the options, but it keeps null values in place of the bad records (just like the cast option). So I thought of doing a subtract operation between the dataframe read without a schema and the dataframe read with a schema (or the casted dataframe) to get the bad records. This logic works for me for Delta sources, as Delta also doesn't support the badRecordsPath option. But since the dataframe read without a schema has all the double values in it, the subtract/left anti join doesn't work. So, is there any option to get bad records separately while reading a folder of Excel files with a schema?
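For reference, a minimal PySpark sketch related to this idea: rather than joining the raw and typed reads, one way around the double mismatch is to flag rows on the raw read where a non-null cell fails the cast to its expected type. This is a variant of the subtract approach, not the same technique; the schema, column names, format name, and paths are illustrative, and it assumes the untyped read surfaces mixed-type columns as strings:

```python
from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Illustrative expected schema for the typed read
expected_schema = StructType([
    StructField("contractID", LongType(), True),
    StructField("name", StringType(), True),
])

# Untyped read: numeric cells surface as doubles (e.g. 1001.0)
raw_df = spark.read.format("excel").load("/data/excel_folder/")

# Flag a row as bad if any cell is non-null but fails the cast to its
# expected type (a double like 1001.0 casts to long cleanly, so good
# rows are not flagged by the ".0" representation)
conds = [
    F.col(f.name).isNotNull() & F.col(f.name).cast(f.dataType).isNull()
    for f in expected_schema.fields
]
bad_records = raw_df.filter(reduce(lambda a, b: a | b, conds))
```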
Bad records tend to be those which do not comply with the document format, such as invalid JSON in a multi-line JSON file, or CSV files where a line doesn't have enough fields.
So reading this, are you looking for records which do not conform to the supplied schema?
Yes, in our case bad records are the records which do not conform to the schema.
So one of the things that we could do is implement a failure safe parser which would write the data to a "corrupt" column, which you would need to specify in the schema you're passing in. The issue would be working out how to represent the Excel data this way, and there are some who advise against using this, but it would mean that in PERMISSIVE mode you would have a way of getting to that data.
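For illustration, if such a parser were implemented, usage would presumably mirror Spark's built-in CSV/JSON readers. This is a hypothetical sketch only; the mode and columnNameOfCorruptRecord options are assumptions borrowed from those readers, not a confirmed API of this library:

```python
from pyspark.sql.types import StructType, StructField, LongType, StringType

# The corrupt column would need to be declared in the user-supplied schema,
# mirroring how Spark's CSV/JSON readers handle PERMISSIVE mode
schema = StructType([
    StructField("contractID", LongType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # hypothetical column
])

df = (spark.read.format("excel")
      .schema(schema)
      .option("mode", "PERMISSIVE")                           # assumed option
      .option("columnNameOfCorruptRecord", "_corrupt_record") # assumed option
      .load("/data/excel_folder/"))

bad_rows = df.filter(df["_corrupt_record"].isNotNull())
```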
Another option would be to simply check for null values and filter out those rows, especially in instances where you know the column should not contain nulls.
The problem with Excel is you could end up with a lot of these, as it's pretty (very) rubbish as a data interchange format, so blank rows, partial rows, etc. would very likely fail a schema check.
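A minimal sketch of the null-filter suggestion above, assuming a schema-applied read and that contractID is a column that should never legitimately be null (names and paths are illustrative):

```python
from pyspark.sql import functions as F

# Schema-applied read: rows that failed the schema surface as nulls
typed_df = spark.read.format("excel").schema(expected_schema).load("/data/excel_folder/")

# Split on a column that should never legitimately be null
good_rows = typed_df.filter(F.col("contractID").isNotNull())
bad_rows = typed_df.filter(F.col("contractID").isNull())
```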
Have a look at PR #27 and let me know if it's the kind of thing you're looking for.
Hi Darren,
That was very quick! Will check and get back to you.
Thank you!
I'm trying to avoid the list of house chores I need to do 😆
Lol!
Hi Darren
Yes, PR #27 is what I am looking for, but now I'm getting another issue!
Weird, that looks like a problem reading the Excel source file. Is there a chance you're able to share it?
Yes, this is the file I was trying to read. I placed it inside a folder and gave the folder path in the Spark read command. SampleData.xlsx
Hi Darren, Any update on this issue?
I've loaded up the sample data and can't re-create the error that you're seeing. I'm using a version of the library compiled for Spark 3.3.2 against Databricks Runtime 12.2 LTS
I am using DBR 11.3 LTS and downloaded the spark-excel-3.3.0_0.1.10.jar file. Can you downgrade your DBR and try again if possible? Which JAR file are you using for 12.2 LTS?
EDIT: I redownloaded the JAR file and ran it again, and it is working fine now. A small question from my end: if two files in a folder have two different schemas, how do I merge the schemas of the two files (for CSVs, we can use the mergeSchema option)? Right now it is only taking the first file's schema!
So that is by design, as Excel files tend to get put together by type and so have the same schema. If you know the schema in advance then providing it is always the better option, even when using CSV files as it will be more performant.
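That said, if the files genuinely differ, one workaround is to read them separately and union by column name. A sketch assuming two illustrative file names and the "excel" format name (unionByName with allowMissingColumns requires Spark 3.1+):

```python
# Read each file (or each group of same-schema files) separately
df_a = spark.read.format("excel").load("/data/folder/FileA.xlsx")
df_b = spark.read.format("excel").load("/data/folder/FileB.xlsx")

# Union by column name; columns missing from one side become nulls
merged = df_a.unionByName(df_b, allowMissingColumns=True)
```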
Original data: check the contractID in the original data. After reading with inferSchema, a ".0" has been added to the IDs.
Also, unlike CSV, badRecordsPath is not working here when reading with a schema. Has anyone tried using a schema with badRecordsPath?
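For comparison, this is how badRecordsPath is used with a CSV source; it is a Databricks-specific option for CSV/JSON sources and, per the discussion above, not currently supported by this Excel reader (schema and paths here are illustrative):

```python
from pyspark.sql.types import StructType, StructField, LongType, StringType

expected_schema = StructType([
    StructField("contractID", LongType(), True),
    StructField("name", StringType(), True),
])

# Databricks-specific: rows failing the schema are written out under
# badRecordsPath instead of surfacing as nulls (CSV/JSON sources only)
df = (spark.read.format("csv")
      .option("header", "true")
      .option("badRecordsPath", "/tmp/badRecords")
      .schema(expected_schema)
      .load("/data/csv_folder/"))
```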