databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

schema not respected when reading multiple xml files #610

Closed JohnStokes228 closed 1 year ago

JohnStokes228 commented 1 year ago

I'm attempting to read a large number of individual XML files into a Spark DataFrame. To do this with spark-xml I have defined a custom schema. When asked to read the batch in (using a wildcard in the folder path), Spark raises an error that a certain column is duplicated. From manually checking the files I know this is not the case.

Through experimentation I have been able to identify an odd occurrence, namely:

Any ideas what's going wrong here? I was under the impression that by providing a schema in the read options I would be enforcing that schema for the whole table, but this is clearly not the case. It also seems odd that both halves of the file list can be read in separately using that schema, but not together. Is this an issue where the schema I have defined is simply not being used somehow?

It may be worth noting that every column in the XMLs has minOccurs=0, i.e. it may or may not exist in any given file, so the schema I am applying to the data on read uses nullable=True to represent this.

cheers :)

srowen commented 1 year ago

Please post more details. What is your schema, what does it infer differently, etc.? I imagine your files do not have quite the same schema, and the schema you supply isn't a superset of them either. This would be consistent: some files infer one schema, others infer another, and neither matches what you expect.

JohnStokes228 commented 1 year ago

I'll need to seek permission to share the schema.

Whilst it's not impossible that the files do not share exactly the same schema, this doesn't quite line up with the issue in my mind: I have provided both batches with an explicit schema to read in with, and both have successfully read in under those rules, yet they cannot be read in together, and calls to df.schema on the resulting DataFrames do not match.

To be clear, I am not using schema inference in this operation; I am explicitly defining the schema.

srowen commented 1 year ago

Can you create a reproducer that narrows it down? Are you saying that you specify schema X and it correctly reads subsets A and B, but not A+B? Above I thought you were saying that X fails with A+B, while the inferred schemas for A and B each work but aren't the same.

Can you share the error? A duplicate column could mean your schema itself has a duplicate, or perhaps that a column expected once actually occurs several times.

JohnStokes228 commented 1 year ago

I'll look to create a minimal reproduction :)

In the meantime: yes, sorry, my explanation wasn't clear. I do indeed mean that I specify schema X and it correctly reads subsets A and B, but not A+B. For the resulting DataFrames, df.schema does not yield X in either case.

The error itself looks like:

files = 'file/path/*.xml'
read_options = {'rowTag': 'TAG', 'multiline': 'true', 'schema': custom_schema}
df = spark.read.format('xml').options(**read_options).load(files)

>>> AnalysisException: Found duplicate column(s) in the data schema: 'COLNAME'

Whilst that would suggest some issue with duplicate columns, it doesn't appear to be the case in the source data at all, and surely if there were such an issue the files would never work, rather than being readable as two separate chunks?

srowen commented 1 year ago

You're sure custom_schema doesn't have a duplicate? And out of curiosity, what happens if you set "inferSchema" to false? It shouldn't matter, but I'm just grasping for ideas.

JohnStokes228 commented 1 year ago

I think this was me being very dumb actually, sorry.

The .options() method ignores 'schema' as an argument, hence the resulting tables not matching. When passing schema=custom_schema inside the .load() call it works as intended. It could be nice for options() to error explicitly rather than silently doing nothing when you give it unrecognized keys, but otherwise, yeah, this one was on me!

cheers again,

John

srowen commented 1 year ago

Oh I see, right it needs to be .schema(). I think that part might be handled by Spark rather than this library.