databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Convert xml to dataframe based on pyspark - using rowValidationXSDPath #669

Closed yu-tracy closed 9 months ago

yu-tracy commented 9 months ago

Hi,

I want to read books.xml (https://raw.githubusercontent.com/databricks/spark-xml/master/src/test/resources/books.xml) using an XSD schema. Here is what I did.

XSD schema: [image: Screenshot 2023-11-08 at 16 06 39]

Python code:

spark.read.format("xml").options(rowValidationXSDPath='books_schema.xsd').options(rowTag='catalog').load(books_path).show()

The result I got: [image: Screenshot 2023-11-08 at 15 59 22]

But that is not what I want. I expected something like the following: [image: Screenshot 2023-11-08 at 16 00 15]

I am not sure whether my XSD schema is wrong. Could you please help me check it? Thank you!

About the rowValidationXSDPath option: README.md says, "The XSD does not otherwise affect the schema provided, or inferred." Does that mean that even if I provide an XSD file, the parser still infers the schema, and the XSD is used only for validation? :)

srowen commented 9 months ago

Note that the XSD here does not control the schema of the output. It is only there for validation.

The result is expected as your schema defines a result with multiple cols so it must be a struct. You can select book.* to pull up all the cols. This is not related to XML.
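
As a rough illustration of the book.* suggestion, here is a minimal PySpark sketch. It assumes that, with rowTag='catalog', the repeated book elements are inferred as an array of structs, in which case the array has to be exploded before the star expansion works. If book turns out to be a single struct, the final select alone should be enough. The helper name flatten_books is hypothetical.

```python
def flatten_books(df):
    """Turn one catalog row holding many book entries into one row per book.

    Assumption: `df` was read with rowTag='catalog', so `book` is an
    array<struct<...>> column (one array element per <book> element).
    """
    # explode() emits one output row per element of the `book` array.
    exploded = df.selectExpr("explode(book) AS book")
    # "book.*" promotes every field of the struct to a top-level column.
    return exploded.select("book.*")
```

With the DataFrame from the original read, this would be called as flatten_books(df).show().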

yu-tracy commented 9 months ago

Thanks for your reply! I would like my result to show the multiple columns directly ([image: Screenshot 2023-11-09 at 09 51 07]) instead of using select book.* to pull up all the cols. Does that mean I should change my XSD schema?

srowen commented 9 months ago

The XSD is unrelated to this. You can set your rowTag to book. Is that all you need?
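
A minimal sketch of that suggestion, assuming an existing SparkSession named spark with the spark-xml package available. The function name read_rows is hypothetical, and the XSD option is left out here because it does not shape the output.

```python
def read_rows(spark, xml_path, row_tag="book"):
    """Read an XML file so that each <row_tag> element becomes one DataFrame row."""
    return (
        spark.read.format("xml")
        .options(rowTag=row_tag)  # e.g. every <book> element is parsed as a row
        .load(xml_path)
    )
```

Calling read_rows(spark, books_path).show() should then list the book fields as top-level columns, with no struct to unpack.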

yu-tracy commented 9 months ago

If I set rowTag to book, I get _corrupt_record. I am confused about this, sorry...

My code:

spark.read.format("xml").options(rowValidationXSDPath='books_schema.xsd').options(rowTag='book').load(books_path).show()

Result: [image: Screenshot 2023-11-09 at 14 34 14]

By the way, you mentioned: 'The result is expected as your schema defines a result with multiple cols so it must be a struct.' What does "schema" mean here? The XSD schema?
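
One possible explanation, offered as an assumption and not a confirmed diagnosis: the spark-xml README describes rowValidationXSDPath as validating each row individually and treating rows that fail validation like parse errors. If the XSD describes catalog as its root element, every book row may fail validation and end up in _corrupt_record. The sketch below checks this by reading the same file with and without the XSD. The helper names load_xml and xsd_causes_corrupt_records are hypothetical, and spark is assumed to be an existing SparkSession with spark-xml available.

```python
CORRUPT_COLUMN = "_corrupt_record"  # spark-xml's default corrupt-record column name


def load_xml(spark, xml_path, row_tag, xsd_path=None):
    """Read XML rows, optionally validating each row against an XSD."""
    reader = spark.read.format("xml").option("rowTag", row_tag)
    if xsd_path is not None:
        reader = reader.option("rowValidationXSDPath", xsd_path)
    return reader.load(xml_path)


def xsd_causes_corrupt_records(spark, xml_path, xsd_path, row_tag="book"):
    """Return True if the corrupt-record column appears only when the XSD is applied."""
    plain = load_xml(spark, xml_path, row_tag)
    validated = load_xml(spark, xml_path, row_tag, xsd_path)
    return CORRUPT_COLUMN not in plain.columns and CORRUPT_COLUMN in validated.columns
```

If this returns True, the XSD (not the XML file or the rowTag) is what rejects the rows, and an XSD whose root element matches the row element would be the thing to try next.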