databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

weird behaviour when there is a ">" at the end of RowTag #532

Closed kduvekot-wehkamp-nl closed 3 years ago

kduvekot-wehkamp-nl commented 3 years ago

By accident I pasted the row element name into my configuration with a trailing ">":

    df = (spark.read.format('xml')
          .options(rowTag='RowElement>')
          .load(file_location)
    )

It resulted in a valid DataFrame with the schema I expected from the data. However, when I displayed the resulting DataFrame it had only 7 rows instead of the 408238 rows I was expecting. The rows were roughly 58k apart, so it must be some "long line parsing" artifact.

So maybe it would be good to "sanitize" or "validate" the rowTag option so that it cannot include any XML restricted characters.

srowen commented 3 years ago

Yeah, it can't have the angle bracket. If it does, it will probably read the closing bracket as part of the tag, then keep searching for the next angle bracket to close it, which would mean it skips some random data. Maybe I should have it throw an error in this case.
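
Until the library rejects such values itself, a caller-side guard along these lines could catch the mistake before the read. This is only a sketch: `validate_row_tag` is a hypothetical helper name, not part of spark-xml, and the set of rejected characters is an assumption based on what cannot appear in an XML element name.

```python
import re

# Characters that cannot appear in an XML element name.
# Hypothetical caller-side check; spark-xml does not expose this helper.
_FORBIDDEN = re.compile(r"[<>&\"'/\s]")


def validate_row_tag(row_tag: str) -> str:
    """Return row_tag unchanged, or raise ValueError if it looks malformed."""
    if not row_tag:
        raise ValueError("rowTag must not be empty")
    match = _FORBIDDEN.search(row_tag)
    if match:
        raise ValueError(
            f"rowTag {row_tag!r} contains invalid character {match.group()!r}"
        )
    return row_tag


# Intended use with the reader from the report above (not executed here):
#   df = (spark.read.format('xml')
#         .options(rowTag=validate_row_tag('RowElement'))
#         .load(file_location))
```

With this in place, passing 'RowElement>' raises a ValueError instead of silently producing a DataFrame with missing rows.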

kduvekot-wehkamp-nl commented 3 years ago

Cool .. thanks for the quick solution 👍