databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
499 stars 226 forks

Shortcut common type inference cases to fail fast, speed up inference #660

Closed srowen closed 12 months ago

srowen commented 1 year ago

In schema inference, many different types are tried for each input value. This can get very slow in some cases, especially where the true type is just 'string', since every other candidate type is attempted first. This change adds several shortcuts to the type inference code so that it fails fast, before the expensive parsing code runs, wherever it's clear the parsing can't succeed. It also avoids relying on a thrown exception in one case, for better speed.
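To illustrate the kind of shortcut described above, here is a minimal sketch. It is not the code from this PR: `FailFastInference`, `inferType`, `couldBeNumber`, and the `Inferred*` types are hypothetical names made up for the example, and the real inference code handles more types and options. The sketch assumes two things: a cheap first-character check can rule out numeric parsing for most plain strings, and booleans can be detected by string comparison rather than by catching an exception.

```scala
import scala.util.Try

// Hypothetical, self-contained sketch; not the spark-xml implementation.
object FailFastInference {
  sealed trait InferredType
  case object InferredBoolean extends InferredType
  case object InferredLong extends InferredType
  case object InferredDouble extends InferredType
  case object InferredString extends InferredType

  // Boolean check by plain comparison: no exception is thrown on a mismatch.
  private def isBoolean(s: String): Boolean =
    s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")

  // Cheap pre-check: only a digit, sign, or decimal point can start a number here.
  // Simplification (assumption): values such as "NaN" or "Infinity" are treated as strings.
  private def couldBeNumber(s: String): Boolean = {
    val c = s.charAt(0)
    c.isDigit || c == '-' || c == '+' || c == '.'
  }

  // The expensive, exception-based parses; only reached after the pre-check passes.
  private def isLong(s: String): Boolean = Try(s.toLong).isSuccess
  private def isDouble(s: String): Boolean = Try(s.toDouble).isSuccess

  def inferType(raw: String): InferredType = {
    val s = raw.trim
    if (s.isEmpty) InferredString
    else if (isBoolean(s)) InferredBoolean
    else if (!couldBeNumber(s)) InferredString // fail fast: skip all numeric parsing
    else if (isLong(s)) InferredLong
    else if (isDouble(s)) InferredDouble
    else InferredString
  }
}
```

With this shape, an input like `hello` is classified as a string after one character comparison, without ever entering `toLong` or `toDouble` and paying for their exceptions.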

srowen commented 12 months ago

I've got a customer checking out this change too. If I put it in, I'll also need to get it applied to the outstanding patch against Spark that ports this.