databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

spark.read.format("xml").load(path) does not handle URIs with a comma (,) #605

Closed embrike closed 1 year ago

embrike commented 2 years ago

Hello,

We have experienced problems with reading URIs that have a comma in them. Given a path of the form: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/file, name

From my understanding, this is a legal file name. It looks like the XML library does some further processing of the URI and drops everything from the comma onward, resulting in the error message: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/file
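A minimal repro of what we are seeing (same placeholders as above; the file itself exists):

    # Path contains a comma followed by a space.
    path = "abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/file, name"
    df = spark.read.format("xml").load(path)
    # -> InvalidInputException: Input path does not exist: ...<path>/file
    #    (everything from the comma onward has been stripped)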

Any resolution to this?


Further stack trace, if needed:

    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:137)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:303)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:57)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:303)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:57)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:303)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2746)
    at org.apache.spark.rdd.RDD.$anonfun$fold$1(RDD.scala:1193)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:419)
    at org.apache.spark.rdd.RDD.fold(RDD.scala:1187)
    at com.databricks.spark.xml.util.InferSchema$.infer(InferSchema.scala:95)
    at com.databricks.spark.xml.XmlRelation.$anonfun$schema$1(XmlRelation.scala:44)
    at scala.Option.getOrElse(Option.scala:189)
    at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:42)
    at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:236)
    at sun.reflect.GeneratedMethodAccessor349.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

srowen commented 2 years ago

I don't think that's legal in paths used by Spark; I believe it treats a comma as a path delimiter. Can you escape the comma in the URI? In any event, it won't be directly related to this project itself.
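To illustrate the delimiter behavior, as a sketch (paths illustrative):

    # Hadoop's input-path handling treats a comma-separated string as a
    # list of paths, so this reads two files rather than one:
    rdd = sc.textFile("/data/part1.txt,/data/part2.txt")
    # A comma inside a single file name is therefore ambiguous to that code path.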

embrike commented 2 years ago

Hello srowen,

Thanks for the fast reply. Appreciate it.

Do you by any chance have a link to documentation regarding legal file names in Spark? I have tried doing some research, but with no luck.

If this is outside the scope of the library, we will have to deal with it on our end. However, I have tested reading files with the same setup as described using spark.read.format("json"), and it handles files with a space and a comma in them.

Why does it differ when using the XML library?
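For reference, the comparison we ran (same placeholder path as above):

    path = "abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/file, name"
    spark.read.format("json").load(path)  # works despite the space and comma
    spark.read.format("xml").load(path)   # fails: path truncated at the comma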

srowen commented 2 years ago

Hm, it's probably because this library uses the Hadoop FileInputFormat code path, while JSON may use the DSv2 path. Paths are generally interpreted as globs, where commas and other special characters have meaning. You could try escaping the comma with a backslash? I don't know where, or whether, this is documented; I just recall it from experience.
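Roughly what I mean, as a sketch (paths illustrative; the backslash escape is just something to try):

    # Globs give commas meaning inside brace alternation:
    spark.read.format("xml").load("/data/{a,b}.xml")  # matches a.xml and b.xml
    # The escape to try, with the backslash doubled for the Python string literal:
    spark.read.format("xml").load("/data/file\\, name")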

embrike commented 2 years ago

It does not look like escaping the comma with a backslash does anything, unfortunately.

But thanks for the response. :)

Feel free to close the issue. It seems the solution is to fix the file names on our side before reading them with the XML library.
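A sketch of the kind of renaming we will do on our side (assumes Databricks dbutils is available; the directory and replacement character are illustrative):

    src_dir = "abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/"
    for f in dbutils.fs.ls(src_dir):
        if "," in f.name:
            # Replace the comma so the XML reader no longer truncates the path.
            dbutils.fs.mv(f.path, src_dir + f.name.replace(",", "_"))
    df = spark.read.format("xml").load(src_dir)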