databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Clone HadoopConf to avoid cross usage of tags while parsing the xml #582

Closed sandeep-katta0102 closed 2 years ago

sandeep-katta0102 commented 2 years ago

This PR fixes issue #581.
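As a rough illustration of the idea in the PR title, here is a minimal sketch of why copying the Hadoop configuration per read keeps one read's tag from being overwritten by another. It is not the actual patch: `RowTagKey` is a hypothetical key name, `confForReadShared` and `confForReadCloned` are made-up helpers, and it assumes hadoop-common is on the classpath (as it is in spark-shell).

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical key name, used only for this illustration.
val RowTagKey = "example.xml.rowTag"

// Stand-in for the one Hadoop configuration shared by every read in a session.
val sharedConf = new Configuration(false)

// Unsafe variant: each reader writes its tag into the same shared object,
// so a concurrent reader with a different rowTag can overwrite it.
def confForReadShared(rowTag: String): Configuration = {
  sharedConf.set(RowTagKey, rowTag)
  sharedConf
}

// Safer variant: copy the shared configuration first, then set the tag on the copy.
def confForReadCloned(rowTag: String): Configuration = {
  val conf = new Configuration(sharedConf) // Configuration's copy constructor
  conf.set(RowTagKey, rowTag)
  conf
}

val personConf = confForReadCloned("person")
val bookConf = confForReadCloned("book")

// Each read keeps its own tag, and the shared configuration is left untouched.
assert(personConf.get(RowTagKey) == "person")
assert(bookConf.get(RowTagKey) == "book")
assert(sharedConf.get(RowTagKey) == null)
```

With the shared variant, the value a reader sees depends on which thread called `set` last, which would match the kind of cross usage the title describes.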

Added unit tests and also verified the fix manually using the code below:

import scala.collection.mutable
// Job-group IDs of the ages.xml reads that came back with an empty schema
val jobGroudId_ages = mutable.Set[Long]()

val threads_ages = (1001 to 1010).map { i =>
  new Thread {
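    // Explicit ": Unit =" replaces the deprecated procedure syntax
    override def run(): Unit = {
      sc.setJobGroup(s"$i", s"$i")
      val df = spark.read.option("rowTag", "person").format("xml").load("file:/Users/XXXX/spark-xml/src/test/resources/ages.xml")
      // An empty schema is the symptom being reproduced
      if (df.schema.fields.isEmpty) {
        println(s"found repro for the ages run $i **********************")
        // mutable.Set is not thread-safe, so guard the shared set
        jobGroudId_ages.synchronized { jobGroudId_ages.add(i.toLong) }
      }
    }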
  }
}

// Job-group IDs of the books.xml reads that came back with an empty schema
val jobGroudId_books = mutable.Set[Long]()

val threads = (1 to 10).map { i =>
  new Thread {
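    // Explicit ": Unit =" replaces the deprecated procedure syntax
    override def run(): Unit = {
      sc.setJobGroup(s"$i", s"$i")
      val df = spark.read.option("rowTag", "book").format("xml").load("file:/Users/XXXX/spark-xml/src/test/resources/books.xml")
      // An empty schema is the symptom being reproduced
      if (df.schema.fields.isEmpty) {
        println(s"found repro for the book run $i **********************")
        // mutable.Set is not thread-safe, so guard the shared set
        jobGroudId_books.synchronized { jobGroudId_books.add(i.toLong) }
      }
    }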
  }
}

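// Start both groups so reads with different rowTag values overlap
threads_ages.foreach(_.start())
threads.foreach(_.start())
// Wait for every read to finish before reporting
threads_ages.foreach(_.join())
threads.foreach(_.join())
// Number of runs per file that hit the empty-schema repro
println(s" jobGroudId_books is ${jobGroudId_books.size} ")
println(s" jobGroudId_ages is ${jobGroudId_ages.size} ")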

Before fix: [image]

After fix: [image]

HyukjinKwon commented 2 years ago

cc @srowen FYI if you find some time to take a look 🙏

HyukjinKwon commented 2 years ago

@srowen just out of curiosity, when do we roughly plan to have the next release?

srowen commented 2 years ago

No particular schedule -- on demand. Is this a sorta important fix? It's easy to roll a new release, and it has been 7 months or so since the last one, so seems OK to me.

HyukjinKwon commented 2 years ago

Not super critical, but I think it's good to have one ... could we make a release maybe? I will take a look and try the release around next week if you can't find time to take a look 👍

srowen commented 2 years ago

OK I can do it tomorrow I think

HyukjinKwon commented 2 years ago

Thank you

srowen commented 2 years ago

Done, 0.15.0 is released with this change

HyukjinKwon commented 2 years ago

Thanks!!!!