databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Reading many XML files at once is very slow #499

Closed Youssefares closed 3 years ago

Youssefares commented 3 years ago

I wrote a simple function around reading XML files, to which I feed multiple paths. My file system is mounted on an Azure Blob Storage folder.

def read_xml(paths, schema, row_tag, value_tag):
  # Join without surrounding spaces: Spark/Hadoop treat the string as a
  # comma-separated list of input paths.
  paths_str = ','.join(paths)
  return spark.read \
    .format("com.databricks.spark.xml") \
    .schema(schema) \
    .option("rowTag", row_tag) \
    .option("valueTag", value_tag) \
    .load(paths_str)
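
For context, I collect the paths up front roughly like this (a sketch: the mount point, the .xml filter, and the tag values are placeholders for my actual setup; dbutils is the Databricks notebook utility):

# List the mounted container once and keep only the XML files
# (/mnt/data/ is a placeholder mount point).
paths = [f.path for f in dbutils.fs.ls("/mnt/data/") if f.path.endswith(".xml")]
df = read_xml(paths, schema, row_tag="record", value_tag="_VALUE")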

Loading more than 1k paths at a time produces a lot of garbage-collection messages on the driver. I am curious what is consuming memory at the driver level, and whether reading this many files at once is outside the expected use case for spark-xml. Despite defining the schema myself and finding the paths before running the read (instead of using a wildcard path), the read command takes about 16 minutes for 1000 files. Shouldn't this be faster?

2020-11-29T12:01:53.358+0000: [GC (Allocation Failure) [PSYoungGen: 67369284K->159801K(67502592K)] 67652475K->442999K(203053056K), 0.0336629 secs] [Times: user=0.36 sys=0.01, real=0.04 secs] 
2020-11-29T12:02:24.336+0000: [GC (Allocation Failure) [PSYoungGen: 67385913K->164927K(67498496K)] 67669111K->448133K(203048960K), 0.1138776 secs] [Times: user=0.71 sys=0.38, real=0.12 secs] 
2020-11-29T12:03:11.514+0000: [GC (Allocation Failure) [PSYoungGen: 67391039K->164990K(67508224K)] 67674245K->448204K(203058688K), 0.1390703 secs] [Times: user=1.43 sys=0.55, real=0.14 secs] 
2020-11-29T12:03:57.115+0000: [GC (Allocation Failure) [PSYoungGen: 67403390K->168076K(67505152K)] 67686604K->451291K(203055616K), 0.1005766 secs] [Times: user=0.77 sys=0.37, real=0.10 secs] 
2020-11-29T12:04:43.577+0000: [GC (Allocation Failure) [PSYoungGen: 67406476K->166876K(67513856K)] 67689691K->450091K(203064320K), 0.1179351 secs] [Times: user=1.05 sys=0.61, real=0.12 secs] 
2020-11-29T12:05:28.253+0000: [GC (Allocation Failure) [PSYoungGen: 67416028K->167311K(67510272K)] 67699243K->450526K(203060736K), 0.1393799 secs] [Times: user=1.29 sys=0.67, real=0.14 secs] 
2020-11-29T12:06:12.751+0000: [GC (Allocation Failure) [PSYoungGen: 67416463K->171608K(67518976K)] 67699678K->454823K(203069440K), 0.0989740 secs] [Times: user=0.78 sys=0.32, real=0.10 secs] 
2020-11-29T12:06:57.808+0000: [GC (Allocation Failure) [PSYoungGen: 67432536K->169782K(67516928K)] 67715751K->452996K(203067392K), 0.0929449 secs] [Times: user=0.68 sys=0.17, real=0.09 secs] 
2020-11-29T12:07:42.664+0000: [GC (Allocation Failure) [PSYoungGen: 67430710K->174517K(67524096K)] 67713924K->457732K(203074560K), 0.1185074 secs] [Times: user=0.91 sys=0.18, real=0.12 secs] 
2020-11-29T12:08:27.877+0000: [GC (Allocation Failure) [PSYoungGen: 67446197K->176125K(67522560K)] 67729412K->459348K(203073024K), 0.1355439 secs] [Times: user=0.85 sys=0.82, real=0.14 secs] 
2020-11-29T12:08:45.417+0000: [GC (System.gc()) [PSYoungGen: 25929861K->29038K(67501056K)] 26213083K->426523K(203051520K), 0.0972965 secs] [Times: user=1.13 sys=0.64, real=0.10 secs] 
srowen commented 3 years ago

Ultimately this is more a function of Spark than of this library; I imagine you'd see something similar reading the same files as text. Loading a bunch of paths can consume resources in a number of ways. For example, are they 1000 directories, each with lots of files? Listing those can take a long time. You may also end up with far more than 1000 partitions, each of them small; if so, you could try repartitioning down by an order of magnitude, as in the sketch below. I don't think globs work (?), but it could be worth checking whether they do. You can also try a bigger driver.
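
A minimal sketch of that repartitioning idea, reusing the read_xml function above (the target of 100 partitions is only an illustration):

df = read_xml(paths, schema, row_tag, value_tag)
# Each small input file tends to become its own partition; coalesce
# down by an order of magnitude to cut per-partition overhead.
df = df.coalesce(100)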

Youssefares commented 3 years ago

Thanks for your reply! I see. Globs do work, and little to no GC activity is triggered. The paths themselves point to actual files; the goal was precisely to avoid the listing time you mentioned, so I assumed listing the files myself would be faster than a glob.
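
For reference, the glob version looks roughly like this (the mount point and pattern stand in for my actual layout):

df = spark.read \
  .format("com.databricks.spark.xml") \
  .schema(schema) \
  .option("rowTag", row_tag) \
  .option("valueTag", value_tag) \
  .load("/mnt/data/*.xml")  # one glob instead of 1000 comma-separated paths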

You can go ahead and close this, since there's no reason to believe spark-xml itself is doing the damage here. It might be worth following up on Spark's own repo, though, about the different memory behavior between globs and comma-separated paths.

srowen commented 3 years ago

OK, good to be reminded that globs do work, and that they helped here. Makes sense: Spark is a lot smarter about listing directories when you let it do the listing.