Closed: Youssefares closed this issue 3 years ago
Ultimately this is more a function of Spark than this library. I imagine you might see something similar reading the files as text. Loading a bunch of paths can consume resources in a number of ways. For example, are they 1000 directories each with lots of files? That can take a long time to list. You may also end up with well over 1000 partitions, each of them small, so you could try repartitioning down by an order of magnitude if so (see the sketch below). I don't think globs work (?), but it could be worth checking whether they do. You can also try a bigger driver.
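A minimal sketch of the repartitioning idea, assuming a spark-xml read over an explicit list of paths; the row tag, paths, and the 10x reduction factor are placeholders, not anything from the original report:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xml-many-paths").getOrCreate()

// Placeholder list; in the reported case this would be ~1000 file paths.
val paths: Seq[String] = Seq(
  "/mnt/blob/data/file1.xml",
  "/mnt/blob/data/file2.xml"
)

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")          // placeholder row tag
  .load(paths: _*)

// If each input file became its own tiny partition, shrink the partition
// count by roughly an order of magnitude before doing further work.
val compacted = df.coalesce(math.max(1, paths.size / 10))
```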
Thanks for your reply! I see. Globs do work, and little to no GC warning output is triggered. The paths themselves point to actual files; the goal was to avoid the listing time you already mentioned, so I thought that listing the files myself would be faster than a glob.
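For comparison, a hedged example of the glob approach that worked here, letting Spark expand the pattern over the mounted blob store; the mount point and pattern are assumptions for illustration, and an active SparkSession named `spark` is assumed:

```scala
// Let Spark expand the glob itself instead of passing ~1000 explicit paths.
val viaGlob = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")                 // placeholder row tag
  .load("/mnt/blob/data/2020-*/part-*.xml")   // placeholder glob pattern
```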
You can go ahead and close this, since there's no reason to believe spark-xml itself is doing the damage here. It might be worth following up on Spark's own repo about the different memory behavior between globs and comma-separated paths, though.
OK, good to be reminded that globs do work, and that that helped here. Makes sense; Spark is a lot smarter about listing directories when you let it do the listing itself.
I wrote a simple wrapper function for reading XML files that I feed multiple paths. My file system is mounted on an Azure Blob Storage folder.
Loading more than 1k paths at a time causes a lot of garbage-collection messages at the driver level. I am curious what is consuming memory on the driver, and whether spark-xml is simply not expected to read so many files at once. Even though I define the schema myself and collect the paths before running the read (rather than using a wildcard path), the read command takes about 16 minutes for 1000 files. Shouldn't this be faster?
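A rough reconstruction of the setup being described, assuming an existing SparkSession named `spark`; the function name, schema fields, and row tag are placeholders, not the actual code from the issue:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Placeholder schema standing in for the user-defined one.
val xmlSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)
))

// Read an explicit list of pre-collected file paths with a fixed schema,
// so Spark skips schema inference across all of the files.
def readXmlPaths(paths: Seq[String]): DataFrame =
  spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "record")   // placeholder row tag
    .schema(xmlSchema)
    .load(paths: _*)
```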