Closed: DmitriyAlergant-T1A-Rogers closed this issue 2 years ago
It's not necessary at all. Just start a local Spark cluster in-process: it's only a few lines of code to create a session with a "local[*]" master, which gives you the full parallelism of your machine. Read with spark-xml and call toPandas() on the result (see the sketch below).
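For illustration, a minimal sketch of that approach, assuming PySpark is installed; the spark-xml package coordinates/version, the `books.xml` path, and the `book` row tag are placeholders, not taken from this thread:

```python
from pyspark.sql import SparkSession

# In-process "cluster" using all local cores; pulls the spark-xml package
# (coordinates/version are illustrative) from Maven on first startup.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("xml-to-pandas")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.16.0")
    .getOrCreate()
)

# Each <book> element becomes one row; its child elements become columns.
df = (
    spark.read.format("xml")
    .option("rowTag", "book")
    .load("books.xml")
)

pdf = df.toPandas()  # collect the (small) result into a Pandas DataFrame
spark.stop()
```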
If you really just have small XML files though, this isn't the right tool; just use any XML parser.
That's true, but we have gotten used to this library's interface, we like it, and we have existing code that depends on it. The file structures can also be complex.
Also, I believe the AWS Lambda environment is still too constrained to start a local Spark cluster inside it. I have found accounts of a few people who achieved it, but even they used Lambda for the workers, not the driver, and it does not look like a simple or neat solution at all.
How about pandas.read_xml? And a Spark 'cluster' can be tiny and entirely in-process with local[*]; that's probably quite viable.
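A quick sketch of the pandas.read_xml route (available in pandas 1.3+); the file path and XPath are illustrative:

```python
import pandas as pd

# Each element matched by xpath becomes one row; its children become columns.
df = pd.read_xml("books.xml", xpath=".//book")
print(df.head())
```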
Hi, is anyone aware of an existing spark-xml fork that provides the same interface but runs in a single Python process, without a dependency on Spark (returning a Pandas DataFrame, for example)? Looking at the code, it seems possible with some significant refactoring, while still preserving 95% of the value of the existing codebase.
The use case is parsing small XML files (e.g., legacy API interfaces) in a lightweight serverless environment (AWS Lambda) without having to run a Spark cluster or wait minutes for a "serverless" Spark pool to start up.
We might fork and do it ourselves, but if someone has heard of an existing fork that already does this, that would be fabulous.