databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Fork for Spark-xml without Spark? #568

Closed: DmitriyAlergant-T1A-Rogers closed this issue 2 years ago

DmitriyAlergant-T1A-Rogers commented 2 years ago

Hi, is anyone aware of an existing spark-xml fork that provides the same interface, but in a single Python process, without a dependency on Spark? Returning a pandas DataFrame, for example. Judging by the code, this looks possible with some significant refactoring, while still preserving 95% of the value of the existing codebase.

The use case would involve parsing small XML files (e.g., legacy API interfaces) within a lightweight serverless environment (AWS Lambda), without having to run a Spark cluster or wait minutes for "serverless" Spark pools to start up.

We might fork and do it ourselves, but if someone has heard of an existing fork that already does this, that would be fabulous.

srowen commented 2 years ago

It's not necessary at all. Just start a local Spark cluster in-process: it's only a few lines of code to create a session with a "local[*]" master, which gives you the full parallelism of your machine. Read with spark-xml and call toPandas() on the result.
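
For example, a minimal PySpark sketch of this approach. The package coordinates, the `books.xml` file name, and the `book` row tag below are assumptions for illustration; adjust them to your environment and data:

```python
from pyspark.sql import SparkSession

# In-process "cluster" using all local cores; no external infrastructure needed.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("xml-to-pandas")
    # Pull in the spark-xml package (version is an assumption; pick a current one).
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0")
    .getOrCreate()
)

df = (
    spark.read.format("xml")
    .option("rowTag", "book")   # each <book> element becomes one row
    .load("books.xml")          # hypothetical input file
)

pdf = df.toPandas()             # collect to a local pandas DataFrame
spark.stop()
```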

srowen commented 2 years ago

If you really just have small XML files, though, this isn't the right tool. Just use any XML parser.
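
For small files, the standard library is often enough. A minimal sketch, assuming a hypothetical `response.xml` with repeating `<record>` elements whose children are flat fields:

```python
import xml.etree.ElementTree as ET

tree = ET.parse("response.xml")

# Flatten each <record> into a dict of {child tag: text}.
rows = [
    {child.tag: child.text for child in record}
    for record in tree.getroot().iter("record")
]
print(rows[:3])
```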

DmitriyAlergant-T1A-Rogers commented 2 years ago

That's true, but we have gotten used to this library's interface, we like it, and we may have existing code built on it. And the file structure may be complex.

Also, I believe the AWS Lambda environment is still too constrained to start a local Spark cluster inside it. I found stories of folks who achieved that, but even they used Lambda for the workers, not the driver, and it does not look like a simple and neat solution at all.

srowen commented 2 years ago

How about pandas.read_xml? A Spark "cluster" can be tiny and entirely in-process with local[*]. It's probably quite viable.
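
A minimal sketch of the pandas.read_xml route (available in pandas 1.3 and later); the file name and XPath are assumptions for illustration:

```python
import pandas as pd

# Each element matched by the XPath becomes one row in the resulting DataFrame.
pdf = pd.read_xml("books.xml", xpath=".//book")
print(pdf.head())
```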