MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

exploring spark cluster #235

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

Some preliminary testing on which configurations would need to change to support a Spark cluster, as opposed to running locally via Livy.

Running as a standalone cluster on a single machine appears to speed up processing fairly dramatically (it leverages multiple cores), but more testing is needed to figure out how Combine can manage this process.

ghukill commented 6 years ago

If uploading jars to remote workers, the directories holding those jars also need to be whitelisted in livy.conf, e.g. /usr/share/java/
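A rough sketch of how a client could check jar paths against that whitelist before asking Livy for a session. It assumes the whitelist is the `livy.file.local-dir-whitelist` setting in livy.conf and that jars are passed in the `jars` field of the session request body. The helper names and the jar filename are hypothetical and not taken from Combine.

```python
from pathlib import PurePosixPath

# Assumption: mirrors the value of livy.file.local-dir-whitelist in livy.conf.
WHITELISTED_DIRS = ["/usr/share/java/"]


def is_whitelisted(jar_path, whitelisted_dirs=WHITELISTED_DIRS):
    """Return True if jar_path sits inside one of the whitelisted directories."""
    path = PurePosixPath(jar_path)
    return any(PurePosixPath(d) in path.parents for d in whitelisted_dirs)


def build_session_payload(jars, kind="pyspark"):
    """Build the JSON body for a Livy session request (hypothetical helper).

    Raises ValueError early for jars outside the whitelist, since Livy
    would be expected to reject them anyway.
    """
    rejected = [j for j in jars if not is_whitelisted(j)]
    if rejected:
        raise ValueError("jars outside whitelisted directories: %s" % rejected)
    return {"kind": kind, "jars": list(jars)}


# Hypothetical jar name, for illustration only.
payload = build_session_payload(["/usr/share/java/example-connector.jar"])
```

The resulting `payload` dict would then be sent as the JSON body when creating the session; the network call is left out so the sketch stays independent of any particular Livy host.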

ghukill commented 6 years ago

These changes have largely been implemented for default builds. Binding addresses and hosts are generalized to 0.0.0.0 and local where possible, but services are listening on outside interfaces as well.

Performance-wise, running as a cluster shows clear advantages over running locally in Livy, and it positions Combine nicely for other nodes to participate. Closing.