Refactor VW cache distribution

eHarmony / spotz

Spark Parameter Optimization and Tuning


Refactor VW cache distribution #37

Open vsuthichai opened 7 years ago

vsuthichai commented 7 years ago

There's a slowdown in VW cache distribution at the beginning of the Spark job. Refactor this logic to zip the VW dataset and distribute it to the executors before VW cache generation begins.
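A rough sketch of the zip-then-distribute idea, using only the Python standard library. This is not the spotz implementation: the function names `compress_dataset` and `unpack_on_node`, the default file name `dataset.vw`, and the per-node directory are all hypothetical. How the compressed bytes reach the executors is left out; in a Spark job that step might be a broadcast variable, but that is an assumption.

```python
import gzip
import os


def compress_dataset(path: str) -> bytes:
    """Read a VW-format text file and return its gzip-compressed bytes."""
    with open(path, "rb") as f:
        return gzip.compress(f.read())


def unpack_on_node(payload: bytes, node_dir: str, name: str = "dataset.vw") -> str:
    """Write the decompressed dataset into node_dir, unless it is already there.

    Hypothetical helper: meant to run on an executor before VW cache
    generation starts, so the dataset is unpacked once per directory.
    """
    os.makedirs(node_dir, exist_ok=True)
    target = os.path.join(node_dir, name)
    if not os.path.exists(target):
        tmp = target + ".tmp"
        with open(tmp, "wb") as f:
            f.write(gzip.decompress(payload))
        # Rename into place so a reader never sees a partially written file.
        os.replace(tmp, target)
    return target
```

The driver would call `compress_dataset` once, ship the resulting bytes, and each executor would call `unpack_on_node` before generating its VW cache. The existence check makes repeated calls on the same node cheap, although two tasks starting at the same moment could still both write the temporary file.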

vsuthichai commented 7 years ago

In local mode, cache generation happens only once. When executing over the cluster, the cache has to be generated on every node.