MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

spark workflow avro vs. mysql reading #223

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

What would be the performance gain from waiting until the very end to write MySQL rows, and instead relying on avro files as checkpoints/breakpoints through the various phases?

Moreover, what is the effect of this coalesce here? If this has no output, I would assume that downstream steps go back up and might re-run earlier points in the code:

if write_avro:
    records_df_combine_cols.coalesce(settings.SPARK_REPARTITION)\
    .write.format("com.databricks.spark.avro").save(self.job.job_output)

Would be worth investigating performance from this angle.
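
To illustrate the recomputation concern: Spark transformations are lazy, and each action re-evaluates the upstream lineage unless an intermediate result has been cached or written out. The sketch below is a toy model in plain Python, not Spark or Combine code. `LazyStage`, `persist`, and `build` are hypothetical names made up for the example, and the two `compute()` calls stand in for two separate actions (say, an avro write followed by a DB write).

```python
# Toy model of lazy evaluation (NOT Spark itself; all names are hypothetical).
# Each stage recomputes its parent on every action unless its result
# has been materialized.

class LazyStage:
    def __init__(self, name, fn, parent=None):
        self.name = name
        self.fn = fn
        self.parent = parent
        self.runs = 0           # how many times this stage actually executed
        self._persist = False
        self._saved = None      # stands in for a cached result or a written file

    def persist(self):
        """Mark this stage so its first computed result is kept and reused."""
        self._persist = True
        return self

    def compute(self):
        """Stands in for an action: walks back up the lineage as needed."""
        if self._saved is not None:
            return self._saved  # breakpoint hit: nothing upstream re-runs
        upstream = self.parent.compute() if self.parent else None
        self.runs += 1
        result = self.fn(upstream)
        if self._persist:
            self._saved = result
        return result


def build(persist_transform):
    """Build a two-stage pipeline: load -> transform."""
    load = LazyStage("load", lambda _: list(range(5)))
    transform = LazyStage("transform", lambda rows: [r * 2 for r in rows], load)
    if persist_transform:
        transform.persist()
    return load, transform


# Two actions with no materialized intermediate: everything runs twice.
load, transform = build(persist_transform=False)
transform.compute()
transform.compute()
runs_without = (load.runs, transform.runs)

# Two actions with the intermediate result kept: everything runs once.
load, transform = build(persist_transform=True)
transform.compute()
transform.compute()
runs_with = (load.runs, transform.runs)

print("without breakpoint:", runs_without)
print("with breakpoint:", runs_with)
```

In this model the unpersisted pipeline executes both stages once per action, while the persisted one executes them once in total. Whether the avro write in the snippet above plays that role in the real job would still need to be measured.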

ghukill commented 6 years ago

Closing - switch to Mongo has helped dramatically.