bjkonglu / resume-bjkonglu

Notes on research experience with Spark and Flink

Joining two streams with Structured Streaming in practice #5

Open bjkonglu opened 6 years ago

bjkonglu commented 6 years ago

Background

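For an advertising workload, we needed to process ad impression logs and ad click logs together and compute the click-through rate (clicks relative to impressions) over a given time period. That requires joining the impression logs with the click logs. While investigating options, we found that Structured Streaming in Spark 2.3.1 supports stream-stream joins.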

Stream-stream joins

The following code example implements a stream-stream join:

import org.apache.spark.sql.functions.expr

val impressions = spark.readStream. ...
val clicks = spark.readStream. ...

// Apply watermarks on event-time columns:
// impressions may arrive up to 2 hours late, clicks up to 3 hours late
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// Join with event-time constraints:
// a click matches an impression of the same ad if it occurs within 1 hour after it
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)
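
To make the join condition concrete, here is a minimal plain-Scala sketch that applies the same predicate to in-memory collections and derives a click-through rate from the result. It does not use Spark and does not model watermarks or streaming state. The names `JoinSketch`, `Impression`, `Click`, `matches`, `joinWithinOneHour` and `clickThroughRate` are hypothetical and exist only for this illustration, and defining the rate as "impressions with at least one matching click, divided by all impressions" is an assumption, not something stated above.

```scala
import java.time.{Duration, Instant}

// Illustration only: hypothetical names, no Spark dependency.
object JoinSketch {
  final case class Impression(impressionAdId: String, impressionTime: Instant)
  final case class Click(clickAdId: String, clickTime: Instant)

  // Same predicate as the SQL expression in the join: equal ad IDs and
  // impressionTime <= clickTime <= impressionTime + 1 hour.
  def matches(i: Impression, c: Click): Boolean = {
    val upper = i.impressionTime.plus(Duration.ofHours(1))
    c.clickAdId == i.impressionAdId &&
      !c.clickTime.isBefore(i.impressionTime) &&
      !c.clickTime.isAfter(upper)
  }

  // Inner join: one output pair per (impression, click) combination that matches.
  def joinWithinOneHour(imps: Seq[Impression], clicks: Seq[Click]): Seq[(Impression, Click)] =
    for (i <- imps; c <- clicks if matches(i, c)) yield (i, c)

  // Assumed definition: share of impressions that have at least one matching click.
  def clickThroughRate(imps: Seq[Impression], clicks: Seq[Click]): Double =
    if (imps.isEmpty) 0.0
    else imps.count(i => clicks.exists(c => matches(i, c))).toDouble / imps.size

  def main(args: Array[String]): Unit = {
    val t0 = Instant.parse("2018-07-01T00:00:00Z")
    val imps = Seq(Impression("ad-1", t0), Impression("ad-2", t0))
    val clicks = Seq(
      Click("ad-1", t0.plus(Duration.ofMinutes(30))), // within the hour: joins
      Click("ad-2", t0.plus(Duration.ofHours(2)))     // more than 1 hour later: dropped
    )
    println(joinWithinOneHour(imps, clicks).size)
    println(clickThroughRate(imps, clicks))
  }
}
```

With the sample data in `main`, only the `ad-1` click falls inside the one-hour window, so one pair is joined and the rate is 0.5. In the streaming version, the watermarks additionally bound how long Spark has to buffer each side while waiting for late matches.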