ScaleUnlimited / flink-crawler

Continuous scalable web crawler built on top of Flink and crawler-commons
Apache License 2.0

Add optional domain quality input to UrlDBFunction #143

Closed: kkrugler closed this issue 6 years ago

kkrugler commented 6 years ago

I think we'd have to change UrlDBFunction from a ProcessFunction to a CoProcessFunction.

The second input would be a Tuple2<String, Float> containing the PLD (paid-level domain) and its average page score, which we'd use to adjust how many URLs we emit for that domain.

We'd likely want to save the per-domain score as state, though I'm curious how state management works with a CoProcessFunction.
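
To make the idea concrete, here is a rough plain-Java sketch of the logic only, with no Flink dependency. Everything in it is hypothetical: the class name `DomainQualitySketch`, the `baseUrlsPerDomain` budget, the assumption that scores fall in [0, 1], and the 0.5x to 1.5x scaling rule are all made up for illustration and are not taken from flink-crawler. In a real CoProcessFunction, `onDomainScore` would roughly correspond to `processElement2`, and the map would presumably be replaced by keyed state on streams keyed by PLD.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: models the two-input idea without Flink.
public class DomainQualitySketch {

    // Stand-in for per-domain state (PLD -> last seen average page score).
    private final Map<String, Float> scoreByPld = new HashMap<>();

    // Assumed default number of URLs emitted per domain when no score is known.
    private final int baseUrlsPerDomain;

    public DomainQualitySketch(int baseUrlsPerDomain) {
        this.baseUrlsPerDomain = baseUrlsPerDomain;
    }

    // Second input: a (PLD, average page score) pair, i.e. the Tuple2<String, Float>.
    public void onDomainScore(String pld, float averagePageScore) {
        scoreByPld.put(pld, averagePageScore);
    }

    // Called on the URL-emitting path to decide how many URLs to emit for a domain.
    public int urlsToEmit(String pld) {
        Float score = scoreByPld.get(pld);
        if (score == null) {
            // No quality information yet, so fall back to the default budget.
            return baseUrlsPerDomain;
        }
        // Assumption: scores are meaningful in [0, 1]; clamp anything outside.
        float clamped = Math.max(0.0f, Math.min(1.0f, score));
        // Assumed rule: scale the budget linearly from 0.5x (score 0) to 1.5x (score 1).
        int scaled = Math.round(baseUrlsPerDomain * (0.5f + clamped));
        // Always emit at least one URL so a low-scoring domain is not starved.
        return Math.max(1, scaled);
    }
}
```

The optional nature of the input falls out of the `null` check: domains that never receive a score keep the existing behavior.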