ScaleUnlimited / flink-crawler

Continuous scalable web crawler built on top of Flink and crawler-commons
Apache License 2.0

Calculate a DomainRank as part of the crawling process #25

Open kkrugler opened 7 years ago

kkrugler commented 7 years ago

We want to score URLs by how important their domain is, so that we can efficiently archive URLs from spammy link farms (among other things).

One approach is to do a simple check when generating outlinks: if the fetched page's domain and the outlink's domain differ, emit a domain->domain result as well as the outlink.
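
Roughly, that check could look like the sketch below. The `getPaidLevelDomain()` helper here is a naive stand-in; crawler-commons' EffectiveTldFinder would be the proper way to get the paid-level domain.

```java
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.flink.api.java.tuple.Tuple2;

public class DomainPairExtractor {

    /**
     * Returns a domain->domain pair when the fetched page and the outlink
     * live on different (paid-level) domains, or null otherwise.
     */
    public static Tuple2<String, String> extract(String pageUrl, String outlinkUrl) {
        try {
            String fromDomain = getPaidLevelDomain(new URL(pageUrl).getHost());
            String toDomain = getPaidLevelDomain(new URL(outlinkUrl).getHost());
            return fromDomain.equals(toDomain) ? null : Tuple2.of(fromDomain, toDomain);
        } catch (MalformedURLException e) {
            return null; // skip unparseable URLs
        }
    }

    // Naive stand-in for real PLD extraction (doesn't handle multi-part
    // TLDs like .co.uk) - use crawler-commons' EffectiveTldFinder instead.
    private static String getPaidLevelDomain(String host) {
        String[] parts = host.split("\\.");
        int n = parts.length;
        return (n <= 2) ? host : parts[n - 2] + "." + parts[n - 1];
    }
}
```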

We can split this off from the stream and apply something like the Flink PageRank algorithm. That algorithm is batch-oriented, though, and we'd like this to be continuous. We could save the domain->domain data in S3, and periodically run the batch job that generates new results.
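
Splitting the domain->domain pairs off the main stream might use a Flink side output, something like the sketch below. `FetchedPage` and the S3 path are assumptions, and the CSV sink is just a placeholder for however we'd stage the data for the batch job.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class DomainGraphSplit {

    // Side-output tag for the domain->domain pairs.
    private static final OutputTag<Tuple2<String, String>> DOMAIN_PAIRS =
            new OutputTag<Tuple2<String, String>>("domain-pairs") {};

    // FetchedPage (with getUrl()/getOutlinks()) is a hypothetical type
    // standing in for whatever the real topology passes around.
    public static void wire(DataStream<FetchedPage> fetchedPages) {
        SingleOutputStreamOperator<String> outlinks = fetchedPages
            .process(new ProcessFunction<FetchedPage, String>() {
                @Override
                public void processElement(FetchedPage page, Context ctx,
                        Collector<String> out) {
                    for (String outlink : page.getOutlinks()) {
                        out.collect(outlink); // normal outlink path
                        Tuple2<String, String> pair =
                            DomainPairExtractor.extract(page.getUrl(), outlink);
                        if (pair != null) {
                            ctx.output(DOMAIN_PAIRS, pair); // cross-domain link
                        }
                    }
                }
            });

        // Stage the pairs so a periodic batch PageRank job can pick them up.
        outlinks.getSideOutput(DOMAIN_PAIRS)
                .writeAsCsv("s3://crawl-data/domain-pairs/");
    }
}
```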

We'd like to use these new results as another stream that feeds CrawlDB (making that function a RichCoFlatMapFunction), so it can maintain an updated set of this data that it uses when scoring URLs during a merge.
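
A minimal sketch of that CrawlDB co-flat-map, assuming both inputs are keyed by domain and using `CrawlStateUrl`/`ScoredUrl` as stand-ins for the real crawl-state POJOs:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class CrawlDbFunction extends
        RichCoFlatMapFunction<CrawlStateUrl, Tuple2<String, Double>, ScoredUrl> {

    private static final double DEFAULT_RANK = 0.15;

    // Latest DomainRank score for the current key (domain).
    private transient ValueState<Double> domainRank;

    @Override
    public void open(Configuration parameters) {
        domainRank = getRuntimeContext().getState(
            new ValueStateDescriptor<>("domain-rank", Double.class));
    }

    // Stream 1: URLs being merged into the crawl DB - score them with the
    // most recent rank we've seen for their domain.
    @Override
    public void flatMap1(CrawlStateUrl url, Collector<ScoredUrl> out) throws Exception {
        Double rank = domainRank.value();
        out.collect(new ScoredUrl(url, (rank == null) ? DEFAULT_RANK : rank));
    }

    // Stream 2: <domain, rank> updates coming out of the DomainRank job.
    @Override
    public void flatMap2(Tuple2<String, Double> update, Collector<ScoredUrl> out)
            throws Exception {
        domainRank.update(update.f1);
    }
}
```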

kkrugler commented 7 years ago

Some interesting papers on incremental PageRank:

kkrugler commented 7 years ago

We could have a simple LinkDB with one entry for each unique domain. Each entry would hold the domain's mass (current & new) and a list of outdomains.
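
One possible shape for such an entry (field and method names are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class LinkDbEntry {

    private final String domain;
    private double currentMass;  // mass from the previous iteration
    private double newMass;      // mass accumulated during this iteration
    private List<String> outdomains = new ArrayList<>();

    public LinkDbEntry(String domain) {
        this.domain = domain;
    }

    public String getDomain() {
        return domain;
    }

    public void addOutdomain(String outdomain) {
        if (!outdomains.contains(outdomain)) {
            outdomains.add(outdomain);
        }
    }

    public List<String> getOutdomains() {
        return outdomains;
    }

    public void addMass(double mass) {
        newMass += mass;
    }

    // The share of this entry's current mass that each outdomain receives.
    public double massPerOutdomain() {
        return outdomains.isEmpty() ? 0.0 : currentMass / outdomains.size();
    }

    // At the start of each loop: the mass accumulated last time around
    // becomes the "old" (current) mass, and "new" is zeroed.
    public void startIteration() {
        currentMass = newMass;
        newMass = 0.0;
    }
}
```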

The LinkDBFunction would be a Flink CoFlatMapFunction. One input stream is domain1->domain2 tuples (keyed by domain1); these may trigger an update to the LinkDB. This stream comes from a Flink source or (eventually) the flink-crawler topology.

The other input stream is <domain, mass> tuples, i.e. mass that needs to go to domain. A special domain value ("epsilon-<task id>") marks epsilon mass that should be distributed evenly across all domains. Incoming mass gets added to the entry's "new" mass.

The <domain, mass> stream is an iteration: it's emitted by the LinkDBFunction itself as it continuously loops over its set of domains. At the start of each loop it zeroes each entry's "new" mass and generates the output tuples from the "old" mass. When it finishes with all of its domains, it emits the special epsilon entry, whose domain identifies the task id as per above.

The LinkDBFunction can then wait until it has received the epsilon mass from every one of the (parallel) LinkDBFunction tasks. At that point it can compare current vs. new mass for its domains and decide whether the computation has converged.
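
Putting those pieces together, the function might look something like the sketch below, using the `LinkDbEntry` from the earlier comment. This keeps the per-task LinkDB as a plain in-memory map and uses a RichCoFlatMapFunction so it can ask the runtime context for the parallelism; checkpointing, the broadcast routing of epsilon tuples to every task, and computing the actual epsilon mass are all left out.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class LinkDbFunction extends
        RichCoFlatMapFunction<Tuple2<String, String>,   // domain1 -> domain2
                              Tuple2<String, Double>,   // incoming <domain, mass>
                              Tuple2<String, Double>> { // outgoing <domain, mass>

    private static final String EPSILON_PREFIX = "epsilon-";

    private transient Map<String, LinkDbEntry> linkDb;
    private transient int epsilonsSeen;

    @Override
    public void open(Configuration parameters) {
        linkDb = new HashMap<>();
        epsilonsSeen = 0;
    }

    // Input 1: domain1->domain2 tuples, which may add a new outdomain.
    @Override
    public void flatMap1(Tuple2<String, String> link,
            Collector<Tuple2<String, Double>> out) {
        linkDb.computeIfAbsent(link.f0, LinkDbEntry::new).addOutdomain(link.f1);
    }

    // Input 2: <domain, mass> tuples, looped back via the iteration.
    @Override
    public void flatMap2(Tuple2<String, Double> massTuple,
            Collector<Tuple2<String, Double>> out) {
        if (massTuple.f0.startsWith(EPSILON_PREFIX)) {
            // Epsilon mass: spread it evenly over this task's domains
            // (assumes epsilon tuples get broadcast to every task).
            if (!linkDb.isEmpty()) {
                double perDomain = massTuple.f1 / linkDb.size();
                for (LinkDbEntry entry : linkDb.values()) {
                    entry.addMass(perDomain);
                }
            }
            // Once every parallel task's epsilon has arrived, we could
            // compare current vs. new mass to check convergence, then
            // kick off the next loop over our domains.
            if (++epsilonsSeen == getRuntimeContext().getNumberOfParallelSubtasks()) {
                epsilonsSeen = 0;
                emitNextIteration(out);
            }
        } else {
            linkDb.computeIfAbsent(massTuple.f0, LinkDbEntry::new)
                  .addMass(massTuple.f1);
        }
    }

    private void emitNextIteration(Collector<Tuple2<String, Double>> out) {
        for (LinkDbEntry entry : linkDb.values()) {
            entry.startIteration(); // promote new mass to old, zero new
            double perOutdomain = entry.massPerOutdomain();
            for (String outdomain : entry.getOutdomains()) {
                out.collect(Tuple2.of(outdomain, perOutdomain));
            }
        }
        // Special epsilon entry, tagged with this task's id. Computing the
        // real epsilon mass (e.g. dangling + teleport mass) is left out.
        out.collect(Tuple2.of(
            EPSILON_PREFIX + getRuntimeContext().getIndexOfThisSubtask(), 0.0));
    }
}
```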

kkrugler commented 7 years ago

I should also look at the TrustRank algorithm again, as that's a slightly different approach.

kkrugler commented 7 years ago

Links on basic PageRank:

kkrugler commented 7 years ago

A simple first pass would be to create a completely separate topology that takes in link->link pairs, and implements the approach described above (using a "LinkDB" function) to iterate toward a stable solution.

Knowing when to stop is challenging, but for now we could just manually kill the job. A rough wiring sketch follows.
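
Something like the following, using a Flink streaming iteration to close the <domain, mass> feedback loop. The input path is an assumption, and bootstrapping the initial mass into the loop is left out of this sketch.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream.ConnectedIterativeStreams;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DomainRankTopology {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // domain1 -> domain2 pairs, e.g. from files written by the crawl.
        DataStream<Tuple2<String, String>> links = env
            .readTextFile("s3://crawl-data/domain-pairs/")
            .map(line -> {
                String[] parts = line.split(",");
                return Tuple2.of(parts[0], parts[1]);
            })
            .returns(Types.TUPLE(Types.STRING, Types.STRING));

        // Streaming iteration: the <domain, mass> tuples emitted by
        // LinkDbFunction are fed back in as its second input.
        ConnectedIterativeStreams<Tuple2<String, String>, Tuple2<String, Double>> loop =
            links.iterate().withFeedbackType(Types.TUPLE(Types.STRING, Types.DOUBLE));

        DataStream<Tuple2<String, Double>> mass = loop.flatMap(new LinkDbFunction());
        loop.closeWith(mass);

        // No convergence-based shutdown yet - runs until manually killed.
        env.execute("DomainRank");
    }
}
```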