kkrugler opened 7 years ago
We could have a simple `LinkDB` with one entry for each unique domain. Each entry would have the domain's mass (current & new), and a list of outdomains.
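A minimal sketch of what one of those entries might hold, assuming a plain Java class; the class and method names here are illustrative, not from the actual flink-crawler code:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative LinkDB entry: one per unique domain, tracking "current" (old)
// mass, "new" mass accumulated during a pass, and the set of outdomains.
public class LinkDBEntry {

    private final String domain;
    private final Set<String> outDomains = new HashSet<>();
    private double currentMass;
    private double newMass;

    public LinkDBEntry(String domain) {
        this.domain = domain;
    }

    public void addOutDomain(String outDomain) { outDomains.add(outDomain); }
    public void addNewMass(double mass) { newMass += mass; }
    public void setNewMass(double mass) { newMass = mass; }
    public void setCurrentMass(double mass) { currentMass = mass; }

    public String getDomain() { return domain; }
    public Set<String> getOutDomains() { return outDomains; }
    public double getCurrentMass() { return currentMass; }
    public double getNewMass() { return newMass; }
}
```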
The `LinkDBFunction` would be a Flink `CoFlatMapFunction`. One input stream is domain1 -> domain2 tuples (keyed by domain1), each of which might trigger an update to the LinkDB. This stream comes from a Flink `Source` or (eventually) from the flink-crawler topology.
The other input stream is `<domain, mass>` tuples, i.e. mass that needs to go to `domain`. A special value of `domain` (`"epsilon-<task id>"`) means this is epsilon mass that should be evenly distributed to all domains. Incoming mass gets added to the entry's "new" mass.
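A rough skeleton of that function, assuming `Tuple2<String, String>` link tuples and `Tuple2<String, Double>` mass tuples, with a hypothetical `EPSILON_PREFIX` constant for the special epsilon domain (state is kept in a plain map here; a real implementation would likely use Flink state):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

@SuppressWarnings("serial")
public class LinkDBFunction implements
        CoFlatMapFunction<Tuple2<String, String>, Tuple2<String, Double>, Tuple2<String, Double>> {

    // Prefix for the special "epsilon-<task id>" domain values.
    public static final String EPSILON_PREFIX = "epsilon-";

    private final Map<String, LinkDBEntry> linkDB = new HashMap<>();

    // Input 1: domain1 -> domain2 tuples (keyed by domain1); each one may add
    // a new entry or a new outdomain to an existing entry.
    @Override
    public void flatMap1(Tuple2<String, String> link, Collector<Tuple2<String, Double>> out) {
        linkDB.computeIfAbsent(link.f0, LinkDBEntry::new).addOutDomain(link.f1);
    }

    // Input 2: <domain, mass> tuples. Epsilon mass is spread evenly over the
    // domains this task knows about; all other mass is added to the target
    // entry's "new" mass.
    @Override
    public void flatMap2(Tuple2<String, Double> mass, Collector<Tuple2<String, Double>> out) {
        if (mass.f0.startsWith(EPSILON_PREFIX)) {
            double share = mass.f1 / Math.max(1, linkDB.size());
            linkDB.values().forEach(entry -> entry.addNewMass(share));
        } else {
            linkDB.computeIfAbsent(mass.f0, LinkDBEntry::new).addNewMass(mass.f1);
        }
    }
}
```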
The `<domain, mass>` stream is an iteration emitted by the `LinkDBFunction` itself, which continuously loops over its set of domains. At the start of each loop it sets each entry's "new" mass to zero, and generates these output tuples based on the "old" mass. When it finishes with all domains, it emits the special epsilon entry, whose domain identifies the task id, as per above.
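One pass of that loop could look roughly like the method below (meant to live inside the `LinkDBFunction` sketch above). The `taskIndex` parameter and the treatment of mass from domains with no outlinks as epsilon mass are assumptions:

```java
// One pass over this task's domains: zero "new" mass, distribute "old" mass
// to outdomains, then emit the epsilon marker for this task.
private void emitOnePass(Collector<Tuple2<String, Double>> out, int taskIndex) {
    double epsilonMass = 0.0;

    for (LinkDBEntry entry : linkDB.values()) {
        entry.setNewMass(0.0);

        int numOutDomains = entry.getOutDomains().size();
        if (numOutDomains == 0) {
            // No outlinks: treat this entry's mass as epsilon mass.
            epsilonMass += entry.getCurrentMass();
        } else {
            double share = entry.getCurrentMass() / numOutDomains;
            for (String outDomain : entry.getOutDomains()) {
                out.collect(Tuple2.of(outDomain, share));
            }
        }
    }

    // End-of-pass marker, tagged with this task's id.
    out.collect(Tuple2.of(EPSILON_PREFIX + taskIndex, epsilonMass));
}
```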
Each `LinkDBFunction` task can then wait until it has received the epsilon mass from all of the (parallel) `LinkDBFunction` tasks. At that point it can compare the current/new mass for its domains, and decide whether the computation has converged.
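That check might look something like the following, again inside the `LinkDBFunction` sketch; the convergence threshold and the roll-over of "new" mass into "current" mass are assumptions:

```java
private static final double CONVERGENCE_DELTA = 0.0001;

// Called once the epsilon entries from all parallel tasks have arrived.
private boolean hasConverged() {
    double totalDelta = 0.0;
    for (LinkDBEntry entry : linkDB.values()) {
        totalDelta += Math.abs(entry.getNewMass() - entry.getCurrentMass());
        // Roll the "new" mass into "current" so the next pass distributes it.
        entry.setCurrentMass(entry.getNewMass());
    }
    return totalDelta < CONVERGENCE_DELTA;
}
```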
I should also look at the TrustRank algorithm again, as that's a slightly different approach.
A simple first pass would be to create a completely separate topology that takes in link->link pairs and implements the approach described above (using a "LinkDB" function) to iterate toward a stable solution. Knowing when to stop is challenging, but for now we could just kill the job manually.
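One possible way to wire up such a standalone topology is a Flink streaming iteration, so the <domain, mass> output of the `LinkDBFunction` feeds back in as its second input. This is only a sketch; the source is a stand-in, and keying/parallelism details are glossed over:

```java
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream.ConnectedIterativeStreams;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DomainRankTopology {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Keep the iteration head and feedback parallelism aligned for this sketch.
        env.setParallelism(1);

        // Stand-in for the real domain -> domain link source.
        DataStream<Tuple2<String, String>> domainLinks = env.fromElements(
                Tuple2.of("domain1.com", "domain2.com"),
                Tuple2.of("domain2.com", "domain3.com"));

        // Streaming iteration with a feedback type that differs from the input type.
        ConnectedIterativeStreams<Tuple2<String, String>, Tuple2<String, Double>> iteration =
                domainLinks.iterate()
                        .withFeedbackType(TypeInformation.of(new TypeHint<Tuple2<String, Double>>() {}));

        DataStream<Tuple2<String, Double>> massUpdates = iteration.flatMap(new LinkDBFunction());
        iteration.closeWith(massUpdates);

        env.execute("domain-rank");
    }
}
```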
We want to score URLs by how important their domain is, so that we can efficiently avoid URLs from spammy link farms (among other things).
One approach is to do a simple check when generating outlinks: if the fetched page's domain and the outlink domain are different, then emit a domain->domain result as well as the outlink.
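As a sketch, that check could be a small FlatMapFunction over (fetched URL, outlink URL) pairs; the domain extraction below just uses the host, whereas real code would probably use paid-level domain extraction:

```java
import java.net.URI;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Emits a domain -> domain tuple only when the fetched page's domain and the
// outlink's domain differ.
@SuppressWarnings("serial")
public class CrossDomainLinkFunction
        implements FlatMapFunction<Tuple2<String, String>, Tuple2<String, String>> {

    @Override
    public void flatMap(Tuple2<String, String> urlPair, Collector<Tuple2<String, String>> out) {
        String fetchDomain = toDomain(urlPair.f0);
        String outlinkDomain = toDomain(urlPair.f1);
        if (!fetchDomain.equals(outlinkDomain)) {
            out.collect(Tuple2.of(fetchDomain, outlinkDomain));
        }
    }

    // Placeholder domain extraction (host only).
    private String toDomain(String url) {
        return URI.create(url).getHost();
    }
}
```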
We can split this off from the stream and apply something like the Flink PageRank algorithm, though that is batch-oriented and we'd like this to be continuous. We could save the domain->domain data in S3, and periodically run a batch job that generates new results.
We'd like to use these new results as another stream that feeds the CrawlDB function (making it a `RichCoFlatMapFunction`), so it can maintain an updated set of this data that it uses when scoring URLs during a merge.
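A rough sketch of what that could look like: the CrawlDB function keeps a map of the latest domain scores from the second input and consults it when scoring URLs. The types, field names, and scoring are placeholders, not the actual flink-crawler code (a real implementation would likely keep this in Flink state rather than a plain map):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

@SuppressWarnings("serial")
public class CrawlDBFunction extends
        RichCoFlatMapFunction<String, Tuple2<String, Double>, Tuple2<String, Double>> {

    private transient Map<String, Double> domainScores;

    @Override
    public void open(Configuration parameters) {
        domainScores = new HashMap<>();
    }

    // Input 1: URLs being merged/scored; the domain score feeds into the URL score.
    @Override
    public void flatMap1(String url, Collector<Tuple2<String, Double>> out) {
        String domain = URI.create(url).getHost();
        double domainScore = domainScores.getOrDefault(domain, 0.0);
        out.collect(Tuple2.of(url, domainScore));
    }

    // Input 2: updated <domain, score> results from the domain-ranking job.
    @Override
    public void flatMap2(Tuple2<String, Double> score, Collector<Tuple2<String, Double>> out) {
        domainScores.put(score.f0, score.f1);
    }
}
```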