
Write a tool to find out the actual point at which MapReduce becomes better than Fluo. #116

Open cjnolet opened 10 years ago

cjnolet commented 10 years ago

One of the benchmarks in the Percolator paper shows that higher crawl rates (changes to roughly 40% or more of the repository) favor MapReduce indexing, while rates below 40% favor Percolator. I'd like to see the same numbers for Fluo and Accumulo. There was a drastic, near-exponential increase in Percolator's cost once the crawl rate went past the point where MapReduce becomes the better option. Would our numbers look the same? Would we be able to support higher crawl rates in Fluo? Can we strive to improve this over time?

Specifically, this tool can be used as we develop Accumulo and Fluo to

1) initially see how our numbers compare to Percolator proper, and
2) see whether our numbers have changed (for better or for worse) as we continue to add new features and optimize

keith-turner commented 10 years ago

Definitely need something to test performance. The primary purpose of #24 was to test for correctness at scale. The Google paper mentioned a test they wrote that clusters random documents using three random keys. I think the keys were random ints mod 10^9. I was never exactly sure what that test did.
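For what it's worth, the generator side of that test might have looked something like this minimal sketch. It assumes the keys really are uniform ints mod 10^9 as described above (the paper itself says 750M possible keys); the class and field names are made up for illustration:

```java
import java.util.Random;

/**
 * Minimal sketch of the generator for the Google clustering test:
 * each synthetic document gets three clustering keys drawn uniformly
 * at random. The 10^9 key space matches the comment above; the class
 * and field names are made up for illustration.
 */
public class SyntheticDoc {
  static final long KEY_SPACE = 1_000_000_000L; // "random ints mod 10^9"

  final long docId;
  final long[] clusterKeys = new long[3];

  SyntheticDoc(long docId, Random rand) {
    this.docId = docId;
    for (int i = 0; i < clusterKeys.length; i++) {
      // nextLong() can be negative; floorMod keeps keys in [0, KEY_SPACE)
      clusterKeys[i] = Math.floorMod(rand.nextLong(), KEY_SPACE);
    }
  }
}
```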

cjnolet commented 10 years ago

Yeah, a performance test is a +1 for sure. It would also be valuable as a benchmark, so that the performance peak is blatantly obvious. The performance peak in this regard is the point at which MapReduce actually becomes the better option.

cjnolet commented 10 years ago

I'm thinking of starting with an algorithm that lets us dump some number N of entries into a Fluo table and then adjust the percentage of the store we update through transactions. We should mimic a MapReduce process over the same data table (not using Fluo's API) so both approaches provide the same functionality. I'd like to run the tests over several different percentage combinations and find the sweet spot. Hopefully the optimal crawl rate ratio will be similar to (or better than) Percolator's (0%-40% optimal for Fluo, >40% optimal for MapReduce). At the very least we should be able to benchmark and (see the driver sketch after this list)

1) post our results
2) determine the impact of large design changes
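Here's a rough sketch of what that driver could look like. It assumes the Fluo 1.x client API (FluoFactory.newClient, Transaction.set/commit) and a table preloaded with numDocs entries; the row layout, column names, and one-transaction-per-update pattern are placeholders, not a committed design:

```java
import java.util.Random;
import org.apache.fluo.api.client.FluoClient;
import org.apache.fluo.api.client.FluoFactory;
import org.apache.fluo.api.client.Transaction;
import org.apache.fluo.api.config.FluoConfiguration;
import org.apache.fluo.api.data.Column;

/**
 * Sketch of the proposed driver: given a table preloaded with numDocs
 * entries, update a configurable fraction of them (the "crawl rate")
 * through Fluo transactions and time the incremental pass. A MapReduce
 * pass over the same Accumulo table would be timed separately for
 * comparison.
 */
public class CrawlRateBenchmark {
  static final Column CONTENT = new Column("doc", "content");

  public static void main(String[] args) {
    long numDocs = Long.parseLong(args[0]);         // N preloaded entries
    double crawlRate = Double.parseDouble(args[1]); // e.g. 0.10 for 10%

    FluoConfiguration conf = new FluoConfiguration(); // assumes fluo props are set
    try (FluoClient client = FluoFactory.newClient(conf)) {
      Random rand = new Random(42);
      long updates = (long) (numDocs * crawlRate);

      long start = System.currentTimeMillis();
      for (long i = 0; i < updates; i++) {
        // pick a random existing document and rewrite it in its own tx
        long doc = Math.floorMod(rand.nextLong(), numDocs);
        try (Transaction tx = client.newTransaction()) {
          tx.set("doc:" + doc, CONTENT, "updated-" + i);
          tx.commit();
        }
      }
      long elapsed = System.currentTimeMillis() - start;

      System.out.printf("crawl rate %.0f%% -> %d updates in %d ms%n",
          crawlRate * 100, updates, elapsed);
    }
  }
}
```

Timing the same fraction of updates as a single MapReduce pass over the underlying Accumulo table would give us the crossover point we're after.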

cjnolet commented 9 years ago

Crawl Rate = % of the repository updated per hour, over a billion-document repository. For example, a 10% crawl rate means roughly 100 million of the billion documents are re-clustered each hour.

Here's the excerpt from the paper:

"To quantify the benefits of moving from MapReduce to Percolator, we created a synthetic benchmark that clusters newly crawled documents against a billion document repository to remove duplicates in much the same way Google’s indexing pipeline operates. Documents are clustered by three clustering keys. In a real system, the clustering keys would be properties of the document like redirect target or content hash, but in this experiment we selected them uniformly at random from a collection of 750M possible keys. The average cluster in our synthetic repository contains 3.3 documents, and 93% of the documents are in a non-singleton cluster. This distribution of keys exercises the clustering logic, but does not expose it to the few extremely large clusters we have seen in practice. These clusters only affect the latency tail and not the results we present here. In the Percolator clustering implementation, each crawled document is immediately written to the repository to be clustered by an observer. The observer maintains an index table for each clustering key and compares the document against each index to determine if it is a duplicate (an elaboration of Figure 2). MapReduce implements clustering of continually arriving documents by repeatedly running a sequence of three clustering MapReduces (one for each clustering key). The sequence of three MapReduces processes the entire repository and any crawled documents that accumulated while the previous three were running."