dmdedup / dmdedup3.19

Device-mapper Deduplication Target

how to benchmark dmdedup #30

Open venktesh-bolla opened 7 years ago

venktesh-bolla commented 7 years ago

Hi Team,

I would like to benchmark dmdedup as described in the documentation and the published paper. The paper states that one test used 40 Linux kernels to measure the level of deduplication achieved with dmdedup. As a learning exercise, I want to reproduce the published numbers, and I will share the tabulated values as soon as I have them.

Could you please share some information and shed some light on how this was done?

Thanks in advance, Venkatesh.

venktesh-bolla commented 7 years ago

Any info or update? -Venktesh.

sectorsize512 commented 7 years ago

What specific info is needed?

venktesh-bolla commented 7 years ago

Hi Vasily, I want to benchmark dmdedup. For this, I have a hard drive with two partitions. I want to reproduce the performance numbers and deduplication ratios stated in the dmdedup documentation.

For instance, if I copy 100GB of data (several files, such as Linux kernel trees) to the dmdedup device, how much gets written to the metadata and data partitions?

In short, I would like to reproduce the numbers tabulated in the dmdedup paper. Could you please tell me exactly how you benchmarked it?

Thanks in advance, Venkatesh

sectorsize512 commented 7 years ago

Hi, as the paper describes on page 10:

"Linux kernels (Figure 6). This dataset contains the source code of 40 Linux kernels from version 2.6.0 to 2.6.39, archived in a single tarball.
We first used an unmodified tar , which aligns files on 512B bound- aries ( tar-512). In this case, the tarball size was 11GB and the deduplication ratio was 1.18. We then modi- fied tar to align files on 4KB boundaries ( tar-4096). In this case, the tarball size was 16GB and the dedu- plication ratio was 1.88. Dmdedup uses 4KB chunking, which is why aligning files on 4KB boundaries increases the deduplication ratio. One can see that although tar- 4096 produces a larger logical tarball, its physical size (16GB / 1.88 = 8.5GB) is actually smaller than the tar- ball produced by tar-512 (11GB / 1.18 = 9 = 9.3GB"

We just used the dd command to write the corresponding data to the dmdedup device.
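Roughly along these lines, assuming the dedup target is mapped as /dev/mapper/mydedup and the kernel tarball sits at /tmp/kernels.tar (both names are just placeholders):

```sh
# Write the tarball sequentially to the dmdedup target in 4KB blocks,
# bypassing the page cache, then read back the target's statistics.
dd if=/tmp/kernels.tar of=/dev/mapper/mydedup bs=4096 oflag=direct
dmsetup status mydedup
```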

A word of warning: we did not use two partitions of the same HDD. Instead, we used a separate SSD for metadata. The paper has the details:

https://www.fsl.cs.sunysb.edu/docs/ols-dmdedup/dmdedup-ols14.pdf
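For reference, the target in that kind of setup is created along these lines. This is only a sketch based on the construction parameters described in the dmdedup documentation; device paths and the logical target size are placeholders, so double-check the exact parameter order against the docs in this repository:

```sh
# Metadata on a separate SSD partition, data on the HDD partition
# (paths are placeholders).
META_DEV=/dev/sdb1
DATA_DEV=/dev/sdc1

# Logical size of the dedup target in 512B sectors; here simply the size
# of the data device, though it can be provisioned larger when a higher
# deduplication ratio is expected.
TARGET_SIZE=$(blockdev --getsz "$DATA_DEV")

# Table format (per the documentation):
# <meta_dev> <data_dev> <block_size> <hash_algo> <backend> <flushrq>
dmsetup create mydedup --table \
    "0 $TARGET_SIZE dedup $META_DEV $DATA_DEV 4096 md5 cowbtree 100"
```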

HTH, Vasily

venktesh-bolla commented 7 years ago

Thanks a lot, that really helped. How can I create a tarball with alignment?

Oliver-Luo commented 7 years ago

Another question: how do you test random writes? For sequential writes, dd is enough, though you still need another device to hold the source data and read from it into the dmdedup device. But I don't know how you test random writes. Did you write a program to do that?

By the way, I'm still wondering how to create a tarball with alignment as well. It would be helpful if you have any clues.

Thanks.

sectorsize512 commented 7 years ago

We used Filebench and modified it to generate data with the required deduplication ratio. I'm attaching an old Filebench patch to give you a sense of it.

I'm attaching the tar patch to this post as well.
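As a simpler stand-in until you adapt those patches (this is not what we used; we used the modified Filebench), recent fio versions can issue random writes with a controlled share of duplicate buffers, which approximates the same idea. Device name and sizes below are placeholders:

```sh
# Random 4KB writes against the dedup target; dedupe_percentage=50 makes
# roughly half of the written buffers identical, i.e. a deduplication
# ratio of about 2.
fio --name=dedup-randwrite --filename=/dev/mapper/mydedup \
    --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
    --size=10g --dedupe_percentage=50
dmsetup status mydedup
```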

sectorsize512 commented 7 years ago

filebench-data-gen.diff.txt tar-4096-blocksize.diff.txt