Jiacheng-WU / lipp

Updatable Learned Index with Precise Positions
MIT License

replicate experiments in Figure 5 #4

Closed: zhczhong closed this issue 3 years ago

zhczhong commented 3 years ago

Hi Jiacheng,

I want to reproduce the experiment in Fig. 5 of your paper "Updatable Learned Index with Precise Positions", but my benchmark results show that LIPP is not as strong as the paper suggests.

[figure: benchmark results]

Is it convenient to share your micro-benchmark code to help me reproduce that figure?

Thanks in advance!

Regards, Zhicong

Jiacheng-WU commented 3 years ago

It is somewhat weird, but it might be reasonable. Our micro-benchmark is similar to our examples.

In fact, we ran experiments on several different machines, and on some of them the results were worse. The performance of different learned indexes depends heavily on CPU frequency and memory bandwidth.

Besides, to get better performance from our index, you could use the "main" branch and slightly adjust the dataset to avoid the precision problem. Some fixes in the "develop" branch affect performance. At that time, SOSD had not been released yet, so we overlooked some of the issues you mentioned before.

Moreover, you could customize the compute_gap_count function at line 42 in lipp/lipp.h when you have more memory. Reserving more gaps in the nodes reduces adjustments and achieves better performance, at the cost of more memory.
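To make the suggestion concrete, here is a minimal sketch of what such a customization might look like. The exact signature, thresholds, and return values below are assumptions for illustration, not the code actually shipped in lipp/lipp.h or the settings used in the paper:

```cpp
// Hypothetical sketch of a customized compute_gap_count in lipp/lipp.h.
// It decides how many gap (empty) slots to reserve per key when building
// a node: more gaps mean fewer future node adjustments but more memory.
static int compute_gap_count(int size) {
    if (size >= 1000000) return 2;  // large nodes: keep memory overhead modest
    if (size >= 100000)  return 3;
    return 5;                       // small nodes: reserve gaps generously
}
```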

zhczhong commented 3 years ago

Thanks for your quick reply! With the code in the main branch, performance is a bit better but still not consistently stronger than ALEX. Besides, with the main branch I could not bulk-load the YCSB dataset.

[figure: benchmark results]

Could you kindly tell me what gap_count you used in your paper and how to adjust the dataset to avoid the precision problem? Also, what are the CPU frequency and memory bandwidth of the machine in your experiments? If possible, I will use a machine with the same configuration.

Regards, Zhicong

Jiacheng-WU commented 3 years ago

For YCSB, one possible fix is to remove the duplicated keys that arise after transforming uint64_t to double: two consecutive uint64_t keys can become identical after conversion to double because of the lost precision. The gap_count we use is 5 in most cases, which incurs some memory overhead. As for the CPU frequency and memory bandwidth, I cannot access the server right now, so I cannot tell you precisely. However, I remember that our index behaves well on machines with larger memory bandwidth and lower CPU frequency. I suggest testing the indexes on machines with different configurations.
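For illustration, here is a hedged sketch of the deduplication step described above, assuming the keys are already sorted for bulk loading (the function name is hypothetical, not part of LIPP):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Remove keys that become indistinguishable after a uint64_t -> double
// conversion: double has a 53-bit mantissa, so nearby large keys collide.
// Assumes `keys` is sorted, as required for bulk loading.
std::vector<uint64_t> dedup_for_double_precision(std::vector<uint64_t> keys) {
    auto same_as_double = [](uint64_t a, uint64_t b) {
        return static_cast<double>(a) == static_cast<double>(b);
    };
    keys.erase(std::unique(keys.begin(), keys.end(), same_as_double),
               keys.end());
    return keys;
}
```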

In fact, during the experiments we found that the hardware configuration and microarchitecture really influence performance. We used Intel VTune to analyze the bottlenecks, which differ considerably across architectures, so we simply tried our best to tune the performance on our machine. Another possible reason might be the version of ALEX: because of when our project started, we could only use an early version of ALEX, which also affects ALEX's performance. Moreover, we are also interested in studying how the behavior of learned indexes is influenced by hardware configuration and microarchitecture, including instruction cycles and pipelining.

Sincerely, Jiacheng

zhczhong commented 3 years ago

Thanks for your reply! But I still have several questions.

  1. YCSB and lognormal are datasets with 64-bit integer keys. Did you transform all of them into the double type in your paper's experiments?
  2. You mentioned that you "tried your best to tune the performance on your machine". Could you tell me what kinds of optimizations you did? And did you apply the same optimizations to the other competitor indexes?

Regards, Zhicong

Jiacheng-WU commented 3 years ago
  1. No, it is not necessary to convert the keys to the double type before passing them to the index. What I mean is to remove keys that are the same (i.e., too close given double precision) after being converted to double; the remaining keys are still uint64_t. The main reason is that we convert uint64_t to double in the model calculations (see the short demonstration after this list). In fact, I remember that the early version of ALEX had the same issue at that time.
  2. Here what I mean is that we adjusted the code many times based on VTune analysis on our machines. Whenever we found that a code fragment was a bottleneck, we tried our best to relieve or resolve it, including the layout of our nodes, bitmap management, small-node allocation, etc. Those optimizations are specific to our index structure and tied to the concrete microarchitecture, given that VTune reports memory latency, cache misses, and instructions retired. In a word, those optimizations come from the analysis and do not transfer to other indexes, so we did not (and could not) apply them to the other competitors.
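To illustrate the precision issue from point 1, here is a small self-contained demonstration (not from the LIPP code) showing two distinct uint64_t keys that compare equal once cast to double:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // double has a 53-bit mantissa, so consecutive integers above 2^53
    // can round to the same double value.
    uint64_t a = (1ULL << 53);      // 9007199254740992, exactly representable
    uint64_t b = (1ULL << 53) + 1;  // rounds to the same double as `a`
    std::cout << std::boolalpha
              << (static_cast<double>(a) == static_cast<double>(b))  // true
              << std::endl;
}
```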

Sincerely, Jiacheng