AccelerateWithOptane / lab

Request access to Optane powered bare metal infrastructure for performance-testing and analysis purposes

Scalable Caching for Managed Big Data Systems on Emerging Storage #27

Open jackkolokasis opened 4 years ago

jackkolokasis commented 4 years ago

If you are interested in filing a request for access to the Accelerate With Optane Community Lab for performance testing, optimization, and analysis, please fill out the details below. Contact Avi Deitcher at avi@packet.net with questions.

Name, email, company, job title

  1. Iacovos Kolokasis, kolokasis@ics.forth.gr, ICS-FORTH, Master’s Student
  2. Anastasios Papagiannis, apapag@ics.forth.gr, ICS-FORTH, PhD Student
  3. Foivos Zakkak, foivos@zakkak.net, Red Hat, Inc., R&D Senior Software Engineer
  4. Shoaib Akram, Shoaib.Akram@anu.edu.au, Australian National University, Professor
  5. Christos Kozanitis, kozanitis@ics.forth.gr, ICS-FORTH, Researcher
  6. Polyvios Pratikakis, polyvios@ics.forth.gr, ICS-FORTH, Professor
  7. Angelos Bilas, bilas@ics.forth.gr, ICS-FORTH, Professor

Note that projects with two or more participants are preferred.

Project Title and brief description

TeraCache: Scalable Caching for Managed Big Data Systems on Emerging Storage

Many analytics computations are dominated by iterative processing stages, executed until a convergence condition is met. To accelerate such workloads while keeping up with the exponential growth of data and the slow scaling of DRAM capacity, Spark employs off-memory caching of intermediate results. However, off-heap caching requires the serialization and deserialization (serdes) of data, which adds significant overhead, especially as datasets grow.

This project explores TeraCache, an extension of the Spark data cache that avoids the need for serdes by keeping all cached data on-heap but off-memory, using memory-mapped I/O (mmio). To achieve this, TeraCache extends the original JVM heap with a managed heap that resides on a memory-mapped fast storage device and is used exclusively for cached data. Preliminary results show that the TeraCache prototype can speed up Machine Learning (ML) workloads that cache intermediate results by up to 37% compared to the state-of-the-art serdes approach.
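To illustrate the mechanism TeraCache builds on, here is a minimal, self-contained sketch of memory-mapped I/O from plain Java. It is not TeraCache code; the file name and region size are hypothetical. The point is that a mapped region is read and written like ordinary memory while the OS pages it to the storage device, so no explicit serdes step is needed:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmioSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical backing file; in TeraCache this would live on a fast storage device.
        Path backing = Paths.get("teracache.img");
        try (FileChannel ch = FileChannel.open(backing,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map a 1 MiB region; the OS pages data in and out on demand.
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);
            region.putLong(0, 42L);                 // "cache" a value directly in the mapped region
            System.out.println(region.getLong(0));  // read it back without any serialization
        }
    }
}
```

TeraCache applies this idea underneath the JVM heap itself, so cached objects stay ordinary on-heap objects from the application's point of view.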

How does the open source community benefit from your work?

We implement TeraCache in the open-source OpenJDK-8 JVM. We currently support Apache Spark, an open-source big data analytics framework, and in the future we plan to add support for Apache Flink and Cassandra.
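For context, the serdes round trip that Spark's serialized caching pays today, and that TeraCache aims to eliminate, can be sketched with only the Java standard library. This is an illustrative stand-in, not Spark's actual serializer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerdesRoundTrip {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Stand-in for a cached partition's data.
        int[] partition = {1, 2, 3, 4};

        // Serialize: object -> bytes (the "ser" half of serdes).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(partition);
        }

        // Deserialize: bytes -> fresh copy (the "des" half, paid on every cache hit).
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            int[] copy = (int[]) in.readObject();
            System.out.println(copy[0] + copy[1] + copy[2] + copy[3]); // prints 10
        }
    }
}
```

With TeraCache, cached data stays on the (storage-backed) heap, so accesses skip both halves of this round trip.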

Is the code that you’re going to run 100% open source? If so, what is the URL or URLs where it is located?

OpenJDK is GPL-licensed and Spark is Apache-licensed; we will open-source our modifications under the same licenses, respectively. The code is not public yet because the work is still in progress and we would like to protect it until the planned publication is out.

Does the infrastructure provided meet your testing needs (see: https://www.acceleratewithoptane.com/access/)?

Yes

Note that the configuration provided was created to enable testing flexibility across a range of potential use cases. Projects are expected to use one system due to limited supply. If additional resources are required, contact avi@packet.net

What performance-focused articles has your project published before?

We published part of our work at the 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '20), co-located with the USENIX ATC '20 conference (https://www.usenix.org/conference/hotstorage20/presentation/kolokasis). That work focuses on performance and disk I/O for big data analytics systems.

Is your project intensely interested in performance, especially where disk I/O is concerned? Have you written about it or shared results of testing? Please share anything that shows your focus.

Yes

Please state your contributions to the open source community and any other relevant initiatives

Feel free to brag a little bit about yourself!

Research Experience:

Industrial Experience:

Bragging:

Would you be willing to share your analysis and results publicly?

We plan to submit our findings to highly prestigious conferences such as ASPLOS and USENIX ATC, and we will provide artefacts and raw data. We will also present our findings at meetups, for instance the Spark Summit.

We are interested in blog posts, meetups and conference presentations. Accelerate With Optane would be more than happy to host your blog posts or link to them, and may coordinate performance-oriented meetups and conferences. Are you open to sharing?

Are you interested in testing Intel Optane SSDs with Intel Memory Drive Technology (IMDT)?

Indeed, we are interested in evaluating IMDT; we would like to see how TeraCache performs on servers with TBs of memory.

IMDT extends system memory transparently by integrating Intel Optane SSD capacity into the memory subsystem. The systems provided have 192GB of DRAM but can be enabled with 1.4TB of software-defined memory while leaving one Intel Optane SSD still available for fast storage/caching usage. Check here for more information on IMDT.