AccelerateWithOptane / lab

Request access to Optane powered bare metal infrastructure for performance-testing and analysis purposes

Scalable Caching for Managed Big Data Systems on Emerging Storage #27

Open jackkolokasis opened 4 years ago

jackkolokasis commented 4 years ago

If you are interested in filing a request for access to the Accelerate With Optane Community Lab for performance testing, optimization, and analysis, please fill out the details below. Contact Avi Deitcher at avi@packet.net with questions.

Name, email, company, job title

  1. Iacovos Kolokasis, kolokasis@ics.forth.gr, ICS-FORTH, Master’s Student
  2. Anastasios Papagiannis, apapag@ics.forth.gr, ICS-FORTH, PhD Student
  3. Foivos Zakkak, foivos@zakkak.net, Red Hat, Inc., R&D Senior Software Engineer
  4. Shoaib Akram, Shoaib.Akram@anu.edu.au, Australian National University, Professor
  5. Christos Kozanitis, kozanitis@ics.forth.gr, ICS-FORTH, Researcher
  6. Polyvios Pratikakis, polyvios@ics.forth.gr, ICS-FORTH, Professor
  7. Angelos Bilas, bilas@ics.forth.gr, ICS-FORTH, Professor

Note that projects with two or more participants are preferred.

Project Title and brief description

TeraCache: Scalable Caching for Managed Big Data Systems on Emerging Storage

Many analytics computations are dominated by iterative processing stages, executed until a convergence condition is met. To accelerate such workloads while keeping up with the exponential growth of data and the slow scaling of DRAM capacity, Spark employs off-memory caching of intermediate results. However, off-heap caching requires the serialization and deserialization (serdes) of data, which adds significant overhead, especially as datasets grow.

This project explores TeraCache, an extension of the Spark data cache that avoids the need for serdes by keeping all cached data on-heap but off-memory, using memory-mapped I/O (mmio). To achieve this, TeraCache extends the original JVM heap with a managed heap that resides on a memory-mapped fast storage device and is used exclusively for cached data. Preliminary results show that the TeraCache prototype can speed up Machine Learning (ML) workloads that cache intermediate results by up to 37% compared to the state-of-the-art serdes approach.
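To illustrate the mechanism TeraCache builds on, here is a minimal, self-contained sketch of memory-mapped I/O from plain Java. It is not TeraCache code; the file name and region size are hypothetical. The point is that a mapped region is read and written like ordinary memory while the OS pages it to the storage device, so no explicit serdes step is needed:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmioSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical backing file; in TeraCache this would live on a fast storage device.
        Path backing = Paths.get("teracache.img");
        try (FileChannel ch = FileChannel.open(backing,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map a 1 MiB region; the OS pages data in and out on demand.
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);
            region.putLong(0, 42L);                 // "cache" a value directly in the mapped region
            System.out.println(region.getLong(0));  // read it back without any serialization
        }
    }
}
```

TeraCache applies this idea underneath the JVM heap itself, so cached objects stay ordinary on-heap objects from the application's point of view.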

How does the open source community benefit from your work?

We implement TeraCache in the open-source OpenJDK-8 JVM. We currently support Apache Spark, an open-source big data analytics framework, and in the future we plan to add support for Apache Flink and Cassandra.
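For context, the serdes round trip that Spark's serialized caching pays today, and that TeraCache aims to eliminate, can be sketched with only the Java standard library. This is an illustrative stand-in, not Spark's actual serializer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerdesRoundTrip {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Stand-in for a cached partition's data.
        int[] partition = {1, 2, 3, 4};

        // Serialize: object -> bytes (the "ser" half of serdes).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(partition);
        }

        // Deserialize: bytes -> fresh copy (the "des" half, paid on every cache hit).
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            int[] copy = (int[]) in.readObject();
            System.out.println(copy[0] + copy[1] + copy[2] + copy[3]); // prints 10
        }
    }
}
```

With TeraCache, cached data stays on the (storage-backed) heap, so accesses skip both halves of this round trip.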

Is the code that you’re going to run 100% open source? If so, what is the URL or URLs where it is located?

OpenJDK is GPL-licensed and Spark is Apache-licensed; we will open-source our modifications under the same licenses, respectively. The code is not public yet because the work is still in progress and we would like to protect it until the planned publication is out.

Does the infrastructure provided meet your testing needs (see: https://www.acceleratewithoptane.com/access/)?

Yes

Note that the configuration provided was created to enable testing flexibility across a range of potential use cases. Projects are expected to use one system due to limited supply. If additional resources are required, contact avi@packet.net

What performance-focused articles has your project published before?

We published part of our work at the 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '20), co-located with the USENIX ATC '20 conference (https://www.usenix.org/conference/hotstorage20/presentation/kolokasis). That work focuses on performance and disk I/O for big data analytics systems.

Is your project intensely interested in performance, especially where disk I/O is concerned? Have you written about it or shared results of testing? Please share anything that shows your focus.

Yes

Please state your contributions to the open source community and any other relevant initiatives

Feel free to brag a little bit about yourself!

Research Experience:

Industrial Experience:

Bragging:

Would you be willing to share your analysis and results publicly?

We plan to submit our findings to highly prestigious conferences such as ASPLOS and USENIX ATC, and we will provide artefacts and raw data. We will also present our findings at meetups, for instance the Spark Summit.

We are interested in blog posts, meetups and conference presentations. Accelerate With Optane would be more than happy to host your blog posts or link to them, and may coordinate performance-oriented meetups and conferences. Are you open to sharing?

Are you interested in testing Intel Optane SSDs with Intel Memory Drive Technology (IMDT)?

Indeed, we are interested in evaluating IMDT; we would like to see how TeraCache performs on servers with TBs of memory.

IMDT extends system memory transparently by integrating Intel Optane SSD capacity into the memory subsystem. The systems provided have 192GB of DRAM but can be enabled with 1.4TB of software-defined memory while leaving one Intel Optane SSD still available for fast storage/caching usage. Check here for more information on IMDT.