apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[CORE] LocalPartitionWriter causes OOM during mergeSpills #7860

Open ccat3z opened 2 weeks ago

ccat3z commented 2 weeks ago

Backend

VL (Velox)

Bug description

LocalPartitionWriter::mergeSpills uses arrow::io::MemoryMappedFile to read all spill files, but the mapped memory is not munmapped in time, which results in a significant consumption of RssFile.
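
A minimal standalone sketch of the effect (illustrative only, not Gluten code; the file path is a placeholder): every page touched through a long-lived read-only mapping stays charged to RssFile until munmap is called.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc < 2) return 1;
  int fd = open(argv[1], O_RDONLY);
  struct stat st;
  if (fd < 0 || fstat(fd, &st) != 0) return 1;
  char* p = static_cast<char*>(
      mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
  if (p == MAP_FAILED) return 1;
  // Touching each page faults it in; every touched page is counted in RssFile
  // (see /proc/self/status) until the munmap below.
  long sum = 0;
  for (off_t i = 0; i < st.st_size; i += 4096) sum += p[i];
  std::printf("checksum=%ld\n", sum);
  munmap(p, st.st_size);  // only now does the file-backed RSS drop
  close(fd);
  return 0;
}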

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

zhztheplayer commented 2 weeks ago

Would you like to post the OOM error message you have seen? Or is it only an abnormal consumption of RssFile?

ccat3z commented 2 weeks ago

The error message is the same as in https://github.com/apache/incubator-gluten/issues/6947: killed by YARN. But the reasons are different.

zhouyuan commented 2 weeks ago

CC @marin-ma

FelixYBW commented 1 week ago

So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.

Let's see if ReadableFile performance is the same as MemoryMappedFile. It's an easy fix if so; otherwise we need to manually unmap the file.
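
A rough sketch of what the ReadableFile-based quick fix amounts to: plain buffered reads with no mmap, so no file-backed pages accumulate in RssFile. CopySpill, the output stream, and the 1MB chunk size are placeholders, not the actual mergeSpills code.

#include <arrow/buffer.h>
#include <arrow/io/file.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <memory>
#include <string>

arrow::Status CopySpill(const std::string& spillPath, arrow::io::OutputStream* out) {
  // ReadableFile issues ordinary reads instead of mapping the file.
  ARROW_ASSIGN_OR_RAISE(auto in, arrow::io::ReadableFile::Open(spillPath));
  constexpr int64_t kChunk = 1 << 20;  // 1MB per read
  while (true) {
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buf, in->Read(kChunk));
    if (buf->size() == 0) break;  // EOF
    ARROW_RETURN_NOT_OK(out->Write(buf));
  }
  return in->Close();
}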

FelixYBW commented 1 week ago

@jinchengchenghh, what approach does Velox use to merge spill files?

The most efficient way should be mmap + MADV_SEQUENTIAL + manual munmap:

https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks
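
For illustration, a minimal standalone sketch of that pattern; readWholeFileSequential is a made-up helper, not Gluten or Arrow code, and error handling is minimal.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstring>
#include <vector>

// Map the whole file, hint sequential access, copy it out, then unmap right away.
std::vector<char> readWholeFileSequential(const char* path) {
  std::vector<char> out;
  int fd = open(path, O_RDONLY);
  struct stat st;
  if (fd < 0 || fstat(fd, &st) != 0) return out;
  void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) { close(fd); return out; }
  madvise(addr, st.st_size, MADV_SEQUENTIAL);  // aggressive read-ahead, cheap reclaim
  out.resize(st.st_size);
  std::memcpy(out.data(), addr, st.st_size);
  munmap(addr, st.st_size);  // release the file-backed pages immediately
  close(fd);
  return out;
}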

kecookier commented 1 week ago

Would you like to post the OOM error message you have seen? Or is it only an abnormal consumption of RssFile?

@zhztheplayer We caught this issue by dumping /proc/self/status when the executor is killed. It shows that RssFile is almost 3G when VmRSS is 3.5G. This issue affects a number of our large-scale ETL processes. (Screenshot of /proc/self/status attached.)
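
A minimal sketch of that kind of diagnostic; dumpRssCounters is a hypothetical helper, not the tooling actually used here.

#include <fstream>
#include <iostream>
#include <string>

// Print the VmRSS (total resident) and RssFile (file-backed / mmap resident)
// counters of the current process, as reported by the kernel.
void dumpRssCounters() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0 || line.rfind("RssFile:", 0) == 0) {
      std::cerr << line << '\n';
    }
  }
}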

So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.

Let's see if ReadableFile performance is the same as MemoryMappedFile. It's an easy fix if so; otherwise we need to manually unmap the file.

@FelixYBW The Arrow MemoryMappedFile does not expose a method to get the underlying mmap address. However, MemoryMappedFile::MemoryMap has a method to get the region data pointer, so we could add a method to MemoryMappedFile that returns head() or data().

Another approach is in mergeSpills(), where we open the file for each partition and close it when finished. We ran some internal performance tests; they show ReadableFile may have some regression.

@ccat3z Can you commit a PR and trigger the community performance benchmark by adding the comment /Benchmark Velox?

FelixYBW commented 1 week ago

The Velox benchmark doesn't cover spill. How much is the performance regression? Is the regression compared with the mmap-based file or with vanilla Spark?

FelixYBW commented 1 week ago

@marin-ma See if there is any way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly.

ccat3z commented 1 week ago

@marin-ma See if there is any way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly.

I tested the following patch on Arrow; it seems the memory is still not released immediately. It might still require an explicit MADV_DONTNEED in this case.

diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc   2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc        2024-11-09 14:50:37.035869206 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>

 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -720,6 +727,27 @@
   return ::arrow::internal::MemoryAdviseWillNeed(regions);
 }
FelixYBW commented 1 week ago

return ::arrow::internal::MemoryAdviseWillNeed(regions); }

In io_util.cc, madvise(willneed) is called again. You may disable it.

FelixYBW commented 1 week ago

MemoryMappedFile has Region; when the region is destructed, it calls munmap. Let's see how we can make use of it.

marin-ma commented 1 week ago

Another approach is in mergeSpills(), where we open the file for each partition and close it when finished.

@kecookier Not sure how much performance impact this might introduce. This approach requires invoking mmap and munmap for each partition, and some partitions in a single spill file may be quite small.

We ran some internal performance tests; they show ReadableFile may have some regression.

How much performance regression do you see? Could you share some results? Thanks!

jinchengchenghh commented 1 week ago

Velox uses SpillReadFile to read the file; it uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile pread or preadv to read the file.

As I see, it reads bufferSize bytes at a time, controlled by the QueryConfig kSpillReadBufferSize (default 1MB). Note: if the file system supports async read, it reads double bufferSize at a time. @FelixYBW

      readBytes = readSize();
      VELOX_CHECK_LT(
          0, readBytes, "Read past end of FileInputStream {}", fileSize_);
      NanosecondTimer timer_2{&readTimeNs};
      file_->pread(fileOffset_, readBytes, buffer()->asMutable<char>());

uint64_t FileInputStream::readSize() const {
  return std::min(fileSize_ - fileOffset_, bufferSize_);
}
/* Read data from file descriptor FD at the given position OFFSET
   without change the file pointer, and put the result in the buffers
   described by IOVEC, which is a vector of COUNT 'struct iovec's.
   The buffers are filled in the order specified.  Operates just like
   'pread' (see <unistd.h>) except that data are put in IOVEC instead
   of a contiguous buffer.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern ssize_t preadv (int __fd, const struct iovec *__iovec, int __count,
               __off_t __offset) __wur;
ccat3z commented 1 week ago

How much performance regression do you see? Could you share some results? Thanks!

@marin-ma After internal re-testing last weekend, no noticeable performance regression was found.

ccat3z commented 1 week ago

In io_util.cc, madvise(willneed) is called again. You may disable it.

The new patch is here, but it isn't working either. There might still be some code paths that I haven't found; I'll recheck if required.

diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc   2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc        2024-11-09 18:09:06.540567675 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>

 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -660,8 +667,8 @@
   ARROW_ASSIGN_OR_RAISE(
       nbytes, internal::ValidateReadRange(position, nbytes, memory_map_->size()));
   // Arrange to page data in
-  RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
-      {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
+  // RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
+  //     {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
   return memory_map_->Slice(position, nbytes);
 }
FelixYBW commented 1 week ago

If ReadableFile doesn't have a significant perf loss, let's use the quick fix and optimize it with mmap in the future. @marin-ma Can you test #7861 on our Jenkins with spill?

FelixYBW commented 1 week ago

Velox uses SpillReadFile to read the file; it uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile pread or preadv to read the file.

The optimal way should be to map about 1MB at a time and unmap it once accessed. It looks like MemoryMappedFile::Region could implement this.
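
A standalone sketch of that windowed idea; forEachWindow is illustrative only, and the real merge code would feed each window to the output stream rather than a callback.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <functional>

// Map roughly 1MB of the file at a time, hand it to the consumer, then unmap
// before moving on, so resident file-backed memory stays bounded by the window.
void forEachWindow(const char* path,
                   const std::function<void(const char*, size_t)>& consume) {
  constexpr size_t kWindow = 1 << 20;  // 1MB; a multiple of the page size, as mmap requires
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return; }
  for (off_t off = 0; off < st.st_size; off += kWindow) {
    size_t len = std::min<size_t>(kWindow, static_cast<size_t>(st.st_size - off));
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off);
    if (addr == MAP_FAILED) break;
    consume(static_cast<const char*>(addr), len);
    munmap(addr, len);  // only ~1MB of the file is ever resident at once
  }
  close(fd);
}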

FelixYBW commented 1 week ago

In io_util.cc, madvise(willneed) is called again. You may disable it.

The new patch is here, but it isn't working either. There might still be some code paths that I haven't found; I'll recheck if required.


Thank you for your test. Can you try posix_fadvise? We have a case showing that madvise doesn't work on file mappings.
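
A hedged sketch of the posix_fadvise alternative; drainFile and the 1MB chunk are placeholders, not a tested fix.

#include <fcntl.h>
#include <unistd.h>

#include <vector>

// Read with plain pread and drop the already-consumed byte range from the
// page cache with POSIX_FADV_DONTNEED. fadvise acts on the page cache of the
// file descriptor, while madvise acts on a mapping, which is one reason their
// behavior can differ.
void drainFile(const char* path) {
  constexpr size_t kChunk = 1 << 20;
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  std::vector<char> buf(kChunk);
  off_t off = 0;
  ssize_t n;
  while ((n = pread(fd, buf.data(), buf.size(), off)) > 0) {
    // ... consume buf[0 .. n) here ...
    posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);  // evict consumed pages
    off += n;
  }
  close(fd);
}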

FelixYBW commented 1 week ago

Just found some test:

(Screenshot of test output attached.)

ccat3z commented 1 week ago

Just found some test:

(Screenshot of test output attached.)

Could you explain what each column in the stdout represents?

FelixYBW commented 1 week ago

Could you explain what each column in the stdout represents?

Not sure :( It's a test from two years ago.

FelixYBW commented 1 week ago

Some links about mmap we referred to before:

http://xoyo.space/2017/11/mmap-performance-analyzing-and-tuning/
https://github.com/xoyowade/mmap_benchmark
https://stackoverflow.com/questions/30470972/using-mmap-and-madvise-for-huge-pages
https://lwn.net/Articles/590693/

https://www.usenix.org/system/files/conference/hotstorage17/hotstorage17-paper-choi.pdf

FelixYBW commented 1 week ago

Summary: (screenshot attached.)

ccat3z commented 1 week ago

I wrote a simple gist to dump memory usage while reading a file. Only MADV_DONTNEED and munmap can release RssFile immediately. MAP_POPULATE significantly improves performance, but it uses hundreds of MB of uncontrollable RssFile, which is not acceptable in this case.

So I believe that the combination of manual MADV_WILLNEED and MADV_DONTNEED is the best solution. I will test it in Gluten later.
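
Roughly what I mean, as a standalone sketch; streamWithMadvise and the 1MB window are illustrative, not the actual patch to be tested.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cstring>

// Over one long-lived mapping: prefetch the window about to be read with
// MADV_WILLNEED and drop the window just finished with MADV_DONTNEED, so
// RssFile stays around one window in size.
void streamWithMadvise(const char* path, char* out) {
  constexpr size_t kWindow = 1 << 20;  // page-aligned 1MB window
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return; }
  char* base = static_cast<char*>(
      mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
  if (base == MAP_FAILED) { close(fd); return; }
  size_t size = static_cast<size_t>(st.st_size);
  for (size_t off = 0; off < size; off += kWindow) {
    size_t len = std::min(kWindow, size - off);
    madvise(base + off, len, MADV_WILLNEED);   // page this window in
    std::memcpy(out + off, base + off, len);   // consume it
    madvise(base + off, len, MADV_DONTNEED);   // release it immediately
  }
  munmap(base, st.st_size);
  close(fd);
}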