Open · ccat3z opened this issue 2 weeks ago
Would you like to post the OOM error message you have seen? Or is it only an abnormal consumption of RssFile?
The error message is the same as in https://github.com/apache/incubator-gluten/issues/6947, killed by YARN. But the root causes are different.
CC @marin-ma
So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.
Let's see if the ReadableFile performance is the same as MemoryMappedFile. It's an easy fix if so. Otherwise we need to manually unmap the file.
@jinchengchenghh, what approach does Velox use to merge spill files?
The most efficient way should be mmap + MADV_SEQUENTIAL + manual munmap.
https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks
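For reference, a minimal sketch of that pattern (not Gluten's actual code; error handling trimmed):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

// Read a spill file sequentially through mmap with MADV_SEQUENTIAL, then
// release the mapping explicitly instead of waiting for the file to be closed.
int readSpillFileMapped(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }
  size_t len = static_cast<size_t>(st.st_size);
  void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) { close(fd); return -1; }
  // Hint that access is sequential so read-ahead is aggressive and already
  // consumed pages become easy to reclaim.
  madvise(addr, len, MADV_SEQUENTIAL);

  const unsigned char* data = static_cast<const unsigned char*>(addr);
  unsigned long checksum = 0;
  for (size_t i = 0; i < len; ++i) checksum += data[i];  // stand-in for real consumption
  printf("checksum=%lu\n", checksum);

  // Explicit munmap drops the pages from RssFile right away.
  munmap(addr, len);
  close(fd);
  return 0;
}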
Would you like to post the OOM error message you have seen? Or is it only an abnormal consumption of RssFile?
@zhztheplayer We caught this issue by dumping /proc/self/status when the executor was killed. It shows that RssFile is almost 3G while VmRSS is 3.5G. This issue affects a number of our large-scale ETL processes.
So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.
Let's see if the ReadableFile performance is the same as MemoryMappedFile. It's an easy fix if so. Otherwise we need to manually unmap the file.
@FelixYBW The Arrow MemoryMappedFile does not provide a method to get the underlying mmap address. However, MemoryMappedFile::MemoryMap has a method to get the region data pointer, so we might add a method for MemoryMappedFile to return head() or data().
Another approach is in mergeSpills(), where we open the file for each partition and close it when finished. We did some internal performance tests; they show ReadableFile may have some regression.
@ccat3z Can you commit a PR and trigger the community performance benchmark by adding the comment /Benchmark Velox?
The Velox benchmark doesn't have spill. How much performance regression is there? Is the regression compared with the mmap file or with vanilla Spark?
@marin-ma See if there is any way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly.
I tested the following patch on Arrow; it seems the memory is not released immediately. It might still require an explicit MADV_DONTNEED in this case.
diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc 2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc 2024-11-09 14:50:37.035869206 +0800
@@ -41,6 +41,7 @@
#include <sstream>
#include <string>
#include <utility>
+#include <iostream>
// ----------------------------------------------------------------------
// Other Arrow includes
@@ -575,6 +576,12 @@
return Status::IOError("Memory mapping file failed: ",
::arrow::internal::ErrnoMessage(errno));
}
+ int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+ if (madv_res != 0) {
+ return Status::IOError("madvise failed: ",
+ ::arrow::internal::ErrnoMessage(errno));
+ }
+ std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
map_len_ = mmap_length;
offset_ = offset;
region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -720,6 +727,27 @@
return ::arrow::internal::MemoryAdviseWillNeed(regions);
}
In io_util.cc, madvise(willneed) is called again. You may disable it.
MemoryMappedFile has a Region; when the Region is destructed, it calls munmap. See how we can make use of it.
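For illustration, a hedged sketch of that pattern in isolation (a generic RAII guard in the spirit of Arrow's internal Region, not the actual class):

#include <sys/mman.h>
#include <cstddef>

// Owns a mapped range and munmaps it on destruction, so every consumed region
// gives its RssFile back as soon as it goes out of scope.
class MappedRegion {
 public:
  MappedRegion(void* addr, size_t length) : addr_(addr), length_(length) {}
  ~MappedRegion() {
    if (addr_ != nullptr) {
      munmap(addr_, length_);
    }
  }
  MappedRegion(const MappedRegion&) = delete;
  MappedRegion& operator=(const MappedRegion&) = delete;

  const void* data() const { return addr_; }
  size_t size() const { return length_; }

 private:
  void* addr_;
  size_t length_;
};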
Another approach is in mergeSpills(), where we open the file for each partition and close it when finished.
@kecookier Not sure how much performance impact this might introduce. This approach requires invoking mmap and munmap for each partition, and some partitions in a single spill file may be quite small.
We did some internal performance tests; they show ReadableFile may have some regression.
How much performance regression do you see? Could you share some results? Thanks!
Velox uses SpillReadFile to read the spill file. It uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile's pread or preadv to read the file.
As far as I can see, it reads bufferSize bytes at a time, controlled by the QueryConfig kSpillReadBufferSize (default 1MB). Note: if the file system supports async read, it reads double the bufferSize at a time.
@FelixYBW
  readBytes = readSize();
  VELOX_CHECK_LT(
      0, readBytes, "Read past end of FileInputStream {}", fileSize_);
  NanosecondTimer timer_2{&readTimeNs};
  file_->pread(fileOffset_, readBytes, buffer()->asMutable<char>());

uint64_t FileInputStream::readSize() const {
  return std::min(fileSize_ - fileOffset_, bufferSize_);
}
/* Read data from file descriptor FD at the given position OFFSET
without change the file pointer, and put the result in the buffers
described by IOVEC, which is a vector of COUNT 'struct iovec's.
The buffers are filled in the order specified. Operates just like
'pread' (see <unistd.h>) except that data are put in IOVEC instead
of a contiguous buffer.
This function is a cancellation point and therefore not marked with
__THROW. */
extern ssize_t preadv (int __fd, const struct iovec *__iovec, int __count,
__off_t __offset) __wur;
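For illustration, a tiny preadv usage sketch with two scatter buffers (hypothetical buffers, not Velox code):

#include <sys/types.h>
#include <sys/uio.h>

// Fill two buffers from fd at the given offset without moving the file pointer.
// The buffers are filled in order; returns total bytes read, or -1 on error.
ssize_t readTwoBuffers(int fd, off_t offset,
                       char* buf1, size_t n1,
                       char* buf2, size_t n2) {
  struct iovec iov[2];
  iov[0].iov_base = buf1;
  iov[0].iov_len = n1;
  iov[1].iov_base = buf2;
  iov[1].iov_len = n2;
  return preadv(fd, iov, 2, offset);
}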
How much performance regression do you see? Could you share some results? Thanks!
@marin-ma After internal re-testing last weekend, no noticeable performance regression was found.
In io_util.cc, madvise(willneed) is called again. You may disable it.
The new patch is here, but it's not working either. There might still be some code paths that haven't been found; I'll recheck if required.
diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc 2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc 2024-11-09 18:09:06.540567675 +0800
@@ -41,6 +41,7 @@
#include <sstream>
#include <string>
#include <utility>
+#include <iostream>
// ----------------------------------------------------------------------
// Other Arrow includes
@@ -575,6 +576,12 @@
return Status::IOError("Memory mapping file failed: ",
::arrow::internal::ErrnoMessage(errno));
}
+ int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+ if (madv_res != 0) {
+ return Status::IOError("madvise failed: ",
+ ::arrow::internal::ErrnoMessage(errno));
+ }
+ std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
map_len_ = mmap_length;
offset_ = offset;
region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -660,8 +667,8 @@
ARROW_ASSIGN_OR_RAISE(
nbytes, internal::ValidateReadRange(position, nbytes, memory_map_->size()));
// Arrange to page data in
- RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
- {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
+ // RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
+ // {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
return memory_map_->Slice(position, nbytes);
}
If ReadableFile doesn't have a significant perf loss, let's use the quick fix and optimize it with mmap in the future. @marin-ma can you test #7861 on our Jenkins with spill?
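For context, a hedged sketch of what the ReadableFile-based path might look like (illustrative only, not the actual change in #7861; readSpillRange is a made-up helper):

#include <arrow/buffer.h>
#include <arrow/io/file.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <memory>
#include <string>

// Read one spill range through a pread-backed ReadableFile instead of mmap,
// so no file pages stay resident (RssFile) after the read completes.
arrow::Result<std::shared_ptr<arrow::Buffer>> readSpillRange(
    const std::string& path, int64_t offset, int64_t length) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto buffer, file->ReadAt(offset, length));
  ARROW_RETURN_NOT_OK(file->Close());
  return buffer;
}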
Velox uses SpillReadFile to read the spill file. It uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile's pread or preadv to read the file.
The optimal way would be to map about 1MB at a time and unmap each chunk once it has been accessed. It looks like MemoryMappedFile::Region can implement it.
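Roughly, that chunked approach could look like the sketch below (assumptions: 1MB windows, which keep mmap offsets page-aligned; not the Region-based implementation itself):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>

// Map ~1MB windows of the spill file one at a time and unmap each window as
// soon as it has been consumed, so RssFile stays bounded by the window size.
void consumeMappedWindows(const char* path, size_t window = 1 << 20) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return; }
  const size_t fileSize = static_cast<size_t>(st.st_size);

  for (size_t off = 0; off < fileSize; off += window) {
    const size_t len = std::min(window, fileSize - off);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd,
                      static_cast<off_t>(off));
    if (addr == MAP_FAILED) break;
    // ... consume [addr, addr + len) here ...
    munmap(addr, len);  // released immediately; RssFile stays around one window
  }
  close(fd);
}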
In io_util.cc, madvise(willneed) is called again. You may disable it.
The new patch is here, but it's not working either. There might still be some code paths that haven't been found; I'll recheck if required.
Thank you for your test. Can you try posix_fadvise? We have a case showing that madvise doesn't work on file mappings.
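For reference, a minimal sketch of the posix_fadvise variant, assuming a plain pread loop rather than the actual spill reader:

#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Read the whole file with pread and ask the kernel to drop each chunk from
// the page cache right after it has been consumed.
void readAndDrop(int fd, off_t fileSize, size_t chunk = 1 << 20) {
  std::vector<char> buf(chunk);
  posix_fadvise(fd, 0, fileSize, POSIX_FADV_SEQUENTIAL);  // aggressive read-ahead
  for (off_t off = 0; off < fileSize;) {
    ssize_t n = pread(fd, buf.data(), buf.size(), off);
    if (n <= 0) break;
    // ... consume buf[0, n) here ...
    posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);  // evict the pages just read
    off += n;
  }
}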
Just found some test:
Could you explain what each column in the stdout represents?
Not sure :( It's a test from two years ago.
Some links about mmap we referred to before:
http://xoyo.space/2017/11/mmap-performance-analyzing-and-tuning/
https://github.com/xoyowade/mmap_benchmark
https://stackoverflow.com/questions/30470972/using-mmap-and-madvise-for-huge-pages
https://lwn.net/Articles/590693/
https://www.usenix.org/system/files/conference/hotstorage17/hotstorage17-paper-choi.pdf
Summary:
I wrote a simple gist to dump memory usage while reading a file. Only MADV_DONTNEED and munmap can release RssFile immediately. MAP_POPULATE significantly improves performance, but MAP_POPULATE will use hundreds of MB of uncontrollable RssFile, which is not acceptable in this case.
So I believe the combination of manual MADV_WILLNEED and MADV_DONTNEED is the best solution. I will test it in Gluten later.
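Based on that summary, the combination might look roughly like the sketch below (under the same assumptions as the gist, not its actual code):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>

// Keep one full mapping, prefetch the window about to be read with
// MADV_WILLNEED, and release the window just consumed with MADV_DONTNEED,
// so RssFile stays around one window in size.
void streamWithMadvise(const char* path, size_t window = 1 << 20) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return; }
  const size_t len = static_cast<size_t>(st.st_size);
  void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) { close(fd); return; }
  char* base = static_cast<char*>(addr);

  for (size_t off = 0; off < len; off += window) {
    const size_t cur = std::min(window, len - off);
    madvise(base + off, cur, MADV_WILLNEED);   // prefetch this window
    // ... consume [base + off, base + off + cur) here ...
    madvise(base + off, cur, MADV_DONTNEED);   // drop its physical pages after use
  }
  munmap(base, len);
  close(fd);
}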
Backend
VL (Velox)
Bug description
LocalPartitionWriter::mergeSpills uses arrow::io::MemoryMappedFile to read all spill files. But it is not munmapped in time, which results in a significant consumption of RssFile.
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response