apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Performance problem in segment build #7929

Closed sajjad-moradi closed 2 years ago

sajjad-moradi commented 2 years ago

We found that during segment completion, segment build takes a long time. For one table it now takes more than half an hour to build an immutable segment, while it used to take around one minute. After some investigation, it turned out that the root cause is this PR: https://github.com/apache/pinot/pull/7595. More specifically, the issue is in the refactoring of BaseChunkSVForwardIndexWriter. We used to have a separate in-memory byte buffer that each chunk was compressed into before its contents were written to the index file:

      // compress the chunk into the reusable in-memory buffer
      sizeToWrite = _chunkCompressor.compress(_chunkBuffer, _compressedBuffer);
      // write the compressed bytes to the index file at the current offset
      _dataFile.write(_compressedBuffer, _dataOffset);
      // clear the buffer so the same object can be reused for the next chunk
      _compressedBuffer.clear();

After writing the chunk, the byte buffer gets cleared and the same object is reused in the next writeChunk call. After the refactoring, the reusable byte buffer is gone: in every writeChunk call, a small part of the index file gets memory mapped into a new MappedByteBuffer, and the chunk data is compressed into that mapped byte buffer, which in turn is automatically written to the index file.

    // upper bound on the compressed size of this chunk
    int maxCompressedSize = _chunkCompressor.maxCompressedSize(_chunkBuffer.limit());
    // map a small region of the index file for just this one chunk
    try (PinotDataBuffer compressedBuffer = PinotDataBuffer.mapFile(_file, false, _dataOffset,
        maxCompressedSize, ByteOrder.BIG_ENDIAN, "forward index chunk")) {
      ByteBuffer view = compressedBuffer.toDirectByteBuffer(0, maxCompressedSize);
      // compress directly into the mapped region; the OS writes it back to the file
      sizeWritten = _chunkCompressor.compress(_chunkBuffer, view);
    }

This may look better as it doesn't need an extra byte buffer for compression, but since the size of the chunk is very small - 1000 * data type size (8 bytes for long) - memory mapping degrades performance [1]. We experimented a bit with the segments of the problematic table, and it turned out that even with an SSD the segment build takes more than 30% longer. With an HDD it's much worse: more than 30x longer (one minute using the interim byte buffer vs. more than 30 minutes with memory mapping).

[1] From the Oracle documentation: "For most operating systems, mapping a file into memory is more expensive than reading or writing a few tens of kilobytes of data via the usual read and write methods. From the standpoint of performance it is generally only worth mapping relatively large files into memory." https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,%20long,%20long)
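
To make the overhead concrete, here is a rough, self-contained timing sketch (my own illustration, not Pinot code; single-shot, no JMH or warmup, and the class name, chunk count, and file handling are made up). It writes the same small chunks either through a fresh mapping per chunk or through one reusable direct buffer and positional writes:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Rough single-shot timing: write NUM_CHUNKS small chunks either by mapping a fresh
    // region of the file per chunk, or by reusing one direct buffer with positional writes.
    public class SmallChunkWriteDemo {
      private static final int CHUNK_SIZE = 8 * 1024;   // ~1000 longs, as in the fixed-width writer
      private static final int NUM_CHUNKS = 20_000;

      public static void main(String[] args) throws IOException {
        Path mappedFile = Files.createTempFile("mapped", ".bin");
        Path writtenFile = Files.createTempFile("written", ".bin");
        mappedFile.toFile().deleteOnExit();
        writtenFile.toFile().deleteOnExit();

        long t0 = System.nanoTime();
        try (FileChannel ch = FileChannel.open(mappedFile, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
          for (int i = 0; i < NUM_CHUNKS; i++) {
            // a new mapping per chunk, analogous to the refactored writeChunk
            MappedByteBuffer chunk = ch.map(FileChannel.MapMode.READ_WRITE, (long) i * CHUNK_SIZE, CHUNK_SIZE);
            fill(chunk);
          }
        }
        long mapMillis = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        ByteBuffer reusable = ByteBuffer.allocateDirect(CHUNK_SIZE);
        try (FileChannel ch = FileChannel.open(writtenFile, StandardOpenOption.WRITE)) {
          for (int i = 0; i < NUM_CHUNKS; i++) {
            reusable.clear();
            fill(reusable);
            reusable.flip();
            // positional write from the reusable buffer, analogous to the old writeChunk
            ch.write(reusable, (long) i * CHUNK_SIZE);
          }
        }
        long writeMillis = (System.nanoTime() - t1) / 1_000_000;

        System.out.printf("map per chunk: %d ms, reusable buffer: %d ms%n", mapMillis, writeMillis);
      }

      private static void fill(ByteBuffer buf) {
        while (buf.remaining() >= Long.BYTES) {
          buf.putLong(42L);
        }
      }
    }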

sajjad-moradi commented 2 years ago

@kishoreg @richardstartin @Jackie-Jiang @mcvsubbu I'm going to create a PR to basically revert the code changes to BaseChunkSVForwardIndexWriter. Please take a look, as this has blocked our deployment at LinkedIn.

richardstartin commented 2 years ago

@sajjad-moradi the purpose of the change was to reduce the amount of memory required for variable length data when some of the values are very long, which results in very large buffers. An impact on build time was expected, but the impact on small fixed size chunks wasn’t considered. Memory mapping 4-8KB at a time doesn’t make sense and will result in very high syscall overhead, so this is less than ideal. We now have the V4 raw format for variable length data, so the benefits (reduced memory consumption) of the change are no longer important, so it makes sense to me to reintroduce the compressed chunk buffer. However, some of the changes in that PR were important, so the change should be limited to reintroducing the compression buffer and removing the memory mapping per chunk.
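
A minimal sketch of what limiting the change to that could look like, reusing the field names from the snippets above (the _chunkSize field standing in for however the writer tracks the uncompressed chunk capacity is my own placeholder; this is not the actual fix PR):

    // allocated once (e.g. in the constructor) and reused across chunks; sized by the
    // compressor's bound so that any compressed chunk is guaranteed to fit
    _compressedBuffer = ByteBuffer.allocateDirect(_chunkCompressor.maxCompressedSize(_chunkSize));

    protected void writeChunk() {
      // compress into the reusable direct buffer instead of a per-chunk memory mapping
      int sizeToWrite = _chunkCompressor.compress(_chunkBuffer, _compressedBuffer);
      // positional write of the compressed bytes into the index file
      _dataFile.write(_compressedBuffer, _dataOffset);
      _dataOffset += sizeToWrite;
      // clear both buffers so they can be reused for the next chunk
      _compressedBuffer.clear();
      _chunkBuffer.clear();
    }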

sajjad-moradi commented 2 years ago

Yes, as I mentioned earlier, only BaseChunkSVForwardIndexWriter needs to be reverted, not all the changes in that PR. The fix PR is linked here.

richardstartin commented 2 years ago

For some context about the change and what this will go back to: if you happen to have an outlier record of, say, 1MB (which isn't that large) in a set of JSON records averaging 10KB, you need ~1GB for the chunk buffer and then ~2GB for the compression buffer. Saving 2GB per raw index build is a really nice improvement, if that's how you're using Pinot. Perhaps the problem is sharing the logic, which isn't particularly complicated, between fixed and variable length data sources?
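
For what it's worth, spelling out where those numbers come from (assuming roughly 1,000 values per chunk, in line with the chunk sizing mentioned earlier in the thread, and the old 2x sizing of the compression buffer):

    chunk buffer:       ~1,000 values/chunk x 1 MB max value length = ~1 GB
    compression buffer: ~2 x chunk buffer                           = ~2 GB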

mcvsubbu commented 2 years ago

If you want to allocate memory in the realtime context, please use the class RealtimeIndexOffheapMemoryManager. It makes sure memory is allocated as per the configuration set by the admin (memory mapped, direct, etc.). It is also accounted for, so it can be used for provisioning purposes.

mcvsubbu commented 2 years ago

> For some context about the change and what this will go back to: if you happen to have an outlier record of, say, 1MB (which isn't that large) in a set of JSON records averaging 10KB, you need ~1GB for the chunk buffer and then ~2GB for the compression buffer. Saving 2GB per raw index build is a really nice improvement, if that's how you're using Pinot. Perhaps the problem is sharing the logic, which isn't particularly complicated, between fixed and variable length data sources?

Yes, we should not modify general logic for outlier use cases. In this case, we had a production issue, had to revert the deployment, and spent multiple days reproducing the problem, narrowing down the commit, and then identifying the problem within that commit.

richardstartin commented 2 years ago

> If you want to allocate memory in the realtime context, please use the class RealtimeIndexOffheapMemoryManager. It makes sure memory is allocated as per the configuration set by the admin (memory mapped, direct, etc.). It is also accounted for, so it can be used for provisioning purposes.

Before my change ByteBuffer.allocateDirect was used, but I'll make a note of this.

> Yes, we should not modify general logic for outlier use cases. In this case, we had a production issue, had to revert the deployment, and spent multiple days reproducing the problem, narrowing down the commit, and then identifying the problem within that commit.

I apologise for the inconvenience caused by the change, but it's anything but an outlier use case. The new-user experience has to be considered: if a new user stores text (web scraping, XML, JSON - typically with no control over max length) in a raw index, they will find it requires a lot of memory. How do we prevent that new user from having a bad experience while also not discovering inadvertent regressions only in your production environment? Simply not making changes would avoid regressions in your production environment, but it leaves new users whose workloads are slightly different from yours out in the cold. To avoid a repeat incident, firstly, we need better performance tests so that unconsidered code paths can't regress without anyone knowing. Secondly, I repeat the suggestion to decouple fixed-width and variable-width storage, because the implementation has a downside either way as long as it's shared.

mcvsubbu commented 2 years ago

> If you want to allocate memory in the realtime context, please use the class RealtimeIndexOffheapMemoryManager. It makes sure memory is allocated as per the configuration set by the admin (memory mapped, direct, etc.). It is also accounted for, so it can be used for provisioning purposes.

> Before my change ByteBuffer.allocateDirect was used, but I'll make a note of this.

I meant if you wanted to change anything in the realtime memory allocation path. The code has evolved, and more and more special casing has been added since the memory manager was introduced; not all changes have followed that approach, perhaps because it did not matter either way. As we move forward with changes, we should keep that in mind. Now that you have made a note of it, I am fine, thanks.

> Yes, we should not modify general logic for outlier use cases. In this case, we had a production issue, had to revert the deployment, and spent multiple days reproducing the problem, narrowing down the commit, and then identifying the problem within that commit.

> I apologise for the inconvenience caused by the change, but it's anything but an outlier use case. The new-user experience has to be considered: if a new user stores text (web scraping, XML, JSON - typically with no control over max length) in a raw index, they will find it requires a lot of memory. How do we prevent that new user from having a bad experience while also not discovering inadvertent regressions only in your production environment? Simply not making changes would avoid regressions in your production environment, but it leaves new users whose workloads are slightly different from yours out in the cold. To avoid a repeat incident, firstly, we need better performance tests so that unconsidered code paths can't regress without anyone knowing. Secondly, I repeat the suggestion to decouple fixed-width and variable-width storage, because the implementation has a downside either way as long as it's shared.

Perhaps we differ a bit here. A new user of today is just starting off. We must definitely be cognizant of revenue-impacting production use cases already running on the system. Today's new user will be tomorrow's production user, and will not be happy if a new release breaks something in production :-) So, we should be careful if we consider new use cases to be the most common ones.

That being said, there should be a balance. I am definitely not suggesting not making any changes, and I believe you know that :-). Some of these (especially performance bugs) are hard to catch before going live. We do have performance testing in our pipeline, but this was not caught; exactly one out of 150+ use cases happened to break :-), so I would go as far as to claim that it is impossible to guarantee that no regressions happen. We all commit code, do the best we can to be correct, and learn along the way.

If you can please work with @sajjad-moradi and arrive at a common solution, that will be appreciated. Thanks.

richardstartin commented 2 years ago

I think the best path forward is to merge the PR with the amendment to use the compression bound instead of 2x the chunk size, and then find a way to decouple the fixed- and variable-byte raw index versions so that V4 can become the default for raw text without affecting the raw fixed-width index.
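
In code terms, the sizing part of that amendment is just a question of where the reusable buffer's capacity comes from; a rough before/after sketch, where chunkSize is a placeholder and this is not the literal diff:

    // old sizing: assume the compressed chunk never exceeds twice the uncompressed chunk size
    _compressedBuffer = ByteBuffer.allocateDirect(chunkSize * 2);

    // amended sizing: ask the compressor for its worst-case bound for this chunk size
    _compressedBuffer = ByteBuffer.allocateDirect(_chunkCompressor.maxCompressedSize(chunkSize));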

sajjad-moradi commented 2 years ago

Yeah, let's merge the PR, and then I'll create an issue to handle the case of large buffer sizes.

richardstartin commented 2 years ago

V4 solves the buffer size problem

sajjad-moradi commented 2 years ago

@richardstartin is V4 currently being used, or is it going to be used in the future?

richardstartin commented 2 years ago

@sajjad-moradi yes, but it requires field-level config. I will open an issue to discuss making it the default.

sajjad-moradi commented 2 years ago

The PR is merged. Closing the issue.