elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Allocate less memory within the bulk API #104469

Open masseyke opened 8 months ago

masseyke commented 8 months ago

Problem Description

When a bulk API request is sent to a remote node, the node currently copies off a separate byte array for each sub-request within it, rather than reusing the bytes already present in the large byte array that was transferred to the node for the whole bulk request. This results in a lot of unnecessary garbage collection.

We could instead use the bytes in the large shared byte array directly, but to do so we need reference counting so that we know when the shared byte array can be released -- releasing it too early would mean corrupted requests; releasing it too late would lead to OutOfMemoryErrors.

This is going to require a good bit of work, and this ticket is meant as a placeholder to track all of that work from a single place. The idea is to begin by ref counting BulkRequest objects and work inward from there until we can safely change IndexRequest's source to be read with in.readReleasableBytesReference() (which reuses the underlying byte array) rather than in.readBytesReference() (which copies the bytes). The broad outline is:

  1. Make BulkRequest RefCounted, making sure that its ref count is always zero when it is garbage collected and never zero when still in use. (#104471)
  2. Do the same for BulkShardRequest
  3. Do the same for the individual requests used by BulkRequest and BulkShardRequest -- BulkItemRequest, IndexRequest, UpdateRequest, and ReindexRequest.
  4. Update IndexRequest to use the underlying shared byte array for its source

Each of those steps will be made up of several PRs.
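To illustrate the scheme the steps above build toward, here is a minimal sketch of the core ref-counting idea: several sub-requests hold references to one shared byte array, and the array is released exactly when the last reference is dropped. The `SharedBuffer` and `RefCountDemo` names are hypothetical; this is not the actual Elasticsearch `RefCounted` API, just a simplified model of the invariant described in step 1 (count never zero while in use, zero exactly once at the end).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simplified sketch (not Elasticsearch's real RefCounted):
// a shared byte array whose lifetime is governed by a reference count.
class SharedBuffer {
    private final AtomicInteger refCount = new AtomicInteger(1); // owner holds the initial ref
    private byte[] bytes;
    boolean released = false;

    SharedBuffer(byte[] bytes) {
        this.bytes = bytes;
    }

    void incRef() {
        // Taking a new ref after release would hand out a dead buffer.
        if (refCount.getAndIncrement() <= 0) {
            throw new IllegalStateException("already released");
        }
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            bytes = null;      // release the shared array exactly once
            released = true;
        }
    }

    byte[] bytes() {
        if (refCount.get() <= 0) {
            throw new IllegalStateException("already released");
        }
        return bytes;
    }
}

public class RefCountDemo {
    public static void main(String[] args) {
        byte[] bulkBody = "{\"index\":{}}\n{\"field\":1}\n".getBytes();
        SharedBuffer shared = new SharedBuffer(bulkBody);

        // Two sub-requests take refs on the shared buffer instead of copying it.
        shared.incRef();
        shared.incRef();

        shared.decRef();                       // first sub-request finished
        System.out.println(shared.released);   // false: a sub-request is still alive
        shared.decRef();                       // second sub-request finished
        shared.decRef();                       // owning bulk request finished
        System.out.println(shared.released);   // true: last ref dropped, array freed
    }
}
```

Releasing too early (a `decRef` before a sub-request is done) would let `bytes()` return a freed buffer, i.e. a corrupted request; never reaching zero would pin the large array in memory, which at scale is the OutOfMemoryError case the issue describes.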

elasticsearchmachine commented 8 months ago

Pinging @elastic/es-data-management (Team:Data Management)