facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

std::bad_alloc while running TPC-DS Q4 #3057

Open uhyonc opened 2 years ago

uhyonc commented 2 years ago

Hi all,

We've been getting std::bad_alloc exceptions in TPC-DS Q4. The exception usually seems to happen in the JOIN stage between customer and catalog_sales (randomly in stages 22, 30, or 39), but not always.

The symptom is that we see a high number of TaskManager::getResults calls, for example:

I1101 18:14:26.092751 147346 TaskManager.cpp:592] TaskManager::getResults <taskId>.39.0.3, 1, 75
I1101 18:14:26.122815 147347 TaskManager.cpp:96] TaskManager::acknowledgeResults <taskId>.7.0.12, 10, 213
I1101 18:14:26.123914 152670 PrestoExchangeSource.cpp:90] Fetching data from <Host:Port><taskId>.7.0.1/results/3/257

These are followed by allocation backoffs. In one example, I saw 94 of these backoff log entries:

I1101 18:14:26.347635 152677 AsyncDataCache.cpp:587] Backoff in allocation contention for 16920 us.
I1101 18:14:26.347688 152670 AsyncDataCache.cpp:587] Backoff in allocation contention for 7425 us.
I1101 18:14:26.347718 152654 AsyncDataCache.cpp:587] Backoff in allocation contention for 38290 us.
I1101 18:14:26.347733 152684 AsyncDataCache.cpp:587] Backoff in allocation contention for 8619 us.

Finally, the memory allocation failure itself (the exact place of the failure can vary):

W1101 17:25:32.659839 143603 MmapAllocator.cpp:251] Could not advise away  enough for 294912 pages for allocateContiguous
E1101 17:25:32.735970 145039 Exceptions.h:68] Line: ../../velox/velox/exec/HashTable.cpp:505, Function:allocateTables, Expression:  Could not allocate join/group by hash table, Source: RUNTIME, ErrorCode: INVALID_STAT

My current theory is that the random backoff delay (caused by a high number of threads contending in allocation) is causing this issue, because the evict only happens after the delay. So if a high number of concurrent allocation requests pile up, we don't attempt the evict fast enough, e.g. this part of the code in AsyncDataCache.cpp:

    if (rank) {
      backoff(nthAttempt + rank);
    }
    ++shardCounter_;
    // Evict from next shard. If we have gone through all shards once
    // and still have not made the allocation, we go to desperate mode
    // with 'evictAllUnpinned' set to true.
    shards_[shardCounter_ & (kShardMask)]->evict(
        numPages * sizeMultiplier * MappedMemory::kPageSize,
        nthAttempt >= kNumShards);
    if (numPages < kSmallSizePages && sizeMultiplier < 4) {
      sizeMultiplier *= 2;
    }
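
Restated in a simplified, self-contained form (the helper names below are placeholders, not the actual Velox API), the ordering concern looks like this:

    // Placeholder declarations, only so the sketch compiles; the real logic
    // lives in AsyncDataCache / MmapAllocator.
    #include <chrono>
    #include <cstdint>
    #include <thread>

    constexpr int kNumShards = 4;                                   // placeholder
    bool tryAllocate(int64_t bytes);                                // placeholder
    void evictFromNextShard(int64_t bytes, bool evictAllUnpinned);  // placeholder
    std::chrono::microseconds backoffDelay(int attempt);            // placeholder

    bool allocateWithRetries(int64_t bytes, int rank, int maxAttempts) {
      for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (tryAllocate(bytes)) {
          return true;
        }
        if (rank) {
          // The random backoff runs first; eviction only happens after the
          // sleep, so with many threads queued here nothing gets freed for a while.
          std::this_thread::sleep_for(backoffDelay(attempt + rank));
        }
        evictFromNextShard(bytes, /*evictAllUnpinned=*/attempt >= kNumShards);
      }
      return false;
    }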

I will be adding more logging to gather more data around this point, but let me know if you have thoughts.

Thanks!

Explain plan for your reference: Explain_TPCDS_Q4.txt

mbasmanova commented 2 years ago

CC: @xiaoxmeng

mbasmanova commented 2 years ago

CC: @oerling @tanjialiang

oerling commented 2 years ago

If you have concurrent high demand for pages for buffering IO while making a 1.2GB hash table, it can happen that the 1.2GB gets evicted and then, before the allocation is actually completed, somebody else allocates just enough memory that there is a little less than the evicted size available when we get to the allocation itself.

So whatever gets evicted on behalf of an allocation should be held in reserve and not made allocatable to anybody else. The structure is then probably that when the evict is done, we check that the free space is sufficient and then block all other allocation for the duration. This is not very parallel, but it only applies to the relatively rare situation you are hitting.
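
One possible shape for that reservation (a hypothetical sketch, not an existing Velox API; the class and method names are made up) is to credit evicted bytes to the pending allocation and hide them from every other allocator until the allocation completes:

    #include <cstdint>
    #include <mutex>

    // Hypothetical bookkeeping: bytes evicted on behalf of a large allocation
    // are reserved for it, so concurrent small allocations cannot consume them.
    class EvictionReservation {
     public:
      // The large allocation records what it has evicted on its own behalf.
      void reserve(int64_t bytes) {
        std::lock_guard<std::mutex> l(mutex_);
        reservedBytes_ += bytes;
      }

      // Every allocator asks how much of the free pool it may actually use;
      // only the reservation owner sees the reserved portion.
      int64_t availableFor(bool isReservationOwner, int64_t freeBytes) const {
        std::lock_guard<std::mutex> l(mutex_);
        return isReservationOwner ? freeBytes : freeBytes - reservedBytes_;
      }

      // Called once the large allocation has completed or given up.
      void release() {
        std::lock_guard<std::mutex> l(mutex_);
        reservedBytes_ = 0;
      }

     private:
      mutable std::mutex mutex_;
      int64_t reservedBytes_{0};
    };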

uhyonc commented 2 years ago

After trying some things out, here is what I've found:

1) In this query (for the join stage mentioned), the big fact table (catalog_sales) is the one getting hashed, which requires a lot of memory to hold it. (This is because our tables currently don't have stats, so the CBO isn't working well.)
2) If I disable the Buffered Input Cache, we quickly get INSUFFICIENT RESOURCES (VeloxRuntimeError: Exceeded memory cap of 40.00GB when requesting 4.00MB).
3) With the Buffered Input Cache enabled, we don't seem to detect that we are hitting the per-node memory limit, which appears to cause thrashing and the std::bad_alloc exception before the cleaner "Exceeded memory cap" error.
4) Increasing the per-node memory cap for the query (e.g. to 70GB) resolves the issue.

So I'm guessing this is a problem with our configs/data sizing rather than an issue with the code, so I'm closing the ticket.

oerling commented 2 years ago

Firstly, Presto is known not to do many of the optimizer tricks TPC-DS requires. But even so, it should know better than to build on the fact table.

Secondly, running from cache is necessary for any benchmark. So we must stop other activity if there is a large allocation that does not make progress while there are concurrent small ones.

When you set the query limit to 70G and run with cache, what happens? What is the total data size and what is the total worker memory?

uhyonc commented 2 years ago

So some more findings (after changing various configs)

node.memory_gb=80, query.max-memory-per-node=40GB (Buffered Input Cache on)

In Prestissimo, AsyncDataCache is used for mappedMemory_, and its size defaults to 40GB. When I increase it via node.memory_gb=80, then even with caching on, we get VeloxRuntimeError: Exceeded memory cap of 40.00GB when requesting 4.00MB.

node.memory_gb=80, query.max-memory-per-node=70GB (Buffered Input Cache on)

The query runs successfully.

node.memory_gb=80, query.max-memory-per-node=70GB (Buffered Input Cache off)

The query runs successfully. This runs a bit faster (~10%) than with caching, even compared against consecutive runs with the cache on.

oerling commented 2 years ago

Do the consecutive runs with cache actually hit? For example, is affinity on? (session property). If you have total memory set to 40 and query max to 80, then you will error out in the manner that you have seen. You could put a consistency check at startup to verify that query max is not over 90% of the max memory, i.e. the capacity for AsyncDataCache.
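
As a hypothetical sketch (not an existing Prestissimo check; the function name and message wording are made up), such a startup consistency check could look like this:

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Fail fast at worker startup if the per-query cap exceeds 90% of the memory
    // configured for the node (i.e. the AsyncDataCache capacity).
    void checkMemoryConfig(int64_t queryMaxMemoryPerNodeBytes, int64_t nodeMemoryBytes) {
      if (queryMaxMemoryPerNodeBytes > nodeMemoryBytes * 9 / 10) {
        throw std::invalid_argument(
            "query.max-memory-per-node (" + std::to_string(queryMaxMemoryPerNodeBytes) +
            " bytes) must not exceed 90% of node memory (" +
            std::to_string(nodeMemoryBytes) + " bytes)");
      }
    }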

Run the Presto CLI with --debug and see where the data for the fragments comes from: ramReadBytes, localReadBytes, storageReadBytes.

If you have cache hits and it is still worse than no cache, then we'd want to see the profile: perf record --pid xxx for the worker.

What kind of platform do you run on? Local ssd? Data from S3 or HDFS?

Thanks

Orri

uhyonc commented 2 years ago

Do the consecutive runs with cache actually hit? For example, is affinity on? (session property).

Not sure, I'll check it out.

If you have total memory set to 40 and query max to 80, then you will error out in the manner that you have seen. You could put a consistency check at startup to verify that query max is not over 90% of the max memory, i.e. the capacity for AsyncDataCache. Run the Presto CLI with --debug and see where the data for the fragments comes from: ramReadBytes, localReadBytes, storageReadBytes. If you have cache hits and it is still worse than no cache, then we'd want to see the profile: perf record --pid xxx for the worker.

I'll try out some of your suggestions here to get some more info.

What kind of platform do you run on? Local ssd? Data from S3 or HDFS?

Kubernetes on Debian Linux. Data is from HDFS.