uhyonc opened this issue 2 years ago
CC: @xiaoxmeng
CC: @oerling @tanjialiang
If there is concurrent high demand for pages for buffering IO while you are building a 1.2GB hash table, it can happen that 1.2GB gets evicted and then, before the allocation actually completes, somebody else allocates just enough memory that slightly less than the evicted size is available when we get to the allocation itself.
So, whatever gets evicted on behalf of an allocation should be held in reserve and made unavailable to anybody else. The structure would then be: when the evict is done, check that the free space is sufficient and block all other allocation for the duration. This is not very parallel, but it only applies to the relatively rare situation you are hitting.
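A minimal sketch of that scheme, with hypothetical names (this is not the actual Velox MappedMemory/AsyncDataCache API): bytes evicted on behalf of a pending large allocation are tracked as reserved, other allocators block while the reservation is outstanding, and the reservation is released once the large allocation completes.

```cpp
// Illustrative sketch only; class and member names are hypothetical and do
// not correspond to the real Velox cache/allocator interfaces.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class ReservingAllocator {
 public:
  // Evict on behalf of one large allocation and hold the freed bytes in
  // reserve so concurrent allocations cannot consume them.
  bool reserveForLargeAllocation(int64_t bytes) {
    std::lock_guard<std::mutex> l(mutex_);
    if (freeBytes_ < bytes) {
      freeBytes_ += evictLocked(bytes - freeBytes_);
    }
    if (freeBytes_ < bytes) {
      return false; // Could not free enough even after eviction.
    }
    freeBytes_ -= bytes;
    reservedBytes_ += bytes; // Held for this caller only.
    return true;
  }

  // Other allocations wait while a reservation is outstanding instead of
  // stealing the pages that were just evicted for the pending allocation.
  bool allocate(int64_t bytes) {
    std::unique_lock<std::mutex> l(mutex_);
    cv_.wait(l, [&] { return reservedBytes_ == 0; });
    if (freeBytes_ < bytes) {
      return false;
    }
    freeBytes_ -= bytes;
    return true;
  }

  // Called once the large allocation has actually been completed.
  void releaseReservation(int64_t bytes) {
    {
      std::lock_guard<std::mutex> l(mutex_);
      reservedBytes_ -= bytes;
    }
    cv_.notify_all();
  }

 private:
  // Placeholder for the cache's real eviction; returns the bytes freed.
  int64_t evictLocked(int64_t target) {
    return target;
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  int64_t freeBytes_ = 0;
  int64_t reservedBytes_ = 0;
};
```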
After trying some things out, here is what I've found.
1) In this query (for the join stage mentioned), the big fact table (catalog_sales) is getting hashed, which requires a lot of memory to hold it. (This is because our tables currently don't have stats, so the CBO isn't working well.)
2) If I disable the Buffered Input Cache, then we quickly get "INSUFFICIENT RESOURCES" (VeloxRuntimeError: Exceeded memory cap of 40.00GB when requesting 4.00MB).
3) With the Buffered Input Cache enabled, we are not able to detect hitting the per-node memory limits, which seems to cause thrashing and a bad_alloc exception before the cleaner "Exceeded memory cap" error.
4) Increasing the per-node memory cap for the query (e.g. to 70GB) seems to resolve this issue.
So I'm guessing this is a problem with our configs/data sizing rather than an issue with the code. Closing the ticket.
Firstly, Presto is known not to do many of the optimizer tricks TPC-DS requires. But even so, it should know better than to build on the fact table.
Secondly, running from cache is necessary for any benchmark. So we must stop other activity if there is a large allocation that is not making progress while concurrent small allocations are going on.
When you set the query limit to 70G and run with cache, what happens? What is the total data size and what is the total worker memory?
So, some more findings (after changing various configs):
node.memory_gb=80, query.max-memory-per-node=40GB (Buffered Input Cache on)
In Prestissimo, AsyncDataCache is used for mappedMemory_, and its size defaults to 40GB. When I increase this via "node.memory_gb=80", then even with caching on, we get:
VeloxRuntimeError: Exceeded memory cap of 40.00GB when requesting 4.00MB
node.memory_gb=80, query.max-memory-per-node=70GB (Buffered Input Cache on)
The query runs successfully.
node.memory_gb=80, query.max-memory-per-node=70GB (Buffered Input Cache off)
The query runs successfully. This is a bit faster (~10%) than the cached configuration, even compared to consecutive runs with the cache on.
Do the consecutive runs with cache actually hit? For example, is affinity on (session property)? If you have total memory set to 40 and query max to 80, then you will error out in the manner that you have seen. You could put a consistency check at startup to verify that query max is not over 90% of the max memory, i.e. the capacity for AsyncDataCache.
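A rough sketch of such a startup check, with hypothetical function and parameter names (not the actual Prestissimo configuration code):

```cpp
// Illustrative startup validation only; the parameters stand in for the
// values parsed from node.memory_gb and query.max-memory-per-node.
#include <cstdint>
#include <stdexcept>
#include <string>

void validateMemoryConfig(int64_t queryMaxMemoryBytes, int64_t nodeMemoryBytes) {
  // Refuse to start if the per-query cap exceeds 90% of total worker memory,
  // which is also the capacity available to the AsyncDataCache.
  const int64_t limit = nodeMemoryBytes / 10 * 9;
  if (queryMaxMemoryBytes > limit) {
    throw std::invalid_argument(
        "query.max-memory-per-node (" + std::to_string(queryMaxMemoryBytes) +
        " bytes) exceeds 90% of node memory (" +
        std::to_string(nodeMemoryBytes) + " bytes)");
  }
}
```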
Run the Presto CLI with --debug and see where the data for the fragments comes from: ramReadBytes, localReadBytes, storageReadBytes.
If you have cache hits and it is worse than no cache, then we'd like to see the profile: perf record --pid xxx for the worker.
What kind of platform do you run on? Local SSD? Data from S3 or HDFS?
Thanks
Orri
Do the consecutive runs with cache actually hit? For example, is affinity on (session property)?
Not sure, I'll check it out.
If you have total memory set to 40 and query max to 80, then you will error out in the manner that you have seen. You could put a consistency check at startup to verify that query max is not over 90% of the max memory, i.e. the capacity for AsyncDataCache. Run the Presto CLI with --debug and see where the data for the fragments comes from: ramReadBytes, localReadBytes, storageReadBytes. If you have cache hits and it is worse than no cache, then we'd like to see the profile: perf record --pid xxx for the worker.
I'll try out some of your suggestions here to get some more info.
What kind of platform do you run on? Local SSD? Data from S3 or HDFS?
Kubernetes on Debian Linux. Data is from HDFS.
Hi all,
We've been getting std::bad_alloc exceptions in TPC-DS Q4. The actual exception usually seems to happen in the JOIN stage between customer and catalog_sales (randomly in stages 22, 30, or 39), but not always.
The symptoms are that we see a high number of TaskManager::getResults calls... example:
This is followed by allocation backoffs. In one example, I saw 94 of the backoff log entries.
Finally, the memory-allocation-related failure... (the actual place of the memory allocation failure can differ)
My current theory is that the random backoff delay (caused by the high number of threads in allocation) is causing this issue, because the evict happens only after the delay. So if a high number of concurrent allocation requests pile up, we don't try the evict fast enough... e.g. this part of the code in AsyncDataCache.cpp
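For illustration, here is a simplified paraphrase of that suspected pattern (not the actual AsyncDataCache.cpp code; tryAllocate and evict are caller-supplied placeholders): the randomized backoff sleep runs before the evict, so when many threads pile up in this loop the eviction is delayed while memory pressure keeps growing.

```cpp
// Simplified illustration of the suspected allocate/backoff/evict ordering;
// this is not the real AsyncDataCache implementation.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <random>
#include <thread>

bool allocateWithBackoff(
    int64_t bytes,
    int32_t maxAttempts,
    const std::function<bool(int64_t)>& tryAllocate,
    const std::function<void(int64_t)>& evict) {
  std::mt19937 rng{std::random_device{}()};
  for (int32_t attempt = 0; attempt < maxAttempts; ++attempt) {
    if (tryAllocate(bytes)) {
      return true;
    }
    // The randomized backoff happens before eviction is attempted, so with
    // many concurrent allocators the evict comes "too late" and memory
    // pressure keeps building during the sleep.
    std::uniform_int_distribution<int32_t> jitterMs(
        1, 1 << std::min<int32_t>(attempt, 10));
    std::this_thread::sleep_for(std::chrono::milliseconds(jitterMs(rng)));
    evict(bytes);
  }
  return false;
}
```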
I will be adding more logging to gather more data around this point, but let me know if you have thoughts.
Thanks!
Explain plan for your reference: Explain_TPCDS_Q4.txt