Yneth closed this 2 years ago
Thanks for submitting this. I think I would prefer to fix Dirigiste, if possible. PRs can be submitted at https://github.com/clj-commons/dirigiste
At first glance, why isn't this being garbage-collected? Are you collecting stats yourself somehow? 35k seems like an awful lot of stats objects.
@KingMob for some reason stats are not disposed. My fix was to implement Pool without stats at all. Unfortunately I do not have much time but I'll try to check what could be the reason.
@Yneth Can you share a minimal test case reproducing it?
Are you sure the 35'769 arrays of long are all referencing dirigiste? I don't ask you to check all of them, obviously, but can you randomly pinpoint some places to ensure it's really the case.
I'm asking this because a reservoir can contain at most 2048 elements (the entry you showed), so:
35'769 × 16'408 ÷ 1024 = 573 MiB
and the value is twice as big.
@Yneth Even if you don't have a test case, did you save the profiling session? It would be nice to take a look at it.
@KingMob no, but I am sure that it causes problems. My fix was to create a custom pool implementation without metrics, i.e. without the ones that use Reservoir.
It is a bit hard to create minimal test as I need to do some investigation beforehand and I am short on time.
My case looks the following way: 5 pools, one per proxy, and 1 additional pool without any proxy config, running HTTP requests all day. It takes around 6 hours before problems arise.
Also I see an uncaught error warning from manifold even though I am using m/catch everywhere; just saying in case it could also be a root cause.
@KingMob I guess giving the possibility to specify the whole pool would be nice, but it would require making the create-connection function public. Not sure if it is OK to expose this, so I decided to go with the simplest draft solution.
@KingMob I guess giving the possibility to specify the whole pool would be nice, but it would require making the create-connection function public. Not sure if it is OK to expose this, so I decided to go with the simplest draft solution.
I'm generally OK with making it public, but leaving it out of the generated documentation via ^:no-doc
if needed. Most people won't need to replace the pool.
Also, if you can generate a memory profile, we'd still really appreciate it. It's harder to debug this without one.
@Yneth Any updates? After doing some work with Dirigiste recently, I want your PR.
will send an update later today
Any update from your side?
What we actually need from you to determine if the issue is related to io.aleph.dirigiste.Stats$** is the number of those instances when your memory starts growing.
I wasn't able to reproduce on my side but the expected pattern is this:
I have three io.aleph.dirigiste.Executor (4 reservoirs each) + one io.aleph.dirigiste.Pool (4 reservoirs), i.e. 16 reservoirs. Those are actually capped to 32'892 bytes each (4096 * 8 bytes = 32'768, plus a few other references).
- How many UniformDoubleReservoir do you have?
- How many io.aleph.dirigiste.Executor do they reference?
- How many io.aleph.dirigiste.Pool do they reference?

From the screenshot you provided, we can only determine that a lot of longs have been allocated on the heap, but not what the real issue is.
Ok.. ok... I'm able to create a pattern where it can be problematic. With the following code, I'm able to create tons of Reservoirs:
(require '[aleph.http :as http])
(require '[manifold.deferred :as d])
(require '[clj-commons.byte-streams :as bs])

(comment
  (http/start-server (constantly {:body "hello!"}) {:port 8081})

  (defn get [nb]
    (apply d/zip (mapv (fn [_] (d/chain (http/get (format "http://%s.localhost:8081/" (java.util.UUID/randomUUID)))
                                        :body
                                        bs/to-string)) (range nb))))

  (d/loop [d (get 10)]
    (d/chain' d
              (fn [_]
                (prn "Done!")
                (Thread/sleep 300)
                (d/recur (get 10000))))))
Each different domain creates a new Reservoir.
They are supposed to be cleaned up by the following code:
// clear out any unused queues
_lock.lock();
for (Map.Entry<K,Stats> entry : _stats.entrySet()) {
K key = entry.getKey();
if (entry.getValue().getUtilization(1) == 0
&& _queues.get(key).objects.get() == 0) {
_queues.remove(key).shutdown();
// clean up stats so they don't remain in memory forever
_queueLatencies.remove(key);
_taskLatencies.remove(key);
_queueLengths.remove(key);
_utilizations.remove(key);
_taskArrivalRates.remove(key);
_taskCompletionRates.remove(key);
_taskRejectionRates.remove(key);
}
}
_lock.unlock();
https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L331-L332
It does its job when the conditions are met. I was putting the system under stress here, but the issue might be somewhere along those lines of code, where some objects are not removed from the queues.
https://github.com/clj-commons/aleph/blob/master/src/aleph/http.clj#L321-L375
I would be interested to know if you have the same pattern @Yneth.
Just a hunch: not a bug in Dirigiste, but most likely in the Aleph client.
And as @Yneth mentioned this:
Also I see an uncaught error warning from manifold even though I am using m/catch everywhere; just saying in case it could also be a root cause.
@KingMob : I would be interested to know your opinion about my assumptions.
@arnaudgeiser Your test case, that crawls a lot of domains, reminds me a lot of:
As for the objects not being cleaned up:
- aleph.http/request will 99% of the time release, not dispose. Most normally-ended HTTP/1.1 connections are "Connection: keep-alive", so they won't meet the criteria for disposal.
- getUtilization(1) is checking the 100th percentile of the utilization stats. This means that all utilizations in the stats time window must be 0. The reservoir size is 4096 items, and if the sample period is the default 10ms, then we should expect it to take ~41s of no usage for the reservoir to become all zeros. If looking at the memory usage too early is the problem, then lowering the sample period may help. It might be worth checking for bugs here, though; I know there have been some utilization calc issues in the past. (clj-commons/dirigiste#23)

OK, I misread a couple things.
Your code is recurring with another 10000 domains. In my tests, this works out to ~770k per domain. Still highish, but there are 5 major maps of domain->reservoir, and each reservoir has an array of 4096 longs/doubles, so we have 10010 x 5 x 4096 x 8 bytes = ~1.6 GB. 770k per domain is still high, though; I know Dirigiste is tunable, we could probably make it consume less.
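For reference, here's that arithmetic as a quick standalone check (the 10010 domains, 5 maps and 4096-slot reservoirs are just the numbers from the paragraph above):

```java
public class HeapEstimate {
    public static void main(String[] args) {
        long domains = 10_010;        // 10 initial + 10000 recurring domains
        long maps = 5;                // domain->reservoir maps kept per pool
        long slots = 4096;            // RESERVOIR_SIZE
        long bytesPerSlot = 8;        // each slot is a long/double

        long bytes = domains * maps * slots * bytesPerSlot;
        System.out.printf("~%.2f GB%n", bytes / 1e9); // ~1.64 GB
    }
}
```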
I also took a closer look at the utilizations, which is where it gets interesting. With one conn per domain, it's continually computing the utilization level as 2 (if I'm computing Double/longBitsToDouble correctly). In the code, it seems that queueLength is always 1, available is always 0, and objects is always 0. (Not sure what objects represents... I think it's the number of free objects.) Anyway, that means double utilization = 1.0 - ((available - queueLength) / Math.max(1.0, objects)); is always 2.
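Just to make that concrete, here is the quoted expression evaluated standalone with the observed values (this is not Dirigiste code, only the formula lifted out for a quick check):

```java
public class UtilizationRepro {
    // The utilization expression quoted above, extracted for a standalone check
    static double utilization(int available, int queueLength, int objects) {
        return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
    }

    public static void main(String[] args) {
        // One conn per domain: queueLength = 1, available = 0, objects = 0
        System.out.println(utilization(0, 1, 0)); // 2.0, matching the observed value
    }
}
```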
It feels like there's a bug here. Regardless of whether the conn is released or disposed, you'd think an unused connection shouldn't count towards utilization. I'm not sure if it's an edge artifact of having exactly 1 item per domain, or if this has always been a problem, and that's just where we see it first, when testing zillions of domains.
Just forced it to make 3 conns per domain, and all I got was endless utilization values of 4. I need to set this aside for now.
Even if it's a leak, it mostly affects people talking to many different domains, so I'm not sure how high priority it is.
Do not get me wrong, everything is fine with my test case. I'm only in trouble because I'm overwhelming the pool, but eventually, everything will be garbage collected if I stop doing requests. I just wanted to demonstrate a situation where a lot of memory could be allocated on the heap.
This pattern, which is not something we should expect from a production system, is a situation where we could use a lot of memory.
But again, without a heap dump here, we are stuck making hypotheses which might be wrong.
Just forced it to make 3 conns per domain, and all I got was endless utilization values of 4. I need to set this aside for now.
Ah yes, interesting lead here.
I will spend some time writing unit tests for Dirigiste because some parts of the code seem to be off, really.
A few examples:
public void sample(double n) {
    int cnt = _count.incrementAndGet();
    if (cnt <= RESERVOIR_SIZE) {
        // reservoir not full yet: write into the next free slot
        _values.set(cnt-1, Double.doubleToLongBits(n));
    } else {
        // reservoir full: pick a random index in [0, cnt) and only keep the
        // sample if that index happens to fall inside the reservoir
        int idx = ThreadLocalRandom.current().nextInt(cnt);
        if (idx < RESERVOIR_SIZE) {
            _values.set(idx, Double.doubleToLongBits(n));
        }
    }
}
We can overflow here
(def atomic-integer (java.util.concurrent.atomic.AtomicInteger. Integer/MAX_VALUE))
(.incrementAndGet atomic-integer) ;; -2147483648
(.incrementAndGet atomic-integer) ;; -2147483647
Then, it will just blow up the Reservoir:
(def atomic-long-array (java.util.concurrent.atomic.AtomicLongArray. 4096))
(.set atomic-long-array -2147483648 (Double/doubleToLongBits 0.5)) ;; ==> ArrayIndexOutOfBoundsException
Once the reservoir is filled, you have no guarantee the last idx (position 4095) will be set to the current utilization, as we randomly get a slot over the 4096 positions.
https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Stats.java#L61-L71
The last idx (position 4095) is important here because we rely only on this one to determine if the queue is no longer in use.
if (t == 1.0) {
return (double) vals[cnt-1];
}
https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Stats.java#L166-L168
Yeah, there's some weird stuff going on, with few tests, and a lot of that is why I'm not a fan of Dirigiste.
Based on a sampling period of 10ms, we should expect overflow and crash after ~8 months left untouched. However, the stats are meant to be replaced periodically. For both Executor and Pool, in startControlLoop, the stats reservoirs are replaced once each control loop by updateStats(). For Pool.updateStats, the .toMap() calls are actually destructive(!). It remove()s the keys as it builds the map. So theoretically, overflow isn't an issue, but I find Dirigiste's logic indirect enough that I wouldn't bet on that.
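As a sanity check on the ~8 months figure (assuming one sample() call every 10ms and a counter that is never reset):

```java
public class OverflowEstimate {
    public static void main(String[] args) {
        long samplePeriodMs = 10;                     // assumed sample period
        long samplesToOverflow = Integer.MAX_VALUE;   // AtomicInteger wraps after this many increments
        double days = samplesToOverflow * samplePeriodMs / 1000.0 / 86_400.0;
        System.out.printf("~%.0f days (~%.1f months)%n", days, days / 30); // ~249 days, ~8.3 months
    }
}
```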
Once the reservoir is filled, you have no guarantee the last idx (position 4095) will be set to the current utilization, as we randomly get a slot over the 4096 positions.
Actually, it's weirder than that. It picks a random number from 0 to cnt-1, and only if that's < 4096 does it store the sample at all. This has the effect of updating the reservoir less and less frequently as time goes on, until it's reset. The longer the control period, the less likely that later samples will have an effect. I'm not sure what the rationale is versus using a ring buffer, and the commit messages are no help.
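A quick simulation of that acceptance rule (the same random-index check as the sample() method shown earlier, just counted in isolation):

```java
import java.util.concurrent.ThreadLocalRandom;

public class ReservoirAcceptance {
    static final int RESERVOIR_SIZE = 4096;

    public static void main(String[] args) {
        // Probability that a sample is recorded once cnt samples have been seen,
        // i.e. P(nextInt(cnt) < RESERVOIR_SIZE) = RESERVOIR_SIZE / cnt for cnt >= RESERVOIR_SIZE
        int trials = 100_000;
        for (int cnt : new int[]{4_096, 8_192, 40_960, 409_600}) {
            int accepted = 0;
            for (int i = 0; i < trials; i++) {
                if (ThreadLocalRandom.current().nextInt(cnt) < RESERVOIR_SIZE) {
                    accepted++;
                }
            }
            System.out.printf("cnt=%d -> ~%.1f%% of samples recorded%n", cnt, 100.0 * accepted / trials);
        }
        // Roughly 100%, 50%, 10%, 1%: the longer the control period runs, the less likely
        // a new sample (e.g. a utilization that finally dropped to 0) ever lands in the reservoir.
    }
}
```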
That being said, using the value in the last position when looking at the percentiles is fine. Inside the reservoir's toArray() fns, you'll see it actually returns them sorted (!), so looking at the last element will always be the maximal utilization. Dirigiste isn't considering the most recent utilization value, it's considering the highest utilization in the control period.
The problem is that the utilization reservoir contains the same value everywhere, so it doesn't matter what percentile we look at. (The behavior above where later samples in the control period are less likely to be recorded is also a problem, but if the stats are properly reset, it should still work once we have a fallow period.)
So either a) the utilization is never being calculated correctly in our scenario (all scenarios?), or b) it's calculated correctly, but the stats aren't being reset, and thus by the time utilization drops to zero, the odds of it being recorded are low, and the odds of all non-zero samples being displaced is vanishingly low. I think it's more likely to be a; if b were true, I should be able to find other values somewhere, but I can't find any.
It's really annoying that .toArray and .toMap aren't simple transformation fns. Sorry you lost time on that.
Thanks for the explanations, I missed so many things here.
For both Executor and Pool, in startControlLoop, the stats reservoirs are replaced once each control loop by updateStats().
Woow... at least that explains the "high" allocation rate I was seeing.
For Pool.updateStats, the .toMap() calls are actually destructive(!). It remove()s the keys as it builds the map.
Right
Dirigiste isn't considering the most recent utilization value, it's considering the highest utilization in the control period.
Understood, I missed the fact that toArray sorts the values. :face_with_spiral_eyes:
The problem is that the utilization reservoir contains the same value everywhere, so it doesn't matter what percentile we look at. (The behavior above where later samples in the control period are less likely to be recorded is also a problem, but if the stats are properly reset, it should still work once we have a fallow period.)
~This is currently bugging me. There is a correlation with Pool.updateStats, but according to what you explain, it's like most of the time the 4096 positions are totally useless. I need to clear my head here.~
Nevermind, got it.
So either a) the utilization is never being calculated correctly in our scenario (all scenarios?), or b) it's calculated correctly, but the stats aren't being reset, and thus by the time utilization drops to zero, the odds of it being recorded are low, and the odds of all non-zero samples being displaced is vanishingly low. I think it's more likely to be a; if b were true, I should be able to find other values somewhere, but I can't find any.
Let's figure this out then!
Keep in mind, this is still low priority to my way of thinking. The Dirigiste pooling problem only applies to outgoing client connections, where talking to zillions of different servers is unlikely. The server uses Netty to pool incoming connections/channels, so it should be unaffected by Dirigiste's pool stats issue. The most likely scenario is a few outside servers are talked to, and a few MBs are wasted, but that's it.
If you're burning to fix it, I won't stop you tho.
Yes, I agree.
But some people are currently having issues with the io.aleph.dirigiste.Pool while it's unlikely they are currently talking to a lot of different servers.
By adding tests, I expect to see patterns where it could go wrong without having to communicate to thousands of different domains.
Going back to this PR, I'm still not sold on it, but I can see reasons why people don't want to use Dirigiste in the current situation. According to the discussions we just had, it's quite a heavy machine under the hood. Let's provide an escape hatch then!
I know you want to sunset Dirigiste, but I would prefer stabilizing it. The code base is not that huge and the rationale behind it is still good IMO.
If we have no news from @Yneth in the following days, I will take over that PR.
I agree that other memory leaks of Dirigiste should be fixed. I'm just afraid that the pool leak is a red herring for their other issues.
And while I'd like to sunset it, that's more of a "next version" goal. Replacing it in current Aleph makes me think of these lines from the movie State and Main:
... you come by next Sunday, we'll give you a second chance.
Sister, the only second chance I know is the chance to make the same mistake twice.
What kind of escape hatch are you thinking of, if you don't want to use this PR?
I agree that other memory leaks of Dirigiste should be fixed. I'm just afraid that the pool leak is a red herring for their other issues.
I'm also afraid of this!
What kind of escape hatch are you thinking of, if you don't want to use this PR?
We'll use this PR. But I am resigned, not convinced!
@Yneth Are you still interested in this?
Gentlemen, excuse me for the late response.
Regarding the PR, I do not like it. Almost everything I've been thinking of is not very user friendly. Regarding Dirigiste, I personally think it would be nice to fix it in the first place.
My use case involves querying up to 1000 hosts that change on a daily basis. Think of a service like https://check-host.net/ that tracks the statuses of services every X minutes.
I do not promise, but I will try to provide a heap dump later today.
The factory method of the pool is dependent on local variables of the API method, as well as some private functions. Of course there should be an opportunity to override it, but in a user-friendly way.
My temporary solution was to copy-paste some parts and roll out my own method that creates my own Dirigiste pool without stats. It works fine, and the PR itself is not high priority for me.
@Yneth Don't restrict yourself on where or how to fix this. We're open to the right solution, and if you've followed the conversation so far, the most likely correct place is where Dirigiste computes pool utilization and/or cleans up old pool objects.
Also, based on your initial screenshot, which shows the leak is from the Stats reservoirs, I'm pretty sure Arnaud and I are looking at the same leak, so I don't think we need a dump anymore. (Of course, if that's not your leak, please let us know.)
Also, based on your initial screenshot, which shows the leak is from the Stats reservoirs, I'm pretty sure Arnaud and I are looking at the same leak, so I don't think we need a dump anymore. (Of course, if that's not your leak, please let us know.)
In any case, the heap dump can help us diagnose whether we are making the correct assumptions. We'll be able to track the state of those Stats.
@Yneth : Are you hitting the same ports on those hosts?
https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L97
I spent hours adding (JUnit) tests to Dirigiste and I'm starting to have a better understanding of the situation and the meaning of those different values. (Consider what's below a partial understanding of the situation, but I need to share some thoughts; some still need to be validated.)
So..
Theoretically, this calculation is supposed to be at least partially correct:
double utilization = 1.0 - ((available - queueLength) / Math.max(1.0, objects));
- available: the objects that have been .acquire'd and then .release'd back to the pool.
- queueLength: the number of pending .acquire calls waiting to get an object from the pool. (there is still something I don't understand in the code regarding the pending _takes)
- objects: the total number of objects that have been created through .acquire (and not yet disposed).
Considering no objects are present on the pool at the beginning:

An utilization of 1 means we are using all the objects that have been created (available == 0):
- .acquire (1 object)
- .acquire => .release => .acquire (1 object)
- .acquire => .acquire => .acquire (3 objects)
- .acquire => .acquire => .dispose (1 object)
- .acquire => .dispose (0 objects, 0 available)

An utilization of 0.5 means we are currently using half of the objects that have been created (available * 2 == objects):
- .acquire => .acquire => .release (1 available, 2 objects)

An utilization of 0 means none of the objects on the pool are in use (available == objects, but not 0):
- .acquire => .acquire => .release => .release (2 objects)
- .acquire => .acquire => .release => .dispose (1 object)

To have an utilization > 1, we need to have pending .acquire calls and available == 0 (it will happen when controller.shouldIncrement returns false) [1]:
- .acquire (utilization 2)
- .acquire => .acquire (utilization 3)
- .acquire => .acquire => .acquire (utilization 4)

[1] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L269
PR : https://github.com/clj-commons/dirigiste/pull/32
I reached the conclusion that using a fixed value of 4096 is a waste.
static final int RESERVOIR_SIZE = 4096;
We should use controlPeriod/samplePeriod instead.
By default
PR : https://github.com/arnaudgeiser/dirigiste/pull/1/files
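To illustrate the waste with made-up but plausible numbers (the 25ms sample period and 10s control period below are just example values, not necessarily what Aleph configures):

```java
public class ReservoirSizing {
    public static void main(String[] args) {
        long samplePeriodMs = 25;       // example value
        long controlPeriodMs = 10_000;  // example value
        long needed = controlPeriodMs / samplePeriodMs;
        System.out.println("samples per control period: " + needed);          // 400
        System.out.println("slots allocated today:      " + 4096);            // RESERVOIR_SIZE
        System.out.println("slots that can never fill:  " + (4096 - needed)); // 3696
    }
}
```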
While it can be seen as an optimization, replacing the stats every second is not ideal. A Ring Buffer might be more suitable for this use case.
PR : https://github.com/arnaudgeiser/dirigiste/pull/2
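For the record, a minimal sketch of the ring-buffer idea (an illustration only, not the code from the PR above): keep the most recent N samples by overwriting the oldest slot, rather than replacing the whole Stats object every control period and randomly discarding late samples.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Illustration only: a fixed-size ring buffer keeping the most recent `size` samples.
public class RingReservoir {
    private final AtomicLongArray values;
    private final AtomicLong count = new AtomicLong();
    private final int size;

    public RingReservoir(int size) {
        this.size = size;
        this.values = new AtomicLongArray(size);
    }

    public void sample(double n) {
        // A long counter will not realistically overflow, and the modulo keeps the index in range,
        // so every sample is recorded and the newest always displaces the oldest.
        int idx = (int) (count.getAndIncrement() % size);
        values.set(idx, Double.doubleToLongBits(n));
    }

    public double[] toArray() {
        int cnt = (int) Math.min(count.get(), (long) size);
        double[] out = new double[cnt];
        for (int i = 0; i < cnt; i++) {
            out[i] = Double.longBitsToDouble(values.get(i));
        }
        Arrays.sort(out); // keep the "sorted for percentile lookup" behaviour of the current reservoirs
        return out;
    }
}
```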
I will invalidate the above information as I progress. I'm still missing a lot of tests on the Pool, but I wanted to write down my hypotheses first.
So either a) the utilization is never being calculated correctly in our scenario (all scenarios?), or b) it's calculated correctly, but the stats aren't being reset, and thus by the time utilization drops to zero, the odds of it being recorded are low, and the odds of all non-zero samples being displaced is vanishingly low. I think it's more likely to be a; if b were true, I should be able to find other values somewhere, but I can't find any.
Long story short: it's definitely a). When using an utilization controller, the objects will be removed as soon as they are not in use, which means we'll always end up in the concerning scenario (0 objects, 0 available) => utilization of 1.
The fix might be as simple as this:
double getUtilization(int available, int queueLength, int objects) {
if(objects==0 && queueLength==0) {
return 0;
}
return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
}
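And a quick before/after check with the empty-pool values from the scenarios above (plain Java, not the actual Dirigiste tests):

```java
public class UtilizationFixCheck {
    // Current behaviour
    static double before(int available, int queueLength, int objects) {
        return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
    }

    // Proposed behaviour: an empty, idle queue reports zero utilization
    static double after(int available, int queueLength, int objects) {
        if (objects == 0 && queueLength == 0) {
            return 0;
        }
        return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
    }

    public static void main(String[] args) {
        // 0 objects, 0 available, nothing queued
        System.out.println(before(0, 0, 0)); // 1.0 -> looks fully utilized, so the queue is never cleaned up
        System.out.println(after(0, 0, 0));  // 0.0 -> eligible for cleanup by the control loop
        // A pool with one object in use is unaffected
        System.out.println(after(0, 0, 1));  // 1.0
    }
}
```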
However, we are now subject to a race condition (youpi...). Our Queue, which we get on .acquire [1], might be destroyed at any moment by the startControlLoop [2], and we might end up with IllegalStateException... already shutdown. [3]
So close and so far at the same time.
The good news: I've been able to cover around 90% of the Pool code base.
[1] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L409
[2] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L333
[3] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L163
I finally reached the bottom of this.
Here is the related PR on Dirigiste : https://github.com/clj-commons/dirigiste/pull/32
The PR is ready for review. I applied @KingMob's comments.
Even if we fix the Dirigiste bugs, it would make sense to be able to implement your own Pool and Controller.
Sold!
Merged.
It might not be the best way to add such customisation, as it clashes with other fields (total-connections, target-utilization, max-queue-size, stats-callback) but it should work.
The rationale of this change is to add the possibility to pick your own pool implementation. In my case the Dirigiste pool does not work due to a memory leak in