clj-commons / aleph

Asynchronous streaming communication for Clojure - web server, web client, and raw TCP/UDP
http://aleph.io
MIT License

add pool customization options #602

Closed Yneth closed 2 years ago

Yneth commented 2 years ago

It might not be the best way to add such customisation, as it clashes with other fields (total-connections, target-utilization, max-queue-size, stats-callback), but it should work.

The rationale for this change is to make it possible to pick your own pool implementation. In my case the Dirigiste pool does not work due to a memory leak, shown below:

[screenshot: heap profile]
KingMob commented 2 years ago

Thanks for submitting this. I think I would prefer to fix Dirigiste, if possible. PRs can be submitted at https://github.com/clj-commons/dirigiste

At first glance, why isn't this being garbage-collected? Are you collecting stats yourself somehow? 35k seems like an awful lot of stats objects.

Yneth commented 2 years ago

@KingMob for some reason the stats are not disposed of. My fix was to implement a Pool without stats at all. Unfortunately I do not have much time, but I'll try to check what the reason could be.

KingMob commented 2 years ago

@Yneth Can you share a minimal test case reproducing it?

arnaudgeiser commented 2 years ago

Are you sure the 35'769 arrays of long all belong to Dirigiste? I'm not asking you to check all of them, obviously, but can you randomly pinpoint a few to make sure that's really the case?

arnaudgeiser commented 2 years ago

I'm asking this because a reservoir can contain at most 2048 elements (the entry you showed), so:

35769 × 16408 bytes ≈ 573'142 KiB ≈ 560 MiB

and the observed value is roughly twice that.
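
For reference, the same arithmetic redone at the REPL (the ~16'408 bytes per array comes from the entry in the screenshot above):

(/ (* 35769 16408) 1024.0 1024.0) ;; => ~559.7 MiB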

KingMob commented 2 years ago

@Yneth Even if you don't have a test case, did you save the profiling session? It would be nice to take a look at it.

Yneth commented 2 years ago

@KingMob no, but I am sure that it causes problems. My fix was to create a custom pool implementation without the metrics, i.e. the ones that use Reservoir.

It is a bit hard to create a minimal test, as I need to do some investigation beforehand and I am short on time.

My setup looks like this: 5 pools, one per proxy, plus 1 additional pool without any proxy config, running HTTP requests all day. It takes around 6 hours before the problems arise.
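
For context, a rough Clojure sketch of that kind of setup with Aleph's client (the proxy hosts and ports below are made up; the real configuration isn't shown in this thread):

(require '[aleph.http :as http])

;; one connection pool per proxy...
(def proxied-pools
  (into {}
        (for [proxy [{:host "proxy-1.example" :port 3128}
                     {:host "proxy-2.example" :port 3128}]]
          [proxy (http/connection-pool
                   {:connection-options {:proxy-options proxy}})])))

;; ...plus one pool without any proxy config
(def default-pool (http/connection-pool {}))

;; requests then pick a pool explicitly
@(http/get "http://example.com" {:pool default-pool})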

Also, I see an "uncaught error" warning from Manifold even though I am using m/catch everywhere; just mentioning it in case it could also be a root cause.

Yneth commented 2 years ago

@KingMob I guess allowing the user to specify the whole pool would be nice, but it would require making the create-connection function public. Not sure if it is OK to expose that, so I decided to go with the simplest draft solution.

KingMob commented 2 years ago

@KingMob I guess allowing the user to specify the whole pool would be nice, but it would require making the create-connection function public. Not sure if it is OK to expose that, so I decided to go with the simplest draft solution.

I'm generally OK with making it public, but leaving it out of the generated documentation via ^:no-doc if needed. Most people won't need to replace the pool.

Also, if you can generate a memory profile, we'd still really appreciate it. It's harder to debug this without one.

KingMob commented 2 years ago

@Yneth Any updates? After doing some work with Dirigiste recently, I want your PR. 😄

Yneth commented 2 years ago

will send an update later today

arnaudgeiser commented 2 years ago

Any update from your side?

What we actually need from you, to determine whether the issue is related to io.aleph.dirigiste.Stats$**, is the number of those instances at the moment your memory starts growing.

I wasn't able to reproduce on my side but the expected pattern is this:

[screenshot]

[screenshot]

I have three io.aleph.dirigiste.Executor instances (4 reservoirs each) plus one io.aleph.dirigiste.Pool (4 reservoirs), i.e. 16 reservoirs in total. Those are actually capped at 32'892 bytes (4096 * 8 bytes = 32'768, plus a few other references).

From the screenshot you provided, we can only determine that a lot of longs have been allocated on the heap, but not what the real issue is.

arnaudgeiser commented 2 years ago

Ok.. ok... I'm able to create a pattern where it becomes problematic. With the following code, I can create tons of Reservoirs:

[screenshot]

(require '[aleph.http :as http])
(require '[manifold.deferred :as d])
(require '[clj-commons.byte-streams :as bs])

(comment
  ;; local server to hit
  (http/start-server (constantly {:body "hello!"}) {:port 8081})

  ;; fire `nb` requests, each against a distinct random subdomain,
  ;; so every request ends up under its own pool key
  (defn get [nb]
    (apply d/zip (mapv (fn [_] (d/chain (http/get (format "http://%s.localhost:8081/" (java.util.UUID/randomUUID)))
                                        :body
                                        bs/to-string)) (range nb))))

  ;; loop forever, 10'000 fresh domains per iteration
  (d/loop [d (get 10)]
    (d/chain' d
              (fn [_]
                (prn "Done!")
                (Thread/sleep 300)
                (d/recur (get 10000))))))

Each distinct domain creates a new Reservoir.

[screenshot]

They are supposed to be cleaned up by the following code

// clear out any unused queues
_lock.lock();
for (Map.Entry<K,Stats> entry : _stats.entrySet()) {
    K key = entry.getKey();
    if (entry.getValue().getUtilization(1) == 0
        && _queues.get(key).objects.get() == 0) {
        _queues.remove(key).shutdown();

        // clean up stats so they don't remain in memory forever
        _queueLatencies.remove(key);
        _taskLatencies.remove(key);
        _queueLengths.remove(key);
        _utilizations.remove(key);
        _taskArrivalRates.remove(key);
        _taskCompletionRates.remove(key);
        _taskRejectionRates.remove(key);
    }
}
_lock.unlock();

https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L331-L332

It does its job when the conditions are met. I was putting the system under stress here, but the issue might be somewhere in the following lines of code, where some objects are not removed from the queues:

https://github.com/clj-commons/aleph/blob/master/src/aleph/http.clj#L321-L375

I would be interested to know if you have the same pattern @Yneth.

Just a hunch: not a bug in Dirigiste, but most likely in the Aleph client.

arnaudgeiser commented 2 years ago

And as @Yneth mentioned:

Also, I see an "uncaught error" warning from Manifold even though I am using m/catch everywhere; just mentioning it in case it could also be a root cause.

@KingMob : I would be interested to know your opinion about my assumptions.

KingMob commented 2 years ago

@arnaudgeiser Your test case, which crawls a lot of domains, reminds me a lot of:

  1. https://github.com/clj-commons/aleph/issues/394#issuecomment-404400781 in #394
  2. https://github.com/clj-commons/aleph/issues/461#issuecomment-453145969 in #461

As for the objects not being cleaned up:

  1. Dumb question: Manually running the GC didn't clean any of that up, I assume?
  2. If you look here, the connection pool is keyed off the host. This makes sense, since the connection is immediately usable (returning the conn for different hosts would enable us to avoid object creation, but would still require changing fields, and reconnecting), but the consequence is no pooling between hosts. So I'm not surprised to see usage going up. OTOH, in your example, it looks like 10 hosts generated 1.3GB, which seems high. Hmmm.
  3. The code you linked to in aleph.http/request will 99% of the time release, not dispose. Most normally-ended HTTP/1.1 connections are "Connection: keep-alive", so they won't meet the criteria for disposal.
  4. The check for getUtilization(1) is checking the 100th percentile of the utilization stats. This means that all utilizations in the stats time window must be 0. The reservoir size is 4096 items, and if the sample period is the default 10ms, then we should expect it to take ~41s of no usage for the reservoir to become all zeros (see the sketch below). If looking at the memory usage too early is the problem, then lowering the sample period may help. It might be worth checking for bugs here, though; I know there have been some utilization calc issues in the past. (clj-commons/dirigiste#23)
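
A quick back-of-the-envelope check of that ~41s figure, assuming the default 10ms sample period mentioned above:

(* 4096 0.010) ;; => 40.96 seconds of zero usage before the whole reservoir reads zero
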
KingMob commented 2 years ago

OK, I misread a couple of things.

Your code is recurring with another 10000 domains. In my tests, this works out to ~770k per domain. Still highish, but there are 5 major maps of domain->reservoir, and each reservoir has an array of 4096 longs/doubles, so we have 10010 x 5 x 4096 x 8 bytes = ~1.6 GB. 770k per domain is still high, though; I know Dirigiste is tunable, so we could probably make it consume less.
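
The multiplication above, spelled out:

(* 10010 5 4096 8) ;; => 1640038400 bytes, i.e. roughly 1.6 GB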

I also took a closer look at the utilizations, which is where it gets interesting. With one conn per domain, it's continually computing the utilization level as 2 (if I'm computing Double/longBitsToDouble correctly). In the code, it seems that queueLength is always 1, available is always 0, and objects is always 0. (Not sure what objects represents... I think it's the number of free objects.) Anyway, that means double utilization = 1.0 - ((available - queueLength) / Math.max(1.0, objects)); is always 2.
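
A small sketch of that arithmetic with the values observed above (available = 0, queueLength = 1, objects = 0); the helper name is only for illustration:

(defn utilization [available queue-length objects]
  (- 1.0 (/ (- available queue-length) (max 1.0 objects))))

(utilization 0 1 0) ;; => 2.0, one idle connection per domain
(utilization 0 3 0) ;; => 4.0, matching the three-connections-per-domain test below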

It feels like there's a bug here. Regardless of whether the conn is released or disposed, you'd think an unused connection shouldn't count towards utilization. I'm not sure if it's an edge artifact of having exactly 1 item per domain, or if this has always been a problem, and that's just where we see it first, when testing zillions of domains.

KingMob commented 2 years ago

Just forced it to make 3 conns per domain, and all I got was endless utilization values of 4. I need to set this aside for now.

Even if it's a leak, it mostly affects people talking to many different domains, so I'm not sure how high priority it is.

arnaudgeiser commented 2 years ago

Do not get me wrong, everything is fine with my test case. I'm only in trouble because I'm overwhelming the pool; eventually, everything will be garbage collected if I stop making requests. I just wanted to demonstrate a situation where a lot of memory could be allocated on the heap.

This pattern, which is not something we should expect from a production system, is a situation where we could use a lot of memory.

But again, without a heap dump, we are stuck making hypotheses which might be wrong.

arnaudgeiser commented 2 years ago

Just forced it to make 3 conns per domain, and all I got was endless utilization values of 4. I need to set this aside for now.

Ah yes, interesting lead here.

arnaudgeiser commented 2 years ago

I will spend some time writing unit tests for Dirigiste because some parts of the code really seem to be off.

A few examples:

public void sample(double n) {
    int cnt = _count.incrementAndGet();
    if (cnt <= RESERVOIR_SIZE) {
        _values.set(cnt-1, Double.doubleToLongBits(n));
    } else {
        int idx = ThreadLocalRandom.current().nextInt(cnt);
        if (idx < RESERVOIR_SIZE) {
            _values.set(idx, Double.doubleToLongBits(n));
        }
    }
}
  1. We can overflow here

    (def atomic-integer (java.util.concurrent.atomic.AtomicInteger. Integer/MAX_VALUE))
    (.incrementAndGet atomic-integer) ;; -2147483648
    (.incrementAndGet atomic-integer) ;; -2147483647
  2. Then, it will just blow up the Reservoir

    (def atomic-long-array (java.util.concurrent.atomic.AtomicLongArray. 4096))
    (.set atomic-long-array -2147483648 (Double/doubleToLongBits 0.5)) ==> ArrayIndexOutOfBoundsException
  3. Once the reservoir is filled, you have no guarantee the last idx (position 4095) will be set to the current utilization, as we randomly pick a slot over the 4096 positions.

https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Stats.java#L61-L71

The last idx (position 4095) is important here because we rely only on this one to determine if the queue is no longer in use.

if (t == 1.0) {
  return (double) vals[cnt-1];
}

https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Stats.java#L166-L168

KingMob commented 2 years ago

Yeah, there's some weird stuff going on, with few tests, and a lot of that is why I'm not a fan of Dirigiste.

Based on a sampling period of 10ms, we should expect overflow and crash after ~8 months left untouched. However, the stats are meant to be replaced periodically. For both Executor and Pool, in startControlLoop, the stats reservoirs are replaced once each control loop by updateStats(). For Pool.updateStats, the .toMap() calls are actually destructive(!). It remove()s the keys as it builds the map. So theoretically, overflow isn't an issue, but I find Dirigiste's logic indirect enough that I wouldn't bet on that.
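
For the curious, the ~8-month figure falls out of one sample per 10ms until an int overflows:

(/ (* Integer/MAX_VALUE 0.010) 60 60 24) ;; => ~248.5 days, roughly 8 months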

Once the reservoir is filled, you have no guarantee the last idx (position 4095) will be set to the current utilization as we randomly get a slot over the 4096 position.

Actually, it's weirder than that. It picks a random number from 0 to cnt-1, and only if that's < 4096 does it store the sample at all. This has the effect of updating the reservoir less and less frequently as time goes on, until it's reset. The longer the control period, the less likely it is that later samples will have any effect. I'm not sure what the rationale is versus using a ring buffer, and the commit messages are no help.
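
A tiny sketch of how that probability decays (4096 is the RESERVOIR_SIZE from Stats.java above; the helper name is only for illustration):

;; chance that the n-th sample of a control period is stored at all
(defn store-probability [n] (min 1.0 (/ 4096.0 n)))

(store-probability 4096)   ;; => 1.0
(store-probability 40960)  ;; => 0.1
(store-probability 409600) ;; => 0.01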

That being said, using the value in the last position when looking at the percentiles is fine. Inside the reservoir's toArray() fns, you'll see it actually returns them sorted (!), so looking at the last element will always be the maximal utilization. Dirigiste isn't considering the most recent utilization value, it's considering the highest utilization in the control period.

The problem is that the utilization reservoir contains the same value everywhere, so it doesn't matter what percentile we look at. (The behavior above where later samples in the control period are less likely to be recorded is also a problem, but if the stats are properly reset, it should still work once we have a fallow period.)

So either a) the utilization is never being calculated correctly in our scenario (all scenarios?), or b) it's calculated correctly, but the stats aren't being reset, and thus by the time utilization drops to zero, the odds of it being recorded are low, and the odds of all non-zero samples being displaced is vanishingly low. I think it's more likely to be a; if b were true, I should be able to find other values somewhere, but I can't find any.

It's really annoying that .toArray and .toMap aren't simple transformation fns. Sorry you lost time on that.

arnaudgeiser commented 2 years ago

Thanks for the explanations, I missed so many things here.

For both Executor and Pool, in startControlLoop, the stats reservoirs are replaced once each control loop by updateStats().

Woow... at least that explains the "high" allocation rate I was seeing.

For Pool.updateStats, the .toMap() calls are actually destructive(!). It remove()s the keys as it builds the map.

Right

Dirigiste isn't considering the most recent utilization value, it's considering the highest utilization in the control period.

Understood, I missed the fact that toArray sorts the values. :face_with_spiral_eyes:

The problem is that the utilization reservoir contains the same value everywhere, so it doesn't matter what percentile we look at. (The behavior above where later samples in the control period are less likely to be recorded is also a problem, but if the stats are properly reset, it should still work once we have a fallow period.)

~This is currently bugging me. There is a correlation with Pool.updateStats, but according to what you explained, it's like most of the time the 4096 positions are totally useless. I need to clear my head here.~

Nevermind, got it.

So either a) the utilization is never being calculated correctly in our scenario (all scenarios?), or b) it's calculated correctly, but the stats aren't being reset, and thus by the time utilization drops to zero, the odds of it being recorded are low, and the odds of all non-zero samples being displaced is vanishingly low. I think it's more likely to be a; if b were true, I should be able to find other values somewhere, but I can't find any.

Let's figure this out then!

KingMob commented 2 years ago

Keep in mind, this is still low priority to my way of thinking. The Dirigiste pooling problem only applies to outgoing client connections, where talking to zillions of different servers is unlikely. The server uses Netty to pool incoming connections/channels, so it should be unaffected by Dirigiste's pool stats issue. The most likely scenario is that a few outside servers are talked to and a few MBs are wasted, but that's it.

If you're burning to fix it, I won't stop you though 😄

arnaudgeiser commented 2 years ago

Yes, I agree. But some people are currently having issues with io.aleph.dirigiste.Pool even though it's unlikely they are talking to a lot of different servers.

By adding tests, I expect to see patterns where it could go wrong without having to communicate with thousands of different domains.

Going back to this PR, I'm still not sold on it, but I can see why people wouldn't want to use Dirigiste in the current situation. According to the discussion we just had, it's quite a heavy machine under the hood. Let's provide an escape hatch then!

I know you want to sunset Dirigiste, but I would prefer stabilizing it. The code base is not that big and the rationale behind it is still good IMO.

If we have no news from @Yneth in the following days, I will take over this PR.

KingMob commented 2 years ago

I agree that other memory leaks of Dirigiste should be fixed. I'm just afraid that the pool leak is a red herring for their other issues.

And while I'd like to sunset it, that's more of a "next version" goal. Replacing it in current Aleph makes me think of these lines from the movie State and Main:

... you come by next Sunday, we'll give you a second chance.

Sister, the only second chance I know is the chance to make the same mistake twice.

What kind of escape hatch are you thinking of, if you don't want to use this PR?

arnaudgeiser commented 2 years ago

I agree that other memory leaks of Dirigiste should be fixed. I'm just afraid that the pool leak is a red herring for their other issues.

I'm also afraid of this!

What kind of escape hatch are you thinking of, if you don't want to use this PR?

We'll use this PR. But I am resigned, not convinced!

KingMob commented 2 years ago

@Yneth Are you still interested in this?

Yneth commented 2 years ago

Gentlemen, excuse the late response.

Regarding the PR, I do not like it. Almost everything I've been thinking of is not very user-friendly. Regarding Dirigiste, I personally think it would be nice to fix it in the first place.

My use case involves querying up to 1000 hosts that change on a daily basis. Think of a service like https://check-host.net/ that tracks the status of services every X minutes.

I do not promise anything, but I will try to provide a heap dump later today.

Yneth commented 2 years ago

The pool factory function depends on local variables of the API function, as well as on some private functions. Of course there should be a way to override it, but in a user-friendly way.

My temporary solution was to copy-paste some parts and roll my own function that creates a Dirigiste pool without stats. It works fine, so the PR itself is not high priority for me.

KingMob commented 2 years ago

@Yneth Don't restrict yourself on where or how to fix this. We're open to the right solution, and if you've followed the conversation so far, the most likely correct place is where Dirigiste computes pool utilization and/or cleans up old pool objects.

Also, based on your initial screenshot, which shows the leak is from the Stats reservoirs, I'm pretty sure Arnaud and I are looking at the same leak, so I don't think we need a dump anymore. (Of course, if that's not your leak, please let us know.)

arnaudgeiser commented 2 years ago

Also, based on your initial screenshot, which shows the leak is from the Stats reservoirs, I'm pretty sure Arnaud and I are looking at the same leak, so I don't think we need a dump anymore. (Of course, if that's not your leak, please let us know.)

In any case, the heap dump can help us diagnose whether we are making the correct assumptions. We'll be able to track the state of those Stats.

arnaudgeiser commented 2 years ago

@Yneth : Are you hitting the same ports on those hosts?

https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L97

arnaudgeiser commented 2 years ago

I spent hours adding (JUnit) tests to Dirigiste and I'm starting to have a better understanding of the situation and the meaning of those different values. (Consider what's below a partial understanding of the situation; I needed to share some thoughts, some of which still need to be validated.)

So..

Utilization

Theoretically, this calculation is supposed to be at least partially correct:

double utilization = 1.0 - ((available - queueLength) / Math.max(1.0, objects));

Considering no objects are present in the pool at the beginning:

[1] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L269

PR : https://github.com/clj-commons/dirigiste/pull/32

Reservoir size

I reached the conclusion that using a fixed value of 4096 is a waste.

static final int RESERVOIR_SIZE = 4096;

We should use controlPeriod/samplePeriod instead.

By default

PR : https://github.com/arnaudgeiser/dirigiste/pull/1/files
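
Purely to illustrate the ratio (the period values below are hypothetical, not Dirigiste's actual defaults): with a 10ms sample period and a 1s control period, only controlPeriod/samplePeriod slots can ever be filled between two stat resets.

(let [control-period-ms 1000
      sample-period-ms  10]
  (/ control-period-ms sample-period-ms)) ;; => 100 slots needed, vs. a fixed 4096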

Use a Ring Buffer instead

While it can be seen as an optimization, replacing the stats every second is not ideal. A Ring Buffer might be more suitable for this use case.

PR : https://github.com/arnaudgeiser/dirigiste/pull/2
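
A minimal sketch of the ring-buffer idea (not Dirigiste's actual API, just the indexing scheme): every sample lands at (mod n size), so older samples are overwritten instead of being probabilistically dropped.

(defn ring-write [values n v]
  (assoc values (mod n (count values)) v))

(-> (vec (repeat 8 0))
    (ring-write 9 42)) ;; => [0 42 0 0 0 0 0 0]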

Caution

I may invalidate the above information as I progress. I'm still missing a lot of tests on the Pool, but I wanted to write down my hypotheses first.

arnaudgeiser commented 2 years ago

So either a) the utilization is never being calculated correctly in our scenario (all scenarios?), or b) it's calculated correctly, but the stats aren't being reset, and thus by the time utilization drops to zero, the odds of it being recorded are low, and the odds of all non-zero samples being displaced is vanishingly low. I think it's more likely to be a; if b were true, I should be able to find other values somewhere, but I can't find any.

Long story short:

It's definitely a. When using a utilization controller, the objects are removed as soon as they are not in use, which means we'll always end up in the concerning scenario (0 objects, 0 available) => utilization of 1.

The fix might be as simple as this:

double getUtilization(int available, int queueLength, int objects) {
    if(objects==0 && queueLength==0) {
        return 0;
    }
    return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
}
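
Translated into a quick REPL check (a sketch of the proposed guard, not the final Dirigiste code):

(defn fixed-utilization [available queue-length objects]
  (if (and (zero? objects) (zero? queue-length))
    0.0
    (- 1.0 (/ (- available queue-length) (max 1.0 objects)))))

(fixed-utilization 0 0 0) ;; => 0.0, an idle key can now be detected and cleaned up
(fixed-utilization 0 1 0) ;; => 2.0, a pending acquire still reads as over-utilization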

However, we are now subject to a race condition (yay...). The Queue we get on .acquire [1] might be destroyed at any moment by the startControlLoop [2], and we might end up with an IllegalStateException... already shutdown. [3]

So close and yet so far.

On the bright side, I've been able to cover around 90% of the Pool code base.

[1] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L409 [2] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L333 [3] : https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L163

arnaudgeiser commented 2 years ago

I finally reached the bottom of this.

Here is the related PR on Dirigiste : https://github.com/clj-commons/dirigiste/pull/32

arnaudgeiser commented 2 years ago

The PR is ready for review. I addressed @KingMob's comments.

Even if we fix the dirigiste bugs, it would make sense to be able to implement your own Pool and Controller.

Sold!

KingMob commented 2 years ago

Merged.