APCng storage - removed APCUIterator usage while collecting metrics, removed metrics collecting lag

Rastusik commented 1 year ago

The algorithm is simple - it uses an atomic counter to count all stored metrics metadata keys, and stores the keys into a counted list (multiple data keys with a counter pattern)

LKaemmerling commented 1 year ago

Hey @Rastusik,

first of all, thank you, but what do you want to achieve with this change? So basically what is the intention behind it?

Rastusik commented 1 year ago

hi @LKaemmerling , I just wanted to get rid of the last apcuiterator occurence in the metrics collecting routine. It doesn't have to run ever, and the price for it is small. And as a bonus, there is no more need for the caching TTL for the meta data, because the meta data cache will be repopulated effectively after addition of new metric (during the next collecting routine run)

Rastusik commented 1 year ago

@LKaemmerling what do you think? Is this contribution acceptable? Shall I fix the tests? Although they are not in any way connected to the changes

LKaemmerling commented 1 year ago

@Rastusik of course every contribution is welcome, but I still don’t understand what is the exact benefit from this change? Is it Performances reason? If so please provide benchmarks

Rastusik commented 1 year ago

@LKaemmerling the primary reason is that I got rid of this code:

private function scanAndBuildMetainfoCache(): array
    {
        $arr = [];
        $matchAllMeta = sprintf('/^%s:.*:meta/', $this->prometheusPrefix);
        foreach (new APCUIterator($matchAllMeta) as $apc_record) {
            $arr[] = $apc_record['key'];
        }
        if ($apc_ttl >= 1) {
            apcu_store($this->metainfoCacheKey, $arr, $apc_ttl);
        }

The new algorithm makes it possible to build metainfo cache without scanning the whole apcu content. Other improvement is that there is no reason any more to periodically regenerate metainfo cache after some TTL period, the metrics are all regenerated during the first metrics collection run after each metric addition. It is possible because the new metainfo cache generation algorithm has complexity O(N), where N is the metrics count. The old algorithm has also O(N) complexity, but N is the count of apcu entries. This is a problem for one of our projects which stores 1GB of entries in APCU. AcpuIterator creates unnecessary locks for our use case, which means that several web requests take 5 seconds during first cache generation run. I know that the general case has already been solved in the APCng adapter, this is just a small improvement to get rid of all the apcu scanning.

I don't have any benchmarks, because this code just handles the special case of the first generation of metainfo cache. Basically the benchmarks would be the same as before with APCng, this is just an improvement of the first metrics collection run.

Feel free to ask any question, I'm not sure if I was able to explain the improvement good enough.

Rastusik commented 1 year ago

@LKaemmerling I added some other minor improvements and fixed the tests

regarding the intent behind all these changes: I'm working on projects which are built on Swoole. the apps collect metrics per app instance in APCu. These apps are relatively highly performant (10 swoole worker processes, 20 reqs/s), and they store around 150 metrics in each request, and there is potential for more metrics and more concurrent processes. that is why I needed to get rid of as much blocking as possible. this code already is deployed in production and works flawlessly.

what do you think about this?

LKaemmerling commented 1 year ago

@Rastusik generally said I like the idea! I'm currently quite sick, so I will look into it when I'm fully recovered hopefully at the end of next week/the week after.

Rastusik commented 1 year ago

@LKaemmerling sorry for disturbing, I was just expecting some/any feedback, thank you for clarification. Get well soon ;)

LKaemmerling commented 1 year ago

@Rastusik thank you! This looks fine from my side even if I'm not experienced with APC(u), also the tests pass (with changes that are understandable).

ashandi commented 4 months ago

Hello guys!

Seems like these changes caused the following issues : https://github.com/PromPHP/prometheus_client_php/issues/124 https://github.com/PromPHP/prometheus_client_php/issues/126

@Rastusik, do you have any idea why some of the keys can not be found in the cache during work o the scanAndBuildMetainfoCache function ?

Rastusik commented 4 months ago

@ashandi I can't remember the details, because this work was done long ago, but I believe that the problem is with your APCu php settings. ttl=0 doesn't really mean anything, when the APCu size hits the configured limit, all the data is expunged (I believe this is the term used in apcu source code). This code expects that data won't be expunged from the cache and the expectation is to configure enough RAM for APCu if ttl=0 is used, or use shorter a defined ttl so the cache size will never be greater than configured limit. The old algorithm worked well, but if the cache needed to be regenerated, we expected 1+ second delays in PHP processing, that's why this change was introduced. I can't really remember the proper apcu configuration because I no longer work for the same company, but it should be straightforward to check the apcu source code and get the idea.

ashandi commented 4 months ago

Hey @Rastusik, thank you for the response!

You are right, ttl=0 means that the value will persist until it is removed from the cache by the system (restart or memory limit exceeded). So we caught the exception because we probably have too small an APCu memory size. My main point here is the following: this library shouldn't raise an exception if one specific metric is missed. As I can see, all the other adapters like Redis just ignore the situation when a metaKey is missing and continue to collect the other metrics. It would be very kind of you to take a look at my pull request and give it a thumbs-up, or explain why it might not be the best solution. Thank you.

Rastusik commented 4 months ago

hey @ashandi , I don't really remember what our problem was in details, I just know that there were some apcu deadlocks when we were losing data from apcu (when they were expunged). The point was that when one php process found out that some key is missing in the cache, because it was expunged, it wanted to add it... but if the php process hit the point in time when a different php process was running the expunge algorithm in apcu, there were deadlocks. We had a performance intensive application with cca 80 requests/s with 10 php processes on swoole and the deadlocks brought the application pod down, so we needed to make sure that the expunge algorithm would never be triggered. that's why we were setting the ttl (and maybe there were 2 different ttls) to a very high non-zero value, probably some months into to future. The root cause of the problem seems to be the apcu expunge algorithm and this was our workaround.

I'm not really able to tell you if the changes from your pull request will break something or not. According to what I just wrote it seems like with proper apcu settings with no expunges triggered, the condition that you are describing shouldn't occur. If you want to have some custom apcu settings and it's improbable that you will hit the deadlocks, maybe your changes will make sense - it looks like the data counters will just be reset. maybe prometheus knows how to handle those resets already.

From what I remember, I wanted to avoid the expunge of the metrics data from apcu, but I can't really remember why it was the case and have no time to work it out.

I would recommend you to try to use your implementation in production and see if it works, because this fix can be problematic and other fixes will need to follow it, so it would be better to just test and improve the whole batch of fixes before releasing them. And I also recommend you to check the apcu extension C code, it's relatively simple and you will get the idea what's happening underneath the surface.

ashandi commented 4 months ago

@Rastusik, I don't think that I can manage ttl settings at all because all the apcu_add, apcu_store functions are used with ttl=0 that overwrites the setting from php.ini. However, it's true that I shouldn't catch this exception if I set a necessary size of memory for APCu because the number of metrics is limited and shouldn't grow with time. Anyway, thank you for your explanation, I'll try to increase my apc.shm_size and see what will happen.

PromPHP / prometheus_client_php

APCng storage - removed APCUIterator usage while collecting metrics, removed metrics collecting lag #96