ctdk / goiardi

A Chef server written in Go, able to run entirely in memory, with optional persistence by saving the in-memory data to disk or by using MySQL or Postgres as the data storage backend. Docs: http://goiardi.readthedocs.io/en/latest/index.html
http://goiardi.gl
Apache License 2.0

Excessive memory pressure during cookbook upload #77

Open. alekc opened this issue 4 years ago

alekc commented 4 years ago

Describe the bug

While doing a migration from Chef to Goiardi, we encountered a significant memory issue during knife upload cookbooks/. (We are running multiple pods with S3 & PostgreSQL.)

[attached image]

[attached image]

heap.out.zip

By the look of it, there are at least several points that should be looked into.

@ctdk, your thoughts on this? Do you know if there is any kind of RFC for Chef's request/response objects?

P.S. The drop you see in the graph is because the container is OOM-killed and restarted. Also, in our case, the RSS usage might potentially be caused by https://golang.org/doc/go1.12#runtime
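
(As an aside on that last point: a quick way to tell real heap retention apart from memory the Go runtime has already freed but the OS still counts toward RSS is to watch the runtime's own memory stats. This is a generic, minimal sketch, not goiardi-specific:)

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Print the runtime's view of the heap every 30 seconds. HeapInuse is
	// memory occupied by live (or not-yet-collected) objects; HeapIdle minus
	// HeapReleased is memory the runtime holds but has not returned to the
	// OS, which inflates RSS without being a leak.
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("HeapInuse=%d MiB HeapIdle=%d MiB HeapReleased=%d MiB NumGC=%d\n",
			m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.NumGC)
		time.Sleep(30 * time.Second)
	}
}

If HeapInuse stays high, the problem is genuine retention; high RSS with low HeapInuse points at the Go 1.12 MADV_FREE behaviour, which can be reverted on Linux with GODEBUG=madvdontneed=1.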

ctdk commented 4 years ago

Interesting indeed. These are all very good points. One question: what kind of search do you have configured? The custom in-mem search can use a lot of memory when there are a lot of cookbooks; if you're not using the postgresql search it's something to consider. (Remember to VACUUM periodically, and I am very open to suggestions there as well. I suspect there are some indexes that aren't adding much to the search.)

In regards to Decoder vs. Unmarshal, and unmarshalling in general, those changes are likely doable; I haven't revisited that code much in a while. The 1.0.0 dev branches might have changes to address this, too, but I don't remember off the top of my head. Someday, I'll get that finally done. There are a couple of potential sticking points with handling mangled cookbooks and how the format of the cookbooks themselves is a bit of a moving target. I'll push this on the stack of investigations.

alekc commented 4 years ago

Nope, it's not an in-memory issue. We are running with:

GOIARDI_PG_SEARCH: true
GOIARDI_USE_POSTGRESQL: true

Also, the cluster was not in production, so it was not running any operations besides the cookbook upload. I believe another thing that needs to be investigated is this:

func (c *Cookbook) deleteHashes(fhashes []string) {
    /* And remove the unused hashes. Currently, sigh, this involves checking
     * every cookbook. Probably will be easier with an actual database, I
     * imagine. */

Even though we are using a database (and such information could "probably" be found with a single query), at the moment we are still fetching all cookbooks and (if I am not wrong) decoding blobs from them.

Hopefully we will get to tackle this in the course of the current or next sprint; I will post updates if I have any.

P.S. What about Chef JSON schemas? Do you know if they exist, by any chance?

ctdk commented 4 years ago

*nod* That information will help a lot. Re: deleteHashes, that should only be happening when it's running in-mem. I'll take a look there too to confirm, though; I don't always update comments like I should.

In re: Chef JSON schemas, I'm not certain. As I recall, I worked off the Chef Server API documentation of the time, along with looking at chef-zero and the chef-pedant tests. Sometimes the documentation and the actual Chef tools (not just pedant, but things like knife) disagreed. The current API docs would be the first place I'd look, with the caveat that some endpoints are missing and some of the listed endpoints don't include all of their potential methods.

alekc commented 4 years ago

Turns out it's worse than I originally thought.

Originally I assumed that the issue was caused by goiardi not releasing heap objects fast enough (I was hammering it, after all), so I ran knife upload cookbook --concurrency 1, uploaded a couple of cookbooks, and stopped the process.

After approximately 10 minutes I was still seeing this picture: [attached image]

Also, the heap was pretty full of old objects that were not released, so the issue is that (at least when using the DB) there is either a memory leak somewhere or the GC is not running properly.
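
(For reference, and not necessarily how the attached heap profile was produced: a generic way to pull heap profiles from a running Go process is the net/http/pprof handler. A minimal sketch:)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a side port; the real application
	// would be doing its normal work elsewhere in the process.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the real server's main loop
}

A retained-memory view can then be pulled with go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap; inuse_space shows what is still held onto, while alloc_space shows cumulative allocations.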

Given the severity (at least for our scenario), I am going to actively work on this over the next few days; I will update once I have anything.

ctdk commented 4 years ago

I'm on a little vacation with my family, and have less internet access than I expected, but I did sit down and take a look at the cookbook memory issue you reported and I believe I've found the problem.

The Decode vs. Unmarshal issue isn't solved yet; I'm unsure whether I used Decode there because I wrote that code early on while I was still learning how Go works, or because cookbooks can be very poorly formatted but still need to be accepted.

However, the cookbook objects also keep a cached copy of the cookbook version inside themselves. This grew out of the in-memory implementation, of course, but honestly isn't even ideal there. Uploading a cookbook keeps it hanging around in there, and if a version isn't present it will be grabbed from the DB and stuck in the cache. The memory overhead and risk of inconsistent results are far too high to keep this around, I believe, so I'm going to clip out this bit of grossly premature optimization.
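
(A rough illustration of why such an in-object cache pins memory; the names and shapes below are hypothetical and simplified, not goiardi's actual types:)

package main

// CookbookVersion stands in for the parsed cookbook version data, including
// the large decoded JSON metadata.
type CookbookVersion struct {
	Version  string
	Metadata map[string]interface{}
}

// Cookbook keeps a version cache inside itself. As long as the Cookbook is
// reachable, every cached CookbookVersion stays reachable too, so the GC can
// never reclaim it, even when nothing else uses that version anymore.
type Cookbook struct {
	Name     string
	versions map[string]*CookbookVersion
}

// getVersion is a cache-through lookup: a miss loads from the database and
// then keeps the result for the lifetime of the Cookbook object.
func (c *Cookbook) getVersion(v string) (*CookbookVersion, error) {
	if cbv, ok := c.versions[v]; ok {
		return cbv, nil
	}
	cbv, err := loadVersionFromDB(c.Name, v) // hypothetical helper
	if err != nil {
		return nil, err
	}
	if c.versions == nil {
		c.versions = make(map[string]*CookbookVersion)
	}
	c.versions[v] = cbv // this line is what keeps the memory pinned
	return cbv, nil
}

// loadVersionFromDB is a stub so the sketch compiles; the real code would
// query MySQL/Postgres here.
func loadVersionFromDB(name, version string) (*CookbookVersion, error) {
	return &CookbookVersion{Version: version, Metadata: map[string]interface{}{}}, nil
}

func main() {}

Dropping (or bounding) that cache means versions loaded during an upload become garbage as soon as the request finishes.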

It does explain some trouble people have had in the past at least. Thank you again for bringing it up, good sir.


alekc commented 4 years ago

Sadly, it's not only that. If I am understanding you correctly, you are talking mostly about the cookbook -> cookbook version association. In theory (since I am working completely on S3/PostgreSQL), such objects would eventually be collected by the GC, since they are not referenced anywhere (parent cookbooks are not being kept in a cache, as far as I can see).

I believe most of the issues come from a combination of anonymous JSON and its serialization/deserialization/casting. I ended up rewriting the whole section related to cookbook handling (it's still a work in progress), and so far I have found at the very least the following issues affecting performance:

1) When we check whether we need to clear up any hashes, we load all cookbooks and then, for every cookbook, launch a query fetching all its cookbook versions. In our case that's 715 cookbooks, so 715 queries against the DB. I replaced that bit with a single query which fetches all cookbook versions and associates them with the proper cookbooks (see the sketch after this list). It could be improved further with an IN (?) clause, but sadly the generic sql driver doesn't support expanding one, and for now I do not want to import sqlx just yet.

2) Later on, we repeat fetching all the cookbook versions even though we already have them, so that's another 715 queries.

3) At some point we check whether all files in the cookbook have been uploaded. Such a check is not an issue normally, but in the case of S3 we end up checking, say, 50 files in one single goroutine, which is a problem from a performance point of view. I believe this is overkill, since knife normally checks through the sandbox whether everything has been uploaded, and I cannot imagine a situation (well, besides a race condition, which cannot be fixed by just rechecking) where this can break. I have commented out the check block for now.

4) Some weird comparison in the check for hashes (whether they have to be deleted or not), i.e. https://github.com/ctdk/goiardi/blob/3328c0eef8b27de604a22c2b602fbbc91df3ea0e/cookbook/cookbook.go#L832

5) And, uhm, this (note that a failed delete is swallowed: on error, it returns nil instead of the error):

// DeleteVersion deletes a particular version of a cookbook.
func (c *Cookbook) DeleteVersion(cbVersion string) util.Gerror {
    if config.UsingDB() {
        err := cbv.deleteCookbookVersionSQL()
        if err != nil {
            return nil
        }
    }

6) For now I am trying to replace the generic JSON with proper structures and use json.Unmarshal; we will see where it gets me.
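
A minimal sketch of the single-query idea from point 1, assuming hypothetical table and column names (goiardi's real Postgres schema will differ): fetch every cookbook version in one round trip and group the rows in memory, instead of issuing one query per cookbook.

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver would do
)

// fetchAllVersions loads every cookbook version in a single query and groups
// the version strings by cookbook name. Table/column names are hypothetical.
func fetchAllVersions(db *sql.DB) (map[string][]string, error) {
	rows, err := db.Query(`
		SELECT c.name, cv.version
		  FROM cookbook_versions cv
		  JOIN cookbooks c ON c.id = cv.cookbook_id`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	byCookbook := make(map[string][]string)
	for rows.Next() {
		var name, version string
		if err := rows.Scan(&name, &version); err != nil {
			return nil, err
		}
		byCookbook[name] = append(byCookbook[name], version)
	}
	return byCookbook, rows.Err()
}

func main() {
	db, err := sql.Open("postgres", "dbname=goiardi sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	byCookbook, err := fetchAllVersions(db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("loaded versions for %d cookbooks with one query\n", len(byCookbook))
}

Restricting this to a subset of cookbooks with an IN (...) clause is still possible with plain database/sql by building the $1, $2, ... placeholder list by hand; sqlx's In helper merely automates that.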

ctdk commented 4 years ago

I spent some time researching the Decode vs. Unmarshal thing too, and that might not be as straightforward as it seems. From my research, it looks like Decode's better to use when decoding a stream of data like reading from r.Body, because Unmarshal requires loading the entire byte slice into memory. I was also looking around for anything on JSON decoding related memory leaks, and it appears that the pprof info can sometimes be deceptive; the memory is allocated in those methods, but they're being held onto elsewhere. That's why I was looking at the cookbook -> cookbook versioning map.
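
(To make that tradeoff concrete, a small sketch; the handler and payload type are illustrative, not goiardi's actual code. Both paths allocate the decoded value, which is why pprof can point at the JSON package even when the memory is retained elsewhere; the difference is the extra full-body buffer that Unmarshal needs.)

package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
)

// cookbookUpload stands in for whatever structure the uploaded cookbook JSON
// is decoded into; goiardi's real types differ.
type cookbookUpload map[string]interface{}

// unmarshalHandler buffers the whole request body before parsing it, so the
// raw payload and the decoded value sit in memory at the same time.
func unmarshalHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var cb cookbookUpload
	if err := json.Unmarshal(body, &cb); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusCreated)
}

// decoderHandler parses straight off the request body stream, skipping the
// extra full-body copy. The decoded value is still allocated either way.
func decoderHandler(w http.ResponseWriter, r *http.Request) {
	var cb cookbookUpload
	if err := json.NewDecoder(r.Body).Decode(&cb); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusCreated)
}

func main() {
	http.HandleFunc("/unmarshal", unmarshalHandler)
	http.HandleFunc("/decode", decoderHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}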

Also, while I was looking over the cookbook version creation, I realized a big part of why I had done it that way: there are a lot of weird validations to do on the uploaded cookbooks. Hopefully it can be worked around easily enough, but it's a concern.

The file hashes deletion bit is not in any way my finest work, I'm afraid. It's a very uninspired, barely changed version of what I came up with when it was all in-memory, and I've never gotten around to fixing it. Unfortunately, as you've noticed, the brute-force approach I had to resort to initially stuck around and left all kinds of horrible bits. Good catch on the S3 deletion.

I realized something just now while I was typing this, though: For postgres, at least, we don't even have to load every single cookbook - the file metadata in the cookbooks is in JSON, and we can use the postgres json operators to select just the filehashes. Take a look in the search code where it does partial searches, and you'll find an example. FWIW, the postgres partial search code is significantly faster than the in-mem or MySQL code.
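
(A sketch of that idea, with the caveat that the column name and JSON layout below are assumptions and goiardi's actual schema will differ; the point is just that jsonb operators let Postgres pull the checksums out server-side instead of goiardi decoding every cookbook version.)

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq"
)

// referencedChecksums asks Postgres to extract every file checksum referenced
// by any cookbook version, assuming a jsonb column "files" holding an array
// of objects with a "checksum" key (hypothetical layout).
func referencedChecksums(db *sql.DB) (map[string]struct{}, error) {
	rows, err := db.Query(`
		SELECT DISTINCT f->>'checksum'
		  FROM cookbook_versions cv,
		       jsonb_array_elements(cv.files) AS f
		 WHERE f->>'checksum' IS NOT NULL`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	seen := make(map[string]struct{})
	for rows.Next() {
		var sum string
		if err := rows.Scan(&sum); err != nil {
			return nil, err
		}
		seen[sum] = struct{}{}
	}
	return seen, rows.Err()
}

func main() {
	db, err := sql.Open("postgres", "dbname=goiardi sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	hashes, err := referencedChecksums(db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d file hashes are still referenced\n", len(hashes))
}

A candidate hash would then only be deleted from S3 or local storage if it is absent from that set.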

Anyway, that sounds good. I'll look at the hash deletion first, I think, but there are also more things around decoding/unmarshalling and the cookbook objects that I want to look at.

alekc commented 4 years ago

I've been able to fix the memory leak (sadly there was a lot of refactoring, and the change cannot be merged into the master branch due to conflicts with not-yet-merged changes, especially S3).

I will try to extract the relevant changes into a separate commit which can be brought into master.

ctdk commented 4 years ago

What was the fix, and can I see the branch with the changes?

alekc commented 4 years ago

@ctdk I've created a pull request. NOTE that the PR should NOT be merged as it stands.

In order to send off that branch, I had to include everything which has not been merged so far. For the scope of the current change, we are interested only in https://github.com/ctdk/goiardi/pull/81/commits/f1600d29cc4969e6c3389e0d04acdbcb1b6c7f1a

As for the results of the change, this should speak for itself: [attached image]

Blue and yellow = non-patched goiardi running against knife upload cookbooks --concurrency 10, with pods constantly killed by OOM. Greenish = a patched pod running knife upload cookbooks --concurrency 30: no OOM, and memory actually goes down sometimes.

Notable Changes: