banzaicloud / cloudinfo

Cloud instance type and price information as a service
https://banzaicloud.com/cloudinfo
Apache License 2.0

Helm chart with redis fails due to 'failed to set key to value' #336

Closed mrsheepuk closed 4 years ago

mrsheepuk commented 4 years ago

Describe the bug Installing cloudinfo using the helm chart at 0.7.1 with the default cloudinfo image version (0.9.15) fails to produce a working environment when redis is enabled in the chart: the scraper cannot write to redis, logging an error like the one below for every value it tries to write:

{"application":"cloudinfo","cistore":"redis","environment":"production","key":"/banzaicloud.com/cloudinfo/providers/google/regions/asia-northeast3/prices/n1-highcpu-2","level":"error","msg":"failed to set key to value","time":"2020-06-22T09:21:22Z","value":{"onDemandPrice":0.0909891416015625,"spotPrice":{}}}

Steps to reproduce the issue: helm install cloudinftest banzaicloud-stable/cloudinfo -f ./cloudinfo-deploy.yml

... with the following values in cloudinfo-deploy.yml (including correct and valid google credentials):

providers:
  google:
    enabled: true
    credentials: **redacted**
store:
  redis:
    enabled: true

redis:
  enabled: true

  # TODO: support redis password
  usePassword: false
  cluster:
    enabled: false

Expected behavior The scraper works as expected and writes values into redis for the application.

Screenshots n/a

Additional context Tested only with Google Cloud enabled at this time.

mrsheepuk commented 4 years ago

I've tested this by additionally specifying the image tag 0.12.1:

image:
  tag: 0.12.1

... and it appears to work as expected, so it looks like this is specific to the default version in the current helm chart.

mrsheepuk commented 4 years ago

Sorry, the above comment isn't correct - I believed it was working with 0.12.1, but now I'm seeing the same behaviour, so it seems to be transient / timing-related on startup.

Scaling the scraper deployment down to 0 and back to 1 after deploying seems to fix it consistently, so it might be related to #184.

sagikazarmark commented 4 years ago

@mrsheepuk thanks for reporting.

It indeed looks like a timing issue, but I would assume that the scraper component recovers once it's able to connect to redis. Isn't that the case? Or does the helm install process fail entirely?

mrsheepuk commented 4 years ago

Thanks for responding @sagikazarmark - the helm install completes successfully and the pods are all up and running as expected; it's just the logged errors that make it look like it's not working.

I'll double-check whether it actually recovers - I was basing this on what was logged, but it may be that nothing is logged when a write succeeds, so I assumed it wasn't working when in fact it had recovered correctly. I'll report back shortly after confirming this.

sagikazarmark commented 4 years ago

Well, I'm not completely sure TBH. The scraper will scrape the providers at a predefined interval (by default 24 hours), so chances are that if the first scrape fails but the pod doesn't exit, it probably waits for the next cycle, which is not ideal.
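
Roughly, the scrape loop is just a periodic trigger, something like the sketch below (very simplified, not the actual cloudinfo code) - the point being that a failed run isn't retried, it just waits for the next tick:

package main

import (
    "log"
    "time"
)

func main() {
    scrapeInterval := 24 * time.Hour // default renewal interval

    scrape := func() error {
        // scrape the enabled providers and write the results to the store
        return nil
    }

    // initial scrape on startup
    if err := scrape(); err != nil {
        log.Printf("initial scrape failed: %v", err) // logged, but not retried
    }

    // if the first scrape fails, nothing happens until the next tick
    for range time.Tick(scrapeInterval) {
        if err := scrape(); err != nil {
            log.Printf("scrape failed: %v", err)
        }
    }
}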

Let me know if it recovers.

mrsheepuk commented 4 years ago

> The scraper will scrape the providers at a predefined interval (by default 24 hours), so chances are that if the first scrape fails but the pod doesn't exit, it probably waits for the next cycle, which is not ideal.

That's what I assumed was happening. However, now that I want to reproduce the problem, it is of course working perfectly every time with no errors logged on 0.12.1.

Testing again with the default version, however, I can see that on 0.9.15 it doesn't recover cleanly - everything is 'up and running', but there's no data persisted and available through the front end, and the log indicates it's not trying to re-scrape or re-persist anything. I imagine it would recover after 24 hours, as you suggest.

I'll keep trying to reproduce this on 0.12.1 - it certainly seems to happen much less often on that version.

sagikazarmark commented 4 years ago

There are no changes between 0.9.x and 0.12.x that would "fix" the recovery, so if it works it's probably because of timing (i.e. it takes time to download the new image).

I think the scraper should be resilient enough to retry scraping if the result cannot be saved, or hold it in memory until it can.
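
To illustrate what I mean, a minimal retry helper could look something like this (just a rough sketch, assuming a store write that returns an error):

package store

import (
    "fmt"
    "time"
)

// retryStore retries a single store write a few times with a fixed backoff,
// instead of logging the failure once and dropping the value.
func retryStore(write func() error, attempts int, backoff time.Duration) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = write(); err == nil {
            return nil
        }
        time.Sleep(backoff)
    }
    return fmt.Errorf("store write failed after %d attempts: %w", attempts, err)
}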

mrsheepuk commented 4 years ago

This is the first time I've looked at the cloudinfo code, so I might have something wrong, but it looks to me like any failures writing into redis are logged but essentially ignored - for example (I've added a few comments inline):

func (rps *redisProductStore) set(key string, value interface{}) (interface{}, bool) {
    // snipped code 
    if _, err = conn.Do("SET", key, mJson); err != nil {
        rps.log.Error("failed to set key to value", map[string]interface{}{"key": key, "value": value})
        // this return is not checked in (e.g.) StorePrice, so nothing 
        // other than this log message will know this has failed
        return nil, false
    }

    return mJson, true
}

And the way this is being called:

func (rps *redisProductStore) StorePrice(provider, region, instanceType string, val types.Price) {
    // not checking return value here, so callers of StorePrice won't know this
    // has failed:
    rps.set(rps.getKey(cloudinfo.PriceKeyTemplate, provider, region, instanceType), val)
}

The 'sledgehammer' approach to this specific problem could be something that pings a test value into redis repeatedly on startup, waiting for it to be available/healthy before completing startup - but that wouldn't help if there were a transient problem with the redis instance at some later point in time. A more nuanced approach would require changing all of these StoreXxx methods to return success/failure, so that wherever they are called the caller could retry as needed.
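
To sketch the more nuanced option against a hypothetical, simplified store (using interface{} instead of the real types.Price, and the key format taken from the log above - not a patch against the actual cloudinfo code), the set/StorePrice pair would become something like:

package store

import (
    "encoding/json"
    "fmt"

    "github.com/gomodule/redigo/redis"
)

// hypothetical simplified product store, for illustration only
type redisProductStore struct {
    pool *redis.Pool
}

// set returns the error instead of swallowing it, so callers can react
func (rps *redisProductStore) set(key string, value interface{}) error {
    conn := rps.pool.Get()
    defer conn.Close()

    mJson, err := json.Marshal(value)
    if err != nil {
        return fmt.Errorf("failed to marshal value for key %q: %w", key, err)
    }

    if _, err := conn.Do("SET", key, mJson); err != nil {
        return fmt.Errorf("failed to set key %q: %w", key, err)
    }

    return nil
}

// StorePrice propagates the failure so the scraper can retry or abort
func (rps *redisProductStore) StorePrice(provider, region, instanceType string, val interface{}) error {
    key := fmt.Sprintf("/banzaicloud.com/cloudinfo/providers/%s/regions/%s/prices/%s", provider, region, instanceType)
    return rps.set(key, val)
}

Callers would then decide whether to retry, hold the value in memory, or fail the scrape outright.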

If I get time later, I'll see if I can work out a PR to attempt to address one or other of these options.

mrsheepuk commented 4 years ago

@sagikazarmark I've raised a PR to simply fail on startup if the product store is not available, let me know what you think.
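
Conceptually, the startup check does something along the lines of the sketch below (a simplified illustration, not the exact code from the PR; the redis address and retry settings are made up for illustration):

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/gomodule/redigo/redis"
)

// waitForRedis pings redis a few times on startup and returns an error if it
// never becomes available, instead of silently carrying on without a store.
func waitForRedis(address string, attempts int, delay time.Duration) error {
    var lastErr error
    for i := 0; i < attempts; i++ {
        conn, err := redis.Dial("tcp", address)
        if err == nil {
            _, err = conn.Do("PING")
            conn.Close()
            if err == nil {
                return nil // redis is reachable and responding
            }
        }
        lastErr = err
        time.Sleep(delay)
    }
    return fmt.Errorf("redis at %s not available after %d attempts: %w", address, attempts, lastErr)
}

func main() {
    if err := waitForRedis("cloudinfo-redis-master:6379", 10, 3*time.Second); err != nil {
        log.Fatal(err) // fail fast so the pod restarts and the scrape starts clean
    }
    // ... continue with normal scraper startup ...
}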