chef / chef-server

Chef Infra Server is a hub for configuration data, storing cookbooks, node policies, and metadata of managed nodes.
https://www.chef.io/chef/
Apache License 2.0

Intermittent issue with Chef server 12 knife cookbook upload #149

Open · yinym opened this issue 9 years ago

yinym commented 9 years ago

Version

chef-server-core-12.0.5-1.el6.x86_64
chef-12.0.3-1.el6.x86_64

Environment

RHEL 6.6 x86_64

Summary

After I successfully installed and configured the new Chef server 12, I tried to upload a batch of cookbooks to it, but hit an intermittent upload failure.

# /opt/chef/bin/knife cookbook upload apache2 apt aws build-essential ceph chef_handler chef-sugar cpu database dmg erlang git haproxy homebrew
Uploading apache2      [3.0.0]
Uploading apt          [2.6.1]
Uploading aws          [1.0.0]
Uploading build-essential [1.4.2]
Uploading ceph         [0.8.0]
Uploading chef_handler [1.1.5]
Uploading chef-sugar   [2.5.0]
Uploading cpu          [0.2.0]
Uploading database     [4.0.2]
ERROR: You authenticated successfully to https://binbin-chef.localdomain:443/organizations/ibm as os-software-management but you are not authorized for this action
Response:  missing create permission

This is clearly a server-side issue, because several cookbooks had already uploaded successfully with the same credentials. It also appears intermittently: sometimes the upload fails at the third cookbook, sometimes at the last one. After retrying a few times, all the cookbooks upload successfully.
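
As a stopgap, a small retry wrapper along these lines (a rough sketch, not part of the original report, reusing the cookbook list and knife path from above) gets everything uploaded eventually:

# Rough sketch: retry each upload a few times, since re-running the
# same command eventually succeeds.
for cb in apache2 apt aws build-essential ceph chef_handler chef-sugar cpu database dmg erlang git haproxy homebrew; do
  for attempt in 1 2 3; do
    /opt/chef/bin/knife cookbook upload "$cb" && break
    echo "upload of $cb failed (attempt $attempt), retrying..." >&2
    sleep 5
  done
done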

Details

I dug into the log files and noticed the following. In /var/log/opscode/nginx/access.log, at the time of the failure, it reported:

"POST /organizations/ppp/sandboxes HTTP/1.1" 403 "2.877" 39 "-" "Chef Knife/12.0.3 (ruby-2.1.4-p265; ohai-8.0.1; x86_64-linux; +http://opscode.com)" "127.0.0.1:8000" "403" "2.875" "12.0.3" "algorithm=sha1;version=1.0;" "os-software-management" "2015-04-09T03:21:07Z" "Z16jPrg9oK6vgxkDl+IPAREhbdY=" 1693

At the same time, in /var/log/opscode/opscode-erchef/current:

[error] {<<"method=POST; path=/organizations; status=500; ">>,{error,{case_clause,{error,server_error}},[{oc_chef_authz,create_entity_if_authorized,4,[{file,"src/oc_chef_authz.erl"},{line,139}]},{oc_chef_wm_base,do_create_in_container,4,[{file,"src/oc_chef_wm_base.erl"},{line,274}]},{webmachine_resource,resource_call,3,[{file,"src/webmachine_resource.erl"},{line,186}]},{webmachine_resource,do,3,[{file,"src/webmachine_resource.erl"},{line,142}]},{webmachine_decision_core,resource_call,1,[{file,"src/webmachine_decision_core.erl"},{line,48}]},{webmachine_decision_core,decision,1,[{file,"src/webmachine_decision_core.erl"},{line,225}]},{webmachine_decision_core,handle_request,2,[{file,"src/webmachine_decision_core.erl"},{line,33}]},{webmachine_mochiweb,loop,2,[{file,"src/webmachine_mochiweb.erl"},{line,72}]}]}}
[error] Unable to read ACL for newly created sandbox: <<"cccc636437f6b5eee1070628579486ad">>
[error] {<<"method=POST; path=/organizations/ppp/sandboxes; status=403; ">>,"Forbidden"}
[error] Error setting ACE {authz_ace,[<<"302d9d7b8965c2655c81ee7f2fb62599">>,<<"81abaf8f7ae0d80ff013a9c079854048">>],[<<"776b85b0df0c2066d5056df11b8c4d85">>]} for method update on object <<"cccc636437f6b5eee1070628579486ad">> for requestor <<"81abaf8f7ae0d80ff013a9c079854048">>: req_timedout
[error] Unable to read ACL for newly created sandbox: <<"cccc636437f6b5eee1070628579486ad">>

In the "Error setting ACE ..." entry, the request actually timed out: req_timedout.

Checking the oc_bifrost log /var/log/opscode/oc_bifrost/requests.log.1 for the ACL processing, an interesting entry showed up:

oc_bifrost@127.0.0.1 method=PUT; path=/objects/cccc636437f6b5eee1070628579486ad/acl/read; status=200; requestor_id=81abaf8f7ae0d80ff013a9c079854048; req_time=9; rdbms.bifrost_db.has_permission_time=2; rdbms.bifrost_db.has_permission_count=1; rdbms.bifrost_db.exists_time=1; rdbms.bifrost_db.exists_count=1; rdbms.bifrost_db.update_acl_time=3; rdbms.bifrost_db.update_acl_count=1; 
oc_bifrost@127.0.0.1 method=PUT; path=/objects/cccc636437f6b5eee1070628579486ad/acl/update; status=200; requestor_id=81abaf8f7ae0d80ff013a9c079854048; req_time=2470; rdbms.bifrost_db.has_permission_time=1; rdbms.bifrost_db.has_permission_count=1; rdbms.bifrost_db.exists_time=0; rdbms.bifrost_db.exists_count=1; rdbms.bifrost_db.update_acl_time=2467; rdbms.bifrost_db.update_acl_count=1; 
oc_bifrost@127.0.0.1 method=GET; path=/containers/b02d710e4ecf1bf869dd573bc70a0e53/acl/read/actors/81abaf8f7ae0d80ff013a9c079854048; status=200; requestor_id=81abaf8f7ae0d80ff013a9c079854048; req_time=5; rdbms.bifrost_db.has_permission_time=2; rdbms.bifrost_db.has_permission_count=1; rdbms.bifrost_db.exists_time=1; rdbms.bifrost_db.exists_count=1; 

Notice the req_time and rdbms.bifrost_db.update_acl_time in the second log line: they are several hundred times larger than the others. This is probably the cause of the issue: the database operation takes too long, and erchef treats the request as timed out.
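
A quick way to spot these slow requests (just a sketch; the log path is the one shown above) is to grep for req_time values of four or more digits, i.e. one second or longer:

# Flag bifrost requests that took >= 1000 ms
grep -E 'req_time=[0-9]{4,};' /var/log/opscode/oc_bifrost/requests.log*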

This issue does not happen on every system; I am not sure whether it is related to the environment configuration.

kramvan1 commented 9 years ago

Just a quick update: we are now on RHEL 7.1 and still seeing this issue.

tomblich-okta commented 8 years ago

Have you had any resolution on this? I am in the process of tracking down similar issues with knife upload to Chef. We resolved ours by moving from standard EBS volumes to gp2 volumes. One thing I did notice is that our oc_bifrost requests.log is not being written to at all.

PhilGrayson commented 8 years ago

I saw this issue recently. There were a bunch of errors across the Chef services that all pointed to timeouts.

Note: I've removed parts of these errors for legibility.

# In opscode-erchef/current
[error] Error setting ACE {authz_ace,[...]} for method delete on object ... for requestor ...: req_timedout

# in nginx/error.log
[error] [lua] config.lua:60: connect_redis(): failed to authenticate to redis: timeout, ...

# in oc_bifrost/crash.log
{<<"method=GET; path=...; status=500; ">>,{error,{error,{case_clause,{error,timeout}},[{bifrost_wm_acl_member_resource,to_json ...

It turns out our Chef servers (running in a tiered setup) were sitting on highly contended hypervisors.

The redis-cli tool helped observe this problem.

# /opt/opscode/embedded/bin/redis-cli --intrinsic-latency 200
...
Max latency so far: 2450172 microseconds

2.4 seconds, yikes!

Upping various timeout options resolved the problem for us until we could move to a less contended environment.

# /etc/opscode/chef-server.rb
opscode_erchef['authz_timeout'] = 5000
oc_chef_authz['ibrowse_options'] = '[{connect_timeout, 5000}]'
lb['redis_connection_timeout'] = 5000
lb['redis_keepalive_timeout'] = 5000
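
For anyone applying the same settings, remember they only take effect after a reconfigure:

# Pick up the /etc/opscode/chef-server.rb changes
chef-server-ctl reconfigure
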
hfinucane commented 8 years ago

I've been running into this relatively frequently, as part of an everything-in-CI project. It seems like the fix should be "internal timeouts should be reported to the user as 500 errors, not 403 errors, which is a confusing lie".

dwradcliffe commented 7 years ago

I know this is old, but we're seeing the same thing.

PrajaktaPurohit commented 4 years ago

TODO: Start by looking at the sandbox error-handling code. Make sure the expected error response is sent to the user.