ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

ocf:heartbeat:nginx cannot restart after SIGKILL sent to nginx master #1553

Open SpitchAG opened 4 years ago

SpitchAG commented 4 years ago

I think something is not right in this agent: if you send SIGKILL to the nginx master process, the worker processes stay around and one of them ends up listening on the configured listen port, which prevents a new nginx master from starting (bind error, address already in use).

Using the reuseport directive allows the new master to start, but then the old workers are leaked.

A quick workaround would be to fence the host on nginx start failure, but hey, if this can be avoided ...

SpitchAG commented 4 years ago

In stop_nginx there is some code that tries to kill the remaining processes, but the pgrep -f call is a bit awkward: it doesn't seem to match anything, because the workers are not started with the full command-line arguments.
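A minimal sketch of the mismatch, assuming nginx rewrites its process titles to "nginx: master process ..." and "nginx: worker process" as it commonly does. The helper name list_nginx_workers and the sample PIDs and paths are hypothetical and not part of the agent; the helper reads `ps -eo pid=,args=` style output on stdin so it can be tried against canned data:

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the resource agent.
# Reads "PID ARGS..." lines (as printed by `ps -eo pid=,args=`) on stdin
# and prints the PID of every process whose title starts with
# "nginx: worker process".
list_nginx_workers() {
    awk '$2 == "nginx:" && $3 == "worker" && $4 == "process" { print $1 }'
}

# Canned sample data (made-up PIDs and paths), shaped like real ps output.
sample_ps_output() {
    cat <<'EOF'
 1201 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
 1202 nginx: worker process
 1203 nginx: worker process
 1300 sshd: /usr/sbin/sshd -D
EOF
}

# Matching on the full command line only ever finds the master ...
sample_ps_output | grep -c -- '/usr/sbin/nginx -c /etc/nginx/nginx.conf'   # prints 1

# ... while matching on the process title finds the workers.
sample_ps_output | list_nginx_workers   # prints 1202 and 1203
```

On a live host the same filter could be fed with `ps -eo pid=,args= | list_nginx_workers`. Note that it would match the workers of every nginx instance on the machine, not only the one managed by this resource.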

I did a quick workaround in stop_nginx: look for any nginx process listening on PORT (if PORT is provided in the crm resource config). If such a process exists, I kill it and look again, until no worker is listening on the port. If the loop never exits, the stop operation will time out and the node will eventually be fenced; maybe that's acceptable. This also assumes netstat is installed.

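    if [ -n "$PORT" ]; then
        while true; do
            # PID(s) of any nginx process still listening on $PORT.
            # Double quotes so that $PORT is expanded; the trailing space
            # keeps e.g. port 80 from also matching 8080.
            pid=$(netstat -pnlt | grep ":$PORT " | grep nginx | awk '{ print $7 }' | awk -F/ '{ print $1 }')
            if [ -n "$pid" ]; then
                ocf_log warn "killing WORKER PID $pid"
                # $pid is left unquoted on purpose: it may hold several PIDs
                kill $pid
                sleep 1
            else
                break
            fi
        done
    fi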

This seems to work fine in my setup; I don't know whether there are use cases where it won't work.
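For illustration, the port lookup used in the workaround can be sketched as a standalone filter that is easy to try without a running nginx. This is only a sketch under assumptions: the function name listener_pids and the sample data are hypothetical, and the column layout is assumed to be that of `netstat -pnlt` on Linux (local address in column 4, state in column 6, PID/Program name in column 7):

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the resource agent.
# Reads `netstat -pnlt` output on stdin and prints the unique PIDs of
# nginx processes listening on the TCP port given as $1.
listener_pids() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # The port is whatever follows the last colon of the local
            # address, which also covers IPv6 forms such as ":::80".
            n = split($4, addr, ":")
            # Column 7 looks like "1202/nginx:" or "1202/nginx".
            split($7, owner, "/")
            if (addr[n] == port && owner[2] ~ /^nginx/ && !seen[owner[1]]++)
                print owner[1]
        }'
}

# Canned sample data (made-up PIDs), shaped like netstat -pnlt output.
sample_netstat_output() {
    cat <<'EOF'
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1202/nginx: worker
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      2210/java
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      811/sshd
tcp6       0      0 :::80                   :::*                    LISTEN      1202/nginx: worker
EOF
}

sample_netstat_output | listener_pids 80     # prints 1202 (once)
sample_netstat_output | listener_pids 8080   # prints nothing: not nginx
```

Comparing the port field exactly avoids the substring problem of a plain grep, where port 80 would also match 8080. On a live host the filter would be fed with `netstat -pnlt | listener_pids "$PORT"`. Hosts without netstat would need the parsing adapted to another tool's output format.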