ansible / workshops

Training Course for Ansible Automation Platform
MIT License

RHPDS - PROD_ANSIBLE_WORKSHOPS: insufficient resources #1183

Closed rhjcd closed 3 years ago

rhjcd commented 3 years ago

Problem Summary

Per email:

[…] we have received the below request from a trainer.

I conducted a workshop today of Ansible for Linux Administration for 25 students. The instance type for the Ansible control node, tower, and vscode (all the same node) experienced oom-killer outages. This was disruptive to the students' experience. Can we get the resources increased for the Ansible control node? Currently, it's set at 2 CPUs and 4 GB RAM.

Issue Type

Bug

Extra vars file

vars

Ansible Playbook Output

https://gist.github.com/rhjcd/9f9e7b27b1fa39304663d1b015207e1e

Ansible Version

$ ansible --version
ansible 2.9.21
  config file = /home/student7/.ansible.cfg
  configured module search path = ['/home/student7/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.6/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.6.8 (default, Mar 18 2021, 08:58:41) [GCC 8.4.1 20200928 (Red Hat 8.4.1-1)]

Ansible Configuration

DEFAULT_HOST_LIST(/home/student7/.ansible.cfg) = ['/home/student7/lab_inventory/hosts']
DEFAULT_STDOUT_CALLBACK(/home/student7/.ansible.cfg) = yaml
DEFAULT_TIMEOUT(/home/student7/.ansible.cfg) = 60
DEPRECATION_WARNINGS(/home/student7/.ansible.cfg) = False
HOST_KEY_CHECKING(/home/student7/.ansible.cfg) = False
PERSISTENT_COMMAND_TIMEOUT(/home/student7/.ansible.cfg) = 200
PERSISTENT_CONNECT_TIMEOUT(/home/student7/.ansible.cfg) = 200
RETRY_FILES_ENABLED(/home/student7/.ansible.cfg) = False

Ansible Execution Node

Ansible Controller (previously known as Ansible Tower)

Operating System

rhpds provision for Ansible Workshop

IPvSean commented 3 years ago

@rhjcd how often is this coming up? Is this common or a one-off? Is it coming up for more than one type of workshop?

rhjcd commented 3 years ago

It's been brought to my personal attention 4 times through ServiceNow tickets, which are unfortunately now archived.

IPvSean commented 3 years ago

should be solved with https://github.com/ansible/workshops/pull/1193, please re-open if this persists

9strands commented 3 years ago

Hi, @IPvSean - would the following INC I received today from a student qualify for this? Student lab info is for title: PROD_ANSIBLE_DEMOS-555c in the deployer.

Note from Student: The nodes are under-resourced and an OpenSCAP scan cannot be completed without being terminated by process oom-kill from the kernel.

kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-11.scope,task=oscap,pid=21068,uid=0
kernel: Out of memory: Killed process 21068 (oscap) total-vm:1787452kB, anon-rss:372460kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1292kB oom_score_adj:0
kernel: oom_reaper: reaped process 21068 (oscap), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
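
For triage on a node like this, a quick check can confirm whether the kernel OOM killer fired. A minimal sketch, assuming `journalctl` on systemd hosts, with `/var/log/messages` as a fallback on older setups:

```shell
# Look for kernel OOM-kill events in either log source; exits 0 either way.
{ journalctl -k --no-pager 2>/dev/null; cat /var/log/messages 2>/dev/null; } \
    | grep -iE 'oom-kill|Out of memory|oom_reaper' \
    || echo "no OOM events found"
```

The pattern matches all three kernel message forms shown above (the constraint line, the "Out of memory: Killed process" line, and the oom_reaper line).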
trevorbryant commented 3 years ago

@IPvSean, I know that this is now closed but I want to provide you with a response to your question. I have attempted to use this demo about four times within ~2 months now, unsuccessfully. The scan does initiate but returns with Killed, with the above logs in /var/log/messages.

cloin commented 3 years ago

Reopening to have a place to document OOM issues on RHEL workshops. Changes to instance types as requested will affect operating budget. Thanks for the help!

9strands commented 3 years ago

We have another user report that seems to be related to this as well: TASK0987679

During the last Hands-on session (May 2021), some VMs became unresponsive and we were not able to reboot them.

During this event preparation, I also noticed that my VMs hung from time to time. I found out that the VMs were running "wcgrid_opn1_aut" processes that used all the CPU.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                   
  19855 boinc     39  19  166624  75096   2416 R  99.3   9.4   0:04.55 wcgrid_opn1_aut                                                                                                                                            
  19857 boinc     39  19  165948  74592   2416 R  99.3   9.3   0:03.55 wcgrid_opn1_aut        

It looks like these processes are related to the BOINC project: https://fr.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing

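A minimal triage sketch for a node pegged by BOINC work units: list the top CPU consumers, then stop and disable the client. This assumes the standard boinc-client systemd unit, and it is guarded so it is a no-op on hosts without it:

```shell
# List the top CPU consumers; wcgrid_* BOINC work units should top the list.
if command -v ps >/dev/null 2>&1; then
    ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -n 5
fi

# Stop the BOINC client and keep it from restarting on reboot; guarded so
# the snippet does nothing on hosts without systemd or the boinc-client unit.
if command -v systemctl >/dev/null 2>&1; then
    sudo systemctl disable --now boinc-client 2>/dev/null || true
fi
```
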
9strands commented 3 years ago

@cloin - how are BOINC processes running on the infrastructure? I find that quite unusual, especially as that would cause a likely large increase in CPU, memory, and disk utilization - I have worked at orgs which had this on each person's workstation and I recall the issues that we had whenever there was high load on the desktop.

9strands commented 3 years ago

Another one from the EMEA regions: TASK0987385

I am delivering an Ansible workshop for a large customer later today. I have deployed a lab with GUID 22ab. The deployment looked successful.

I noticed that student17-node3 and student45-node3, with respective IPs *.*.*.* and *.*.*.*, are not responding to simple ssh connections, and I have no way to restart those instances.

Can you please assist, and possibly run a sanity check on the whole environment if you have a ready script?

heatmiser commented 3 years ago

I saw a similar issue today for the RHEL Automation workshop... 4 to 5 students experienced losing connectivity with their instances around exercises 1.4 to 1.5, without doing anything really intensive. I logged into a lab env and worked through the lab exercises; I lost connectivity on exercise 1.5 during the ftp deploy playbook while it was running the fact scans. After that, node3 was unreachable and it took 40+ minutes until it eventually came back. I did notice that the instances for the RHEL nodes were t3.micro instances, where Amazon does document this type of behavior:

If the system logs don't contain disk full errors, view the CPUUtilization metric for your instance. If the CPUUtilization metric is at or near 100%, the instance might not have enough compute capacity for the kernel to run.

If you run a bunch of these burstable instances and chew through CPU credits quickly, this is the type of behavior you're going to see. They aren't meant to be used for interactive workloads where they are all being used at exactly the same time... like during a workshop, for example.

Note that several months back, a few of us ran into similar behavior to what has been reported in some earlier comments here about running OpenSCAP scans... we were originally using t3.micro instances for the RHEL7 and CentOS7 nodes while building out the Smart Management workshop content. However, they kept crashing during OpenSCAP scans and Convert2RHEL runs, so we moved to t2.medium instance sizes, and after that, no more crashes.
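
The CPU-credit theory can be checked against CloudWatch. A sketch, assuming the AWS CLI is configured with credentials for the workshop account; `INSTANCE_ID` is a placeholder:

```shell
# Query the CPUCreditBalance metric for a burstable (t2/t3) instance over
# the last hour. INSTANCE_ID is a placeholder, not a real workshop instance.
INSTANCE_ID="i-0123456789abcdef0"
START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
if command -v aws >/dev/null 2>&1; then
    aws cloudwatch get-metric-statistics \
        --namespace AWS/EC2 --metric-name CPUCreditBalance \
        --dimensions Name=InstanceId,Value="$INSTANCE_ID" \
        --start-time "$START" --end-time "$END" \
        --period 300 --statistics Average
else
    echo "aws CLI not found; would query CPUCreditBalance for $INSTANCE_ID"
fi
```

A balance at or near zero during the workshop window would line up with the throttling behavior described above.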

cloin commented 3 years ago

@9strands re: BOINC: this should be disabled when a student begins interacting with the environment. Additionally, there should be a job template that allows for disabling BOINC.

cloin commented 3 years ago

I will make a release later today setting community_grid (BOINC) to false and monitor this issue.

FYI, this is what I'm talking about: https://github.com/ansible/workshops/tree/devel/provisioner#ibm-community-grid

cloin commented 3 years ago

However, community_grid is not on the control node, so it's not going to affect memory conditions on tower. Any reports about nodes other than control nodes should be separate issues. Although, @heatmiser @9strands, my previous comment might resolve the issues you've seen on the inventory nodes.

heatmiser commented 3 years ago

@cloin In the case of the Smart Management workshop, we were already disabling community_grid for our test deployments as we worked through building out the workshop content. Apologies, I now understand that this issue is for the control node...I mistook the conversation about OpenSCAP scans failing to mean that this issue was trying to address all nodes. Do you recommend that @9strands and I open a separate issue for resource contention for inventory nodes, primarily RHEL/CentOS nodes?

cloin commented 3 years ago

@heatmiser @9strands yeah, let's create a separate issue for that, please. Looks like most of the reports here are for the inventory nodes and not the control nodes.

Closing. Please reopen for specific nodes (control nodes and inventory nodes use different instance types and would need to be adjusted separately if there are recurring issues).

idhaoui commented 3 years ago

@cloin Even if you try to stop the boinc-client, some of the nodes are not responsive... they become sporadically available, and I didn't spend the time to understand why. This week alone I had 3 labs, and the only way to reach a stable state with the setup was by disabling boinc-client on all nodes (aka node1, node2, node3). If it really needs to be there, I would suggest having it disabled by default.

https://github.com/ansible/workshops/issues/1198
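
The workaround described above (disabling boinc-client on node1, node2, and node3) could be sketched as a short playbook; the `nodes` host pattern is an assumption about the workshop inventory, not the actual group name:

```yaml
---
# Sketch only: stop and disable BOINC on the managed nodes.
# "nodes" is an assumed inventory group covering node1/node2/node3.
- name: Disable BOINC client on workshop nodes
  hosts: nodes
  become: true
  tasks:
    - name: Stop and disable the boinc-client service
      service:
        name: boinc-client
        state: stopped
        enabled: false
```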

cloin commented 3 years ago

@idhaoui this was done in #1200 and released to master in #1201