rhjcd closed this issue 3 years ago.
@rhjcd how often is this coming up? is this common or a one-off? Is it coming up for more than one type of workshop?
it's been brought to my personal attention 4 times through ServiceNow tickets which are unfortunately now archived.
should be solved with https://github.com/ansible/workshops/pull/1193, please re-open if this persists
Hi, @IPvSean - would the following INC I received today from a student qualify for this? Student lab info is for title: PROD_ANSIBLE_DEMOS-555c in the deployer.
Note from Student:
The nodes are under-resourced, and an OpenSCAP scan cannot be completed without being terminated by the kernel's oom-kill process.
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-11.scope,task=oscap,pid=21068,uid=0
kernel: Out of memory: Killed process 21068 (oscap) total-vm:1787452kB, anon-rss:372460kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1292kB oom_score_adj:0
kernel: oom_reaper: reaped process 21068 (oscap), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
@IPvSean, I know that this is now closed but I want to provide you with a response to your question. I have attempted to use this demo about four times within ~2 months now, unsuccessfully. The scan does start but returns Killed, with the above logs in /var/log/messages.
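For anyone hitting the same symptom, a quick way to confirm that the kernel OOM killer is what terminated the scan is to search the system log on the affected node and watch memory while the scan runs (a minimal sketch, assuming default RHEL log locations):
$ sudo grep -E 'oom-kill|Out of memory|oom_reaper' /var/log/messages
# or, on hosts where the journal is the primary log:
$ sudo journalctl -k | grep -iE 'oom-kill|out of memory'
# watch free memory and swap while oscap runs
$ free -m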
Reopening to have a place to document OOM issues on RHEL workshops. Changes to instance types as requested will affect operating budget. Thanks for the help!
We have another user report that seems to be related to this as well: TASK0987679
During the last Hands-on session (May 2021), some VMs became unresponsive and we were not able to reboot them.
During preparation for this event, I also noticed that my VMs hung from time to time. I found out that the VMs were running « wcgrid_opn1_aut » processes that were using all of the CPU.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19855 boinc 39 19 166624 75096 2416 R 99.3 9.4 0:04.55 wcgrid_opn1_aut
19857 boinc 39 19 165948 74592 2416 R 99.3 9.3 0:03.55 wcgrid_opn1_aut
It looks like these processes are related to the BOINC project: https://fr.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing
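For anyone triaging a hung VM, a minimal sketch for confirming and stopping the grid client on an affected node (assuming the boinc-client systemd unit name that the BOINC packages install):
# confirm the wcgrid_* work units are owned by the boinc user
$ top -b -n 1 | grep -E 'wcgrid|boinc'
# stop the client and keep it from starting again after a reboot
$ sudo systemctl stop boinc-client
$ sudo systemctl disable boinc-client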
@cloin - how are BOINC processes running on the infrastructure? I find that quite unusual, especially as it would likely cause a large increase in CPU, memory, and disk utilization - I have worked at orgs that ran this on each person's workstation, and I recall the issues we had whenever there was high load on the desktop.
Another one from the EMEA regions: TASK0987385
I am delivering an ansible workshop for a large customer later today. I have deployed lab with GUID 22ab. Deployment looked successful.
I noticed that student17-node3 and student45-node3, with respective IPs *.*.*.* and *.*.*.*, are not responding to a simple ssh connection, and I have no way to restart those instances.
Can you please assist and, if you have a ready script, run a sanity check on the whole environment?
I saw a similar issue today for the RHEL Automation workshop... 4 to 5 students experienced losing connectivity with their instances around exercises 1.4 to 1.5, without doing anything particularly intensive. I logged into a lab environment and worked through the lab exercises; I lost connectivity on exercise 1.5 during the ftp deploy playbook while it was running the fact scans. After that, node3 was unreachable and took 40+ minutes until it eventually came back. I did notice that the instances for the RHEL nodes were t3.micro instances, and Amazon does document this type of behavior...
If the system logs don't contain disk full errors, view the CPUUtilization metric for your instance. If the CPUUtilization metric is at or near 100%, the instance might not have enough compute capacity for the kernel to run.
Run a bunch of these burstable instances and chew through CPU credits quickly, and this is the type of behavior you're going to see. They aren't meant to be used for interactive workloads where they are all busy at exactly the same time... like during a workshop, for example.
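To check whether credit exhaustion is what the nodes are hitting, the burstable-instance metrics can be pulled from CloudWatch; a rough sketch (the instance ID and time window below are placeholders):
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2021-05-17T00:00:00Z --end-time 2021-05-17T06:00:00Z \
    --period 300 --statistics Average
# a balance at or near zero during the workshop window matches the throttling behavior described above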
Note that several months back, a few of us ran into behavior similar to what has been reported in some of the earlier comments here about running OpenSCAP scans... we were originally using t3.micro instances for the RHEL7 and CentOS7 nodes while building out the Smart Management workshop content, but they kept crashing during OpenSCAP scans and Convert2RHEL runs, so we moved to t2.medium instance sizes and, after that, no more crashes.
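For reference, if redeploying on a larger size isn't an option, an already-provisioned node can be resized out of band; a hedged sketch (the instance must be stopped first, and the instance ID is a placeholder):
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
$ aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
$ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type '{"Value": "t2.medium"}'
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0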
@9strands re: BOINC: this should be disabled when a student begins interacting with the environment. Additionally, there should be a job template that allows for disabling BOINC.
I will make a release later today setting community_grid (BOINC) to false and monitor this issue.
FYI, this is what I'm talking about: https://github.com/ansible/workshops/tree/devel/provisioner#ibm-community-grid
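For anyone provisioning workshops before that release lands, the grid client can be turned off per deployment with the variable documented in that README; a sketch assuming the standard provisioner invocation from this repo:
# add community_grid: false to the extra-vars file, or override it on the CLI
$ ansible-playbook provision_lab.yml -e @extra_vars.yml -e "community_grid=false"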
However, community_grid is not on the control node, so it is not going to affect memory conditions on the tower node. Any reference to nodes other than control nodes should be a separate issue. Although, @heatmiser @9strands, my previous comment might resolve the issues you've seen on the inventory nodes.
@cloin In the case of the Smart Management workshop, we were already disabling community_grid for our test deployments as we worked through building out the workshop content. Apologies, I now understand that this issue is for the control node...I mistook the conversation about OpenSCAP scans failing to mean that this issue was trying to address all nodes. Do you recommend that @9strands and I open a separate issue for resource contention on the inventory nodes, primarily the RHEL/CentOS nodes?
@heatmiser @9strands yeah, let's create a separate issue for that, please. Looks like most of the reports here are for the inventory nodes and not the control nodes.
Closing. Please reopen for specific nodes (control nodes and inventory nodes use different instance types and would need to be adjusted separately if there are recurring issues).
@cloin Even if you try to stop the boinc-client, some of the nodes are not responsive... they become sporadically available, and I didn't spend the time to understand why. This week alone I had 3 labs, and the only way to reach a stable state with the setup was to disable boinc-client on all nodes (i.e. node1, node2, node3). If it really needs to be there, I would suggest having it disabled by default.
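As a stopgap before a fixed release is deployed, BOINC can be switched off across the managed nodes from the control node with an ad-hoc run against the lab inventory (the host pattern and inventory path below are assumptions based on the lab layout described in this thread):
# stop and disable boinc-client on node1, node2, node3 in one pass
$ ansible 'node*' -i ~/lab_inventory/hosts -b \
    -m ansible.builtin.service -a "name=boinc-client state=stopped enabled=no"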
@idhaoui this was done in #1200 and released to master in #1201
Problem Summary
Per email:
[] we have received the below request from a trainer.
I conducted a workshop today of Ansible for Linux Administration for 25 students. The instance hosting the Ansible control node, tower, and vscode (all the same node) experienced oom-killer outages. This was disruptive to the students' experience. Can we get the resources increased for the Ansible control node? Currently, it's set at 2 CPUs and 4 GB of RAM.
Issue Type
Bug
Extra vars file
vars
Ansible Playbook Output
https://gist.github.com/rhjcd/9f9e7b27b1fa39304663d1b015207e1e
Ansible Version
$ ansible --version
ansible 2.9.21
  config file = /home/student7/.ansible.cfg
  configured module search path = ['/home/student7/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.6/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.6.8 (default, Mar 18 2021, 08:58:41) [GCC 8.4.1 20200928 (Red Hat 8.4.1-1)]
Ansible Configuration
DEFAULT_HOST_LIST(/home/student7/.ansible.cfg) = ['/home/student7/lab_inventory/hosts']
DEFAULT_STDOUT_CALLBACK(/home/student7/.ansible.cfg) = yaml
DEFAULT_TIMEOUT(/home/student7/.ansible.cfg) = 60
DEPRECATION_WARNINGS(/home/student7/.ansible.cfg) = False
HOST_KEY_CHECKING(/home/student7/.ansible.cfg) = False
PERSISTENT_COMMAND_TIMEOUT(/home/student7/.ansible.cfg) = 200
PERSISTENT_CONNECT_TIMEOUT(/home/student7/.ansible.cfg) = 200
RETRY_FILES_ENABLED(/home/student7/.ansible.cfg) = False
Ansible Execution Node
Ansible Controller (previously known as Ansible Tower)
Operating System
rhpds provision for Ansible Workshop