CNES / openbach

Open Metrology Testing framework: Internet network and service testing
http://www.openbach.org
GNU General Public License v3.0

Ansible Error while trying to Reattach OpenBach Agents #2

Open godelc7 opened 1 year ago

godelc7 commented 1 year ago

I got an Ansible error (see below) when trying to reattach agents after detaching them using the auditorium scripts. Detaching works fine, but reattaching always fails.

[My OpenBach Topology]:

[CLI Command]: python3 /usr/local/lib/python3.8/dist-packages/auditorium_scripts/install_agent.py 192.168.1.211 192.168.1.210 TrafficGenerator1 --username --password --controller 192.168.1.210 --reattach

[ERROR]: Ansible playbook execution failed (returncode 422) on 192.168.1.211:

    The conditional check '{{ item.remove }}' failed. The error was: error while evaluating conditional ({{ item.remove }}): 'dict object' has no attribute 'remove'

    The error appears to be in '/opt/openbach/controller/ansible/push_files.yml': line 26, column 7, but may be elsewhere in the file depending on the exact syntax problem.

    The offending line appears to be:

        - name: Remove file on source
          ^ here

Kniyl commented 1 year ago

Hi,

Are you able to edit the /opt/openbach/controller/ansible/{push,pull}_files.yml files on your controller and change line 32 (resp. 23) from when: "{{ item.remove }}" to when: "{{ item.remove | default(False) }}"?
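For reference, the change would look like this in context (a sketch only; line numbers, surrounding tasks, and the module body may differ between OpenBACH versions):

```yaml
# /opt/openbach/controller/ansible/push_files.yml (excerpt; module body omitted)
- name: Remove file on source
  # before: when: "{{ item.remove }}"  -- raises an error when an item has no 'remove' key
  # after:  default the missing key to False so the task is simply skipped instead
  when: "{{ item.remove | default(False) }}"
```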

If so, does it fix your issue?

godelc7 commented 1 year ago

It seems to work fine. At least, the agents are now present in the database, even though I still need to add them manually to the project topology. I will run my scenarios to see whether everything still works as expected.

What puzzles me is that, even though the reattaching procedure itself seems to succeed, the CLI still prints a message saying that the Ansible playbook execution failed. Here it is:

[ERROR] {
    'assign_collector': None,
    'log_severity': None,
    'install': {'last_operation_date': '2023-06-06T17:49:43.802Z', 'response': None, 'returncode': 204},
    'uninstall': {
        'last_operation_date': '2023-04-25T13:02:35.073Z',
        'returncode': 422,
        'response': {
            'response': {'192.168.1.211': [{
                'item': 'deb https://raw.githubusercontent.com/CNES/net4sat-packages/master/focal/ focal stable',
                'invocation': {'module_args': {
                    'repo': 'deb https://raw.githubusercontent.com/CNES/net4sat-packages/master/focal/ focal stable',
                    'state': 'absent', 'update_cache': True, 'update_cache_retries': 5,
                    'update_cache_retry_max_delay': 12, 'validate_certs': True,
                    'install_python_apt': True, 'codename': None, 'filename': None, 'mode': None}},
                'msg': 'Failed to update apt cache: unknown reason'},
                {'msg': 'One or more items failed'}]},
            'error': 'Ansible playbook execution failed'}}}
Operation successfull

Kniyl commented 1 year ago

Yes, that's odd, there is clearly an error, as indicated by the 422 return code. "Operation successfull" is misleading here and I will have a look into it.

However, the culprit here is 'msg': 'Failed to update apt cache: unknown reason'. You should have a look at your machine to try and fix it; otherwise it might prevent you from performing other actions as well, such as installing new jobs on the agent.
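From the module_args in the log above, the failing step is the apt_repository task that removes the net4sat repository with the cache update enabled. A minimal standalone reproduction would look something like this (a sketch reconstructed from the log; the task name is illustrative):

```yaml
# Sketch based on the module_args reported in the error above.
# 'update_cache: yes' makes the module run 'apt-get update' afterwards,
# which fails if any configured apt source is unreachable -- hence one
# blocked mirror can break the whole uninstall step.
- name: Remove the net4sat-packages repository and refresh the apt cache
  apt_repository:
    repo: "deb https://raw.githubusercontent.com/CNES/net4sat-packages/master/focal/ focal stable"
    state: absent
    update_cache: yes
```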

godelc7 commented 1 year ago

> Yes, that's odd, there is clearly an error, as indicated by the 422 return code. "Operation successfull" is misleading here and I will have a look into it.

"Operation successfull" is indeed misleading here, but only partly. The agents did get reattached: I can rebuild my entire topology with these agents and run other tests as well. The problem is that some critical services, like time synchronization (NTP) with the controller, are not restored after the reattachment. Ever since I detached and reattached the agents, I have been getting very strange results from my tests, and I have been investigating the cause in many different directions. Only today did I figure out that the missing time synchronization with the controller is the root cause of my wrong results. In sum, detaching agents works properly and reliably, but reattaching only works partly, and the part that is not working is, in my opinion, very critical. I therefore propose that this be treated as a critical bug.
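One quick way to confirm this symptom is to check each agent's time sources against the controller. A hypothetical check (assuming the agents run the classic ntpd daemon, hence 'ntpq -p'; systems using chrony or systemd-timesyncd would need 'chronyc sources' or 'timedatectl' instead, and 'agents' is a placeholder inventory group):

```yaml
# Sketch: verify that each agent lists the controller (192.168.1.210,
# from the install_agent.py command above) among its NTP peers.
- hosts: agents
  tasks:
    - name: Query NTP peers on the agent
      command: ntpq -p
      register: ntp_peers
      changed_when: false

    - name: Fail if the controller is not listed as a time source
      assert:
        that: "'192.168.1.210' in ntp_peers.stdout"
        fail_msg: Agent is not synchronizing time with the controller
```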

> However, the culprit here is 'msg': 'Failed to update apt cache: unknown reason'. You should have a look at your machine to try and fix it; otherwise it might prevent you from performing other actions as well, such as installing new jobs on the agent.

I'm aware of this problem with APT on my machines. The reason is that I'm working in a very restricted environment: many APT mirrors are blocked by the company, but the few that are allowed are sufficient for my daily work. And up to now, I have been able to install every job I needed on the agents.