akondrahman / IaCTesting

Placeholder for the research study related to IaC testing anti-patterns
3 stars 256 forks source link

Feedback from Reading Group #33

Closed akondrahman closed 3 years ago

akondrahman commented 3 years ago

@Talismanic

We read your ICST paper in reading group today. Please provide me the following items in this issue within 24 hours or so if you can. I want to finalize the paper by integrating all these items, and send it to you on Friday morning:

  1. Can you find an Ansible test script that has an instance of local only test and also is modified in a bug-related commit ?
  2. In the case of performance bugs can provide a detailed example of a bug that occurs due to Ansible variable handling and an an example bug due to lack of caching?
  3. Please describe the fix to the bug For thekubernetes-sigs/kubespray' project, Ansible test scripts were reported to be slow because of pulling Docker images for every build` that we listed for performance?
  4. For style bug, can you please tell the consequence of Style bugs, such as trailing spaces after\texttt{:}' and wrong indentation occurred in an Ansible test script for the OSS rocknsm/rock' project ?
  5. Can you give me the number of unique Ansible test scripts in which at least one category of the three testing pattern appears? If a test script has instances of two/three types of testing patterns, please count it once.
  6. Please explain why the number of commits (2,606) is much smaller than number of Ansible test scripts (4,831). Usually there are more commits than scripts/files.

Please let me know if there are questions.

akondrahman commented 3 years ago

@Talismanic there were comments on if correlation is enough to suggest avoidance. I fixed the entire paper to handle this. There were other comments form my reading group on motivation and clarity, which I have addressed.

Talismanic commented 3 years ago
  1. Can you find an Ansible test script that has an instance of local only test and also is modified in a bug-related commit ?

Project Name: CentOS-PaaS-SIG/linch-pin Commit Hash: 4905430ab36c5585bd30e623ccb5155cfe127585 File Name: CentOS-PaaS-SIG\linch-pin\playbooks\tests\unit_tests.yml Bug Category: configuration_bug Code Change:

           - name: debug
             debug:
               msg: "Enabling contra-hdsl tests"
-          - name: Install Test Reqs
-            pip:
-              name: .[tests]
-              virtualenv: /tmp/linchpin
+          - name: shell for ansible version
+            shell: "ansible --version"
+          - name: shell to find version of setuptools
+            shell: "python -c 'import setuptools; print(setuptools.__version__)'"
+          - name: create virtualenv
+            shell: "python3 -m venv tutorial-env"
+            args:
+              chdir: /tmp/
+          - name: "Current directory"
+            shell: "echo $PWD; ls;"
+          - name: Install test requirements
+            shell: |
+                  source /tmp/tutorial-env/bin/activate
+                  pip3 install .[tests]
+            args:
               chdir: ../../
           - name: pyTest
             shell: |
-                  source /tmp/linchpin/bin/activate
+                  source /tmp/tutorial-env/bin/activate
                   python setup.py test
             args:
               chdir: ../../
           - name: Flake8 tests
             shell: |
-                  source /tmp/linchpin/bin/activate
+                  source /tmp/tutorial-env/bin/activate
                   flake8 --exclude=\.eggs,tests,docs,config/Dockerfiles --ignore=E124,E303,W504 --max-line-length 80 .
             args:
               chdir: ../../
Talismanic commented 3 years ago
  1. In the case of performance bugs can provide a detailed example of a bug that occurs due to Ansible variable handling and an an example bug due to lack of caching?

  2. Please describe the fix to the bug For the kubernetes-sigs/kubespray' project, Ansible test scripts were reported to be slow because of pulling Docker images for every build` that we listed for performance?

Ansible Variable

Problem: In a test script of project ceph/ceph-ansible, developers need to identify the OS category (Debian or Redhat) and set the file permission and group accordingly. Using ansible_os_family variable proved to be slower in such scenario.

Solution: Ansible fact is a common way to find the host related information. This is a builtin Ansible module which provides a fast way to collect different meta data about the host where the play is running. For example, When the contributors refactored this with ansible_facts variable and picked os_family from there, the test task execution time reduced a lot. This change was done on commit 7ddbe7471228b410d56d36b00a449d1cb8434d4a.

Git Diff:

   - set_fact:
       owner: 167
       group: 167
-    when: ansible_os_family == "RedHat"
+    when: ansible_facts['os_family'] == "RedHat"

   - set_fact:
       owner: 64045
       group: 64045
-    when: ansible_os_family == "Debian"
+    when: ansible_facts['os_family'] == "Debian"

Caching:

Problem:

kubernetes-incubator/kargo project use tasks where some docker images are required. Initially, docker images were not cached and everytime they were downloaded from remote repository. That was taking a lot of time.

Solution: Docker image download from image repository takes considerable amount of time depending the size of image. So in a play if we download same docker image multiple times, our test execution time will be increased. However, docker utility also keep docker images cached in the machine where it is running. So by utilizing that cached image we can significantly reduce the test execution time. So on commit 66408a87ee1755f0a36a3a0001fcb58ffa8a31b2 contributors apply below patch which prevented image download if it is already in cache.

Git Diff:

+- name: Download images to ansible host cache via first kube-master node
+  hosts: kube-master[0]
+  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
+  roles:
+    - { role: kubespray-defaults, when: "not skip_downloads and download_run_once and not download_localhost"}
+    - { role: kubernetes/preinstall, tags: preinstall, when: "not skip_downloads and download_run_once and not download_localhost" }
+    - { role: download, tags: download, when: "not skip_downloads and download_run_once and not download_localhost" }
+  environment: "{{ proxy_env }}"
+
Talismanic commented 3 years ago
  1. Can you give me the number of unique Ansible test scripts in which at least one category of the three testing pattern appears? If a test script has instances of two/three types of testing patterns, please count it once.

If we consider our FSE dataset, then it is 1106

Note to self:

SELECT distinct file_name FROM test.iac_anti_patterns_v2 where Local_Only_Test > 0 or Assertion_Roulette > 0 or External_Dependency > 0;

If we consider, new dataset where bug related categorization is done, it is 957.

Note to self:

select distinct file_name from commit_messages_23 where LOT > 0 or AR>0 or ED>0;

Difference in count is, during commit message parsing some files were not parsable by the parser due to encoding related issues.

Talismanic commented 3 years ago
  1. Please explain why the number of commits (2,606) is much smaller than number of Ansible test scripts (4,831). Usually there are more commits than scripts/files.

Bhaiya, from my experience, commit count is always less than the no of scripts. Because in a single commit we touch several scripts , for example task file, role file, variable file etc. Morever I also run a query to group by the files w.r.t commit and saw this was also the case for our dataset. Data is attached.

file_count_per_commit.zip

Note to self:

select distinct commit_hash as comm, count(file_name) as file_count from commit_messages_23 group by comm order by file_count desc;
Talismanic commented 3 years ago

@akondrahman Bhai, for item 4 I will get back to you soon.

akondrahman commented 3 years ago

@Talismanic Except for item#4, all your content has been added. Thanks for the hard work.

Talismanic commented 3 years ago

4.For style bug, can you please tell the consequence of Style bugs, such as trailing spaces after \texttt{:}' and wrong indentation occurred in an Ansible test script for the OSS rocknsm/rock' project ?

@akondrahman Bhai, for this specific case I don't see any harmful consequence as this is very minor thing. However, Ansible lint sometimes detects more important stuffs. For example:

     - name: Get sysctl content
-      shell: |
-        cat /etc/sysctl.conf
-      register: sysctl_content
+      set_fact:
+        sysctl_conf: "{{ lookup('file', '/etc/sysctl.conf') }}"
+

This snippet is extracted from project os-cloud/openstack-ansible-lxc_hosts and commit 0d28eeab560569cb3d220b4c47b5b0212eb0b98f. The earlier part (deleted from the git diff) indicates that, this test script used shell module to detect the content of sysctl.conf file. However, Ansible's official documentation suggests not to use shell unless it is absolutely necessary: "If you want to execute a command securely and predictably, it may be better to use the command module instead. Best practices when writing playbooks will follow the trend of using command unless the shell module is explicitly required. When running ad-hoc commands, use your best judgement." (link) This mainly makes the code less reusable and maintainable. For example, shell command for doing something may be different in different OS. For this reason, we see that practitioners refactored this code with lookup and file modules so that those can universaly work in any system.

Does this make any sense Bhaiya?

akondrahman commented 3 years ago

Makes sense. I have added all these content. There was a question after the reading group on the distribution of sub-categories, like how many of the configuration bugs were os, network, and storage-related ? And how many bugs were related to Docker image caching and Ansible vars? Do you have these numbers?

Thanks @Talismanic again for the hard work.

Talismanic commented 3 years ago

@akondrahman Bhai,

Configuration Bugs: Network Config Bug: 11 OS Config Bug: 13 Storage Config Bug: 7 Software Config Bug: 24

In the paper it seemed we skipped Software Config Bug. These are the bugs where developer missed to pass specific software related configuration parameter, for example mysqld environment variable, conf.d files. ansible.cfg value mismatch etc.

Performance Bugs

Ansible Vars: 4 Caching: 7 Others: 6

Bhaiya, note that caching is generic here. Practitioners cached many things like docker images, package files, intermediate results to optimize performance time in our data set. Others category actually comprises of unique activities per situation.

akondrahman commented 3 years ago

@Talismanic give me example of a software config bug.

Talismanic commented 3 years ago

@akondrahman Bhai,

Interpretation :

This sub-category indicates invalid, malformed, or erroneous configuration values in the scripts. For example, in RedHat-cip/dci-ansible project had a configuration variable that was undefined, so this generated error messages which were solved in commit 2a372273c52dfacbf33efe88e93b1a84904c5254.

Example 01:

Project Name: RedHat-cip/dci-ansible Commit Hash: 2a372273c52dfacbf33efe88e93b1a84904c5254. Commit Message:
setup_env: Add registry variables for OpenStack

Since ansible 2.7 the registry variables are now required for the
task execution to avoid an error.

Git Diff:

@@ -25,6 +25,9 @@
           - puddles
         product_id: '{{ openstack_product.product.id }}'
         data:
+          registry:
+            login: null
+            password: null
           releasename: queens
       register: openstack_topic_osp1

Apart from that we have couple of more nice examples, you can use whichever you like:

Example 02:

Project Name: kubernetes-incubator/kargo Commit Hash: 188bae142ba47d16113232770dfecbb8e9fddaf1 Commit Message: Fix wait for hosts in CI 1679 Also fix usage of failed when and handling exit code. Git Diff:

   gather_facts: false
   tasks:
     - name: Wait for SSH to come up.
-      local_action: wait_for host={{ansible_host}} port=22 delay=60 timeout=240 state=started
+      local_action: wait_for host={{inventory_hostname}} port=22 delay=60 timeout=240 state=started

Example 03:

Project Name: ANXS/postgresql Commit Hash: d785326c51c9bfe98cded72d464554155bd9e089 Commit Message: Fix #445: PSQL tasks fails when postgresql_port != 5432 (default value) Git Diff:

--- a/tests/vars.yml
+++ b/tests/vars.yml
@@ -1,6 +1,7 @@
 ---

 postgresql_version: 11
+postgresql_port: 5433

 postgresql_databases:
   - name: foobar

Example 04:

Project Name: os-cloud/openstack-ansible-lxc_hosts Commit Hash: 158d035b92ae7e239f0c1bed78727eaa7ecb2942 Commit Message: Make corrections to LXC bridge template file This change adjusts a few of the modifications made to the lxc net bridge.cfg.j2 template file in change I3c8225124a5f18db81259e1d52d0168ef52c3c17. The minus signs have been removed from if and endif blocks so that whitespace is kept intact between sections. The ordering of post up and post down commands has also been changed so that iptables rules are created before the dnsmasq service is started as they were previously. The default value of lxc net gateway has also been changed to null so that it s evaluated as expected. Its current value none is evaluated as a string. A test has been added to compare the contents of the deployed lxc bridge interface file with its expected contents. Change Id I39d7b3f40de6ac691550c11d71bb6a182b3452f4 Git Diff:

       register: lxc_bridge_file
     - name: Check dnsmasq is running
       shell: ps auxfww | grep -w 'dnsmasq -u lxc-dnsmasq'
+    - name: Get deployed interface file contents, without Ansible managed line
+      shell: |
+        cat /etc/network/interfaces.d/lxc-net-bridge.cfg | tail -n +3
+      register: interface_file
+    - name: Get expected interface file contents
+      shell: |
+        cat files/expected-lxc-net-bridge.cfg
+      register: expected_interface_file
     - name: Check role functions
       assert:
         that:
akondrahman commented 3 years ago

@Talismanic except for example 02, all can be mapped to existing config sub-categories. Would you agree?

I think my confusion will be fixed by giving examples for the following:

mysqld environment variable, conf.d files. ansible.cfg value mismatch

Talismanic commented 3 years ago

@Talismanic except for example 02, all can be mapped to existing config sub-categories. Would you agree?

@akondrahman Bhaiya, I am not confident to label this to other categories. Wrong host groups were passed initially which was fixed. It's neither OS or Network or Storage dependent.

I think my confusion will be fixed by giving examples for the following:

MySQL Example:

Project Name: os-cloud/openstack-ansible-galera_server Commit Hash: cd11c5a56e96c0fc06d96be755a654a12898e02d Commit Message:

Updated repo for new org The role was changed to make it compatible with the OpenStack CI. The changes effect defaults handlers and the tests for the role and adds gitignore review files. The changes essentially get the role to a state where its passing the tests which are spinning up a galera cluster adding users and databases and then testing integrity from every node. The tests specifically ensure we re able to guarantee that after the role runs everything works. Previously to these changes the role assumed everything was working and nothing had been done to guarantee cluster state. In the handler changes the temporary sst directory is cleaned up should the handler restart fail. This ensure that a node is in a clean state if a leftover sst directory was on the disk which would cause a node to fail to join a cluster or bootstrap. Additionally the environment variable MYSQLD STARTUP TIMEOUT is now being passed into the init script because the defaults are not being sourced at the init script runtime. In the task changes a new configuration file that is part of the mariadb package is being removed which has unforeseen options within it causing no logs to be created. the default option galera innodb additional mem pool size was removed because its no longer valid within MariaDB10 and we d never caught that error message until now. The tests were updated to support running the role from a user which was not root. Change Id I16af30c660790656fc2d59f9943c172b88098905 Signed off by Kevin Carter kevin.carter rackspace.com

Git Diff:

 - name: Playbook for role testing
   hosts: galera_all
+  serial: 1
+  user: root
   gather_facts: true
   vars:
     galera_root_password: secrete
     galera_root_user: root
+    galera_innodb_buffer_pool_size: 512M
+    galera_innodb_log_buffer_size: 32M
+    galera_server_id: "{{ inventory_hostname | string_2_int }}"
+    galera_wsrep_node_name: "{{ inventory_hostname }}"
+    galera_wsrep_provider_options:
+      - { option: "gcache.size", value: "32M" }
   roles:
     - role: "{{ rolename | basename }}"
       galera_server_id: "{{ inventory_hostname | string_2_int }}"

Conf Example:

Project Name: os-cloud/os-ansible-deployment Commit Hash: 3c8548b265c861ffac5e097fe0715f76393e5363 Commit Message:

Git Diff:

           - name: keystone.yml.aio
           - name: neutron.yml.aio
           - name: nova.yml.aio
+        upgrade:
+          # This starts as an AIO box, then an upgrade is run
+          - name: aodh.yml.aio
+          - name: cinder.yml.aio
+          - name: ceilometer.yml.aio
+          - name: designate.yml.aio
+          - name: glance.yml.aio
+          - name: gnocchi.yml.aio
+          - name: heat.yml.aio
+          - name: horizon.yml.aio
+          - name: keystone.yml.aio
+          - name: neutron.yml.aio
+          - name: nova.yml.aio
+          - name: swift.yml.aio
   vars:
     scenario: "{{ lookup('env','SCENARIO') | default('aio', true) }}"
     sshd:

Ansible Config Example:

Project Name: os-cloud/openstack-ansible-galera_server Commit Hash: 5bcdd4b77234d3672d98ae4ab7ea585c3d323581 Commit Message: Update tox config and add bashate E006 E040 exceptions This patch updates the tox.ini the same bashate exceptions as are currently in the OpenStack Ansible playbook repo. It also ensures that the linters and all lint targets work appropriately and normalises the tox.ini configuration to use uniform formatting. The use of ansible.cfg is removed as there is no way of being certain which paths can be used without reverting to an ugly sed hack in the commands. This is why it is preferred to make use of environment variables which make use of tox s default substitutions instead. It s a more reliable way of achieving the goal for the purpose of gating and testing. The switch to using a git clone instead of ansible galaxy to download the plugins is due to the path spec not being able to work in Ansible 2.x. 1 1 https github.com ansible ansible issues 13563 Change Id I077c5b79a95e78fd81c33382843273893d48bbb6

Git Diff:

           name: "trusty.tgz"
           sha256sum: "56c6a6e132ea7d10be2f3e8104f47136ccf408b30e362133f0dc4a0a9adb4d0c"
           chroot_path: trusty/rootfs-amd64
-      # The $HOME directory is mocked to work with tox
-      #  by defining the 'ansible_env' hash. This should
-      #  NEVER be done outside of testing.
-      ansible_env:  ## NEVER DO THIS OUTSIDE OF TESTING
-        HOME: "/tmp"
     - role: "py_from_git"
       git_repo: "https://github.com/lxc/python2-lxc"
       git_dest: "/opt/lxc_python2"

I think we can rename the category as something like 'Generic Configuration Bugs'.

akondrahman commented 3 years ago

Thanks @Talismanic ... this is good stuff.