akondrahman / IaCTesting

Placeholder for the research study related to IaC testing anti-patterns

TELIC Paper Content Discussion #21

Closed Talismanic closed 3 years ago

Talismanic commented 3 years ago
  1. The paper text says TELIC identifies a test script as including a test play if a play within the script includes one of the following keywords: ‘check’, ‘determine’, ‘ensure’, ‘test’, ‘validate’, and ‘verify’. Actually, TELIC classifies a file as a test script if it is under a "tests" directory and has a yml/yaml extension.

  2. I have not determined the total test plays and LOC. Bhaiya, have you calculated those from the scripts?

  3. For selecting the oracle dataset, I used the RAND() function of MySQL to select 100 random scripts from our anti-pattern database.

  4. In Listing 6, our example of adding the yum repositories from an external URL is actually the only way of adding a new repository. But if we install a package from an external repository, that will be an anti-pattern. For example, the following is a hypothetical example of the anti-pattern:

    
    - name: Downloading nginx rpm
      get_url:
        url: http://nginx.org/packages/centos/{{ansible_distribution_major_version}}/noarch/RPMS/nginx-release-centos-{{ansible_distribution_major_version}}-0.el{{ansible_distribution_major_version}}.ngx.noarch.rpm
        dest: /tmp/ngx.noarch.rpm

    - name: Install nginx
      yum:
        name: /tmp/ngx.noarch.rpm
        state: present
The right way to do this would have been simply taking full advantage of the yum module:
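
The snippet that originally followed this sentence is not preserved in the thread. As a rough sketch only, one way to rely on the yum module is to register the repository once and then install the package by name. The repository id `nginx`, the description, the `baseurl`, and the `gpgkey` below are illustrative assumptions, not taken from the paper or the mined repositories.

```yaml
# Sketch only: repository id, description, baseurl, and gpgkey are assumptions.
- name: Add the nginx yum repository
  yum_repository:
    name: nginx
    description: nginx repository
    baseurl: http://nginx.org/packages/centos/$releasever/$basearch/
    gpgcheck: yes
    gpgkey: https://nginx.org/keys/nginx_signing.key
    enabled: yes

# Let yum resolve and download the package instead of fetching an rpm by hand.
- name: Install nginx
  yum:
    name: nginx
    state: present
```

This still depends on a remote repository, but package resolution is left to yum rather than to a hard-coded rpm URL and a temporary file.
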
akondrahman commented 3 years ago

@Talismanic Happy new year to you and your family ... let this be the year you get your first flagship paper at FSE 2021!

Talismanic commented 3 years ago

> @Talismanic Happy new year to you and your family ... let this be the year you get your first flagship paper at FSE 2021!

Thanks for the wishes, Bhaiya. Happy new year to you and your family also.

Please let me know what I need to do next.

akondrahman commented 3 years ago

New instructions here: #22

akondrahman commented 3 years ago

@Talismanic Need a better example than the following. Or, explain why you think the following can cause maintainability issues. Along similar lines, why do you think having an external URL can be problematic for test code?

    - name: Downloading nginx rpm
      get_url:
        url: http://nginx.org/packages/centos/{{ansible_distribution_major_version}}/noarch/RPMS/nginx-release-centos-{{ansible_distribution_major_version}}-0.el{{ansible_distribution_major_version}}.ngx.noarch.rpm
        dest: /tmp/ngx.noarch.rpm

    - name: Install nginx
      yum:
        name: /tmp/ngx.noarch.rpm
        state: present

Talismanic commented 3 years ago

> Or, explain why you think the following can cause maintainability issues.

Bhaiya, there are a couple of reasons:

  1. Fetching data from external URLs can always be flaky and is subject to network connectivity. For example, servers may be behind a proxy or firewall.

  2. Reading an installation file from a location can depend on permission levels.

> Along similar lines, why do you think having an external URL can be problematic for test code?

  1. Point 1 mentioned above.

  2. The location of the resources on the web service can change.

akondrahman commented 3 years ago

OK then. Can you please write your explanation in an academic fashion and send the writing in a text file?

Talismanic commented 3 years ago

> OK then. Can you please write your explanation in an academic fashion and send the writing in a text file?

Sorry Bhaiya, I missed this message somehow. Will the explanation below work?

In the case of the test play "Install the EPEL repository", the Nginx package needs to be downloaded and installed to
test whether Nginx is correctly installed by the Ansible playbook. However, in this example, the rpm package is
downloaded over HTTP from a remote web service, which makes the whole test script dependent
on the availability of the package at that specific URL. If the package is moved to any
other location, the test may fail even though the other parts of the test playbook work fine. Apart from
that, this dependency on a remote web service inherently requires network
connectivity between the test infrastructure and the remote infrastructure where the rpm is hosted.
So if there is no internet connectivity on the test infrastructure, or there are firewalls between the test
infrastructure and the remote infrastructure, the test will again fail. Overall, Remote Mystery makes
the test scripts flaky and tightly couples them to external dependencies.

remote_mystery.txt

akondrahman commented 3 years ago

@Talismanic The code in question: where did you get it from?

    - name: Downloading nginx rpm
      get_url:
        url: http://nginx.org/packages/centos/{{ansible_distribution_major_version}}/noarch/RPMS/nginx-release-centos-{{ansible_distribution_major_version}}-0.el{{ansible_distribution_major_version}}.ngx.noarch.rpm
        dest: /tmp/ngx.noarch.rpm

    - name: Install nginx
      yum:
        name: /tmp/ngx.noarch.rpm
        state: present

Talismanic commented 3 years ago

> The code in question: where did you get it from?

Bhaiya, this is not from any of our repos. It's from a gist. Here is the link.

Talismanic commented 3 years ago

@akondrahman Bhai, I have found better examples in our mined repos, however. For example, in the file below I found one task which is similar:

C:\mined_repos\yanyao\openstack-deployment\tests\roles\bootstrap-host\tasks\prepare_libvirt_service.yml
- name: Download LibVirt CPU map configuration script
  get_url:
    url: "http://git.openstack.org/cgit/openstack-dev/devstack/plain/tools/cpu_map_update.py?h=a631abadde7346b49fab5b2ac16561dff77050d7"
    dest: /openstack/cpu_map_update.py
    validate_certs: yes
    mode: 755
  register: libvirt_cpu_map_download
  tags:
    - libvirt-cpu-map-download

- name: Execute LibVirt CPU map configuration script
  shell: /openstack/cpu_map_update.py /usr/share/libvirt/cpu_map.xml
  when: libvirt_cpu_map_download | changed
  tags:
    - libvirt-cpu-map-updated

In the case of the test play "Download LibVirt CPU map configuration script", a Python script is downloaded over HTTP from a remote web service, which makes the whole test script dependent on the availability of the script at that specific URL. If the script is moved to any other location, the test may fail even though the other parts of the test playbook work fine. Apart from that, this dependency on a remote web service inherently requires network connectivity between the test infrastructure and the remote infrastructure where the Python script is hosted. So if there is no internet connectivity on the test infrastructure, or there are firewalls between the test infrastructure and the remote infrastructure, the test will again fail. Finally, after downloading the script, the task puts it into a specific directory (e.g. /openstack). This makes the task error-prone as well: if the task runs in an environment where this directory is not present, it will fail. Overall, Remote Mystery makes the test scripts flaky and tightly couples them to external dependencies.
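
One possible way to reduce this coupling, shown only as a sketch: keep a pinned copy of the script inside the test repository and run it with the `script` module, which transfers a local script to the target host before executing it. The `files/cpu_map_update.py` path is a hypothetical location, and the sketch assumes the script starts with a valid shebang line.

```yaml
# Sketch only: files/cpu_map_update.py is a hypothetical vendored copy of the
# script; it is not part of the mined repository.
- name: Execute LibVirt CPU map configuration script from a local copy
  script: files/cpu_map_update.py /usr/share/libvirt/cpu_map.xml
  tags:
    - libvirt-cpu-map-updated
```

With this shape the task needs neither network access nor the /openstack directory, at the cost of keeping the vendored copy up to date.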

akondrahman commented 3 years ago

@Talismanic great example above ^ If you can find something of this level for the other 4 categories, let me know in this issue. Keep looking if you can.

Talismanic commented 3 years ago

Assertion Roulette

Project Location

C:\mined_repos\openstack\openstack-ansible\tests\bootstrap-aio.yml

Sample Code Base:

  pre_tasks:
    - name: Run setup module
      setup:
        gather_subset:
          - network
          - hardware
          - virtual
  post_tasks:
    - name: Check that new network interfaces are up
      assert:
        that:
          - ansible_eth12['active'] | bool
          - ansible_eth13['active'] | bool
          - ansible_eth14['active'] | bool
      when:
        - (bootstrap_host_container_tech | default('unknown')) != 'nspawn'

Here we can see, in the "Check that new network interfaces are up" play, all three interfaces have been checked in the same assert block. So, if the play fails, it will be tough to determine which network interface is actually down.
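
As a hedged illustration of how the check could be split so that a failure names the interface, the sketch below loops over the interface names and attaches a message to each assertion. It assumes an Ansible release that provides the `ansible_facts` dictionary and the `fail_msg` option of `assert`; neither appears in the mined file.

```yaml
  post_tasks:
    # Sketch only: one assertion per interface, each with its own message.
    - name: Check that each new network interface is up
      assert:
        that:
          - ansible_facts[item]['active'] | bool
        fail_msg: "Interface {{ item }} is not active"
      loop:
        - eth12
        - eth13
        - eth14
      when:
        - (bootstrap_host_container_tech | default('unknown')) != 'nspawn'
```

Each loop iteration is reported separately, so the failing item is visible in the task output.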

Talismanic commented 3 years ago

Disabled Lint Checking

Project Location:

C:\mined_repos\openstack\openstack-ansible-lxc_container_create\tests\test-containers-functional.yml

Sample code snippet:

    - name: Check for the presence of the right bound mount for container1
      command: grep "lxc.mount.entry = /openstack/log/container1" /var/lib/lxc/container1/config
      tags:
        - skip_ansible_lint

    - name: Check for the presence of the default bound mount for container3
      command: grep "lxc.mount.entry = /openstack/backup/container3" /var/lib/lxc/container3/config
      tags:
        - skip_ansible_lint

In both plays we can see that ansible-lint has been disabled. The plays validate the presence of specific strings in the config file. Here we can see that a native shell command (grep) is run through the command module, so ansible-lint would probably flag this as an issue. In similar cases, we often see Ansible's lineinfile module used instead. For example, a sample implementation with the lineinfile module could have been:

- name: Check for the presence of the default bound mount for container3
  lineinfile:
    dest: /var/lib/lxc/container3/config
    line: lxc.mount.entry = /openstack/backup/container3
  # In check mode nothing is written; "changed" means the line is missing.
  check_mode: yes
  register: presence
  failed_when: presence.changed

Talismanic commented 3 years ago

Dirty Environment

Project Location:

C:\mined_repos\infOpen\ansible-role-airflow\tests\playbook.yml

Sample Code:

- hosts: 'airflow-vagrant-xenial64:airflow-docker-xenial'
  name: 'Install tests prerequisites'
  gather_facts: False
  tasks:
    - name: 'PREREQUISITES | APT | Do an apt-get update'
      become: True
      raw: 'apt-get update -qq'
      changed_when: False
    - name: 'PREREQUISITES | APT | Install python 2.7, iproute and net-tools'
      become: True
      raw: 'apt-get install -qq python2.7 iproute net-tools'
      changed_when: False

- hosts: 'all'
  roles:
    - role: "{{ role_name }}"
  vars:
    role_name: "{{ playbook_dir | basename }}"

In this example we can see that, before the test roles start running, some test prerequisite tasks are completed, for example updating the apt repo and installing some dependencies. However, after the roles have been executed, we do not see any task that removes the installed dependencies, so the environment remains dirty. For example, python2.7 has been installed as a dependency. If someone needs to test something on the same host that requires Python 3+, conflicts can arise. Also, if Python 3+ is already running in the same environment, it may create issues. So the environment needs to be handled with proper care. In this case, practitioners can remediate this anti-pattern by any of the following means:

  1. By cleaning up the installed dependencies after the test is over
  2. By using a Python virtual environment if possible
  3. By using dedicated containers as hosts for this task
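
A minimal sketch of the first option, assuming the packages were not present before the test and nothing else on the host needs them: a final play, placed after the role under test, that mirrors the prerequisites play and removes what it installed. The play and task names are hypothetical.

```yaml
# Sketch only: cleanup play appended after the role under test.
- hosts: 'airflow-vagrant-xenial64:airflow-docker-xenial'
  name: 'Remove tests prerequisites'
  gather_facts: False
  tasks:
    - name: 'CLEANUP | APT | Remove packages installed only for the tests'
      become: True
      # raw is used, as in the prerequisites play, so Python is not required.
      raw: 'apt-get remove -y -qq python2.7 iproute net-tools'
```

If any of these packages existed before the test run, removing them would itself alter the environment, so the package list would need to be adjusted.
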
Talismanic commented 3 years ago

@akondrahman Bhai, for localhost testing, all examples are similar in that the hosts variable is set to localhost. Apart from that, I have managed to find the examples mentioned above.

akondrahman commented 3 years ago

To use your Dirty Environment example I need a simple but clear explanation of test roles. How are test plays different from test roles? I will put it in the background section.

Talismanic commented 3 years ago

> To use your Dirty Environment example I need a simple but clear explanation of test roles.

Bhaiya, will an example made by me or collected from the internet do?

> How are test plays different from test roles?

I did not understand this question, Bhaiya.

akondrahman commented 3 years ago

They point to the same question. I need a 2-3 sentence paragraph on the connection of roles with test plays in Ansible.

Talismanic commented 3 years ago

Roles are a way of organizing Ansible playbooks better. The same functionality might be needed in different Ansible plays. For example, testing whether zookeeper is running might be required in multiple plays. So testing the zookeeper status is normally isolated in a role, and this role is used in multiple plays.

To see this visually, in the windmill repo we have the structure below in the roles directory, where testing zookeeper is an independent role:

[image: roles directory structure in the windmill repo]

In this role, zookeeper's status is tested:

---
- name: Ensure zookeeper is running
  become: yes
  shell: /usr/sbin/service zookeeper status
  changed_when: false
  tags:
    - skip_ansible_lint

Then this role is used in the play where the zookeeper status is checked:

- name: Install zookeeper
  hosts: zookeeper

  tasks:
    - name: Setup openstack.zookeeper role
      include_role:
        name: openstack.zookeeper

  post_tasks:
    - name: Run zookeeper validation
      include_role:
        name: test.zookeeper

Independent roles do not have any hosts association, so they cannot be run on their own. A play has a hosts association and can run roles on those hosts.

@akondrahman Bhai, does the above explanation help?

akondrahman commented 3 years ago

You might have seen the e-mail from Greg, the practitioner: "You do know that the very excellent ansible-lint already exists, yes?"

How do you think we can address this in the paper?

Talismanic commented 3 years ago

> You might have seen the e-mail from Greg, the practitioner: "You do know that the very excellent ansible-lint already exists, yes?"
>
> How do you think we can address this in the paper?

Bhaiya, the purpose of Ansible Lint is different from that of TELIC. Ansible Lint focuses more on Ansible best practices, YAML syntax, etc., whereas TELIC focuses on test smells.

For example, Ansible Lint may not flag multiple assertions or importing a playbook as a problem. TELIC, however, will detect those as anti-patterns, since they violate established testing best practices. TELIC even treats disabling Ansible Lint as an anti-pattern.

So I think the existence of Ansible Lint is not sufficient for detecting test smells in Ansible code.

Furthermore, if we work with Molecule in the future, Testinfra's Python test code will come into the picture. That will make Ansible Lint even less relevant in terms of test smells.

akondrahman commented 3 years ago

Can ansible-lint detect any of our five test smells?

Talismanic commented 3 years ago

> Can ansible-lint detect any of our five test smells?

Bhaiya, it is better that I run ansible-lint on some sample mined repos and TELIC on the same ones. Then I will share the comparison.

Talismanic commented 3 years ago

@akondrahman Bhaiya, I have run ansible-lint and TELIC on the sample files from which we took the 4 examples in the above comments. Their outputs are distinct. The summary is below. I have also attached the output of ansible-lint at the end of the comment.

Comparison:

| Filename | TELIC reporting | ansible-lint reporting |
|---|---|---|
| \yanyao\openstack-deployment\tests\roles\bootstrap-host\tasks\prepare_libvirt_service.yml | Mystery Guest | Nothing |
| \openstack\openstack-ansible\tests\bootstrap-aio.yml | Assertion Roulette | Syntax Error |
| \openstack\openstack-ansible-lxc_container_create\tests\test-containers-functional.yml | Disable Lint Checking | Too many characters in some line |
| \infOpen\ansible-role-airflow\tests\playbook.yml | Dirty Environment | Nothing |

Output File:

ansible-lint_telic-comp.txt

Next, I checked the source code of ansible-lint and explored its rule engine at a high level from its GitHub repo. I did not see any default rule that matches the detection mechanism of any of our 5 anti-patterns. I have not examined the rules line by line, but the high-level naming conventions indicate that.

However, there is a mechanism to extend ansible-lint with custom rules. People can attempt to extend ansible-lint by supplying their own rules to detect custom errors.

akondrahman commented 3 years ago

Thanks @Talismanic ... I am convinced that ansible-lint will not find what TELIC can find.

akondrahman commented 3 years ago

@Talismanic need a local only testing example for intro. Can you please send me one that is not in the paper already?

Talismanic commented 3 years ago

Project Location:

C:\mined_repos\AnsibleShipyard\ansible-mesos\tests\test.yml

Sample Code:

---
- hosts: localhost
  connection: local
  remote_user: root
  tasks:
    - shell: "ps aux | grep -i mesos"
      register: status
      failed_when: status.rc != 0
      when: ansible_service_mgr != 'systemd'

    - shell: "systemctl status mesos-master | grep running"
      register: status
      failed_when: status.rc != 0
      when: ansible_service_mgr == 'systemd'

In the above example, we can see that the play runs on localhost. It checks whether the mesos process is running on the host and whether mesos-master is running as a service. However, this does not guarantee that the test is self-contained. For example, mesos needs Java or OpenJDK on the system. There can be a scenario where localhost has all the dependencies installed correctly, but in the environment where the actual production code will run, some dependencies failed to install. If we then look at the output of the test, we will not be able to tell whether a failure is due to a bug in the implementation of mesos or to an environmental dependency. Hence we need to run this test in a clean remote environment or a production-like environment to ensure every step of running mesos works.
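
As a sketch of what the non-local variant could look like, the same checks can target a group of clean, production-like hosts instead of localhost. The inventory group `mesos_test_nodes` is a hypothetical name and does not exist in the mined repository.

```yaml
---
# Sketch only: mesos_test_nodes is a hypothetical inventory group of clean,
# production-like hosts.
- hosts: mesos_test_nodes
  remote_user: root
  tasks:
    - shell: "ps aux | grep -i mesos"
      register: status
      failed_when: status.rc != 0
      when: ansible_service_mgr != 'systemd'

    - shell: "systemctl status mesos-master | grep running"
      register: status
      failed_when: status.rc != 0
      when: ansible_service_mgr == 'systemd'
```

The only changes from the original play are the `hosts` value and the removal of `connection: local`, so the checks run over the default remote connection.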

@akondrahman Bhai, will this example work?

akondrahman commented 3 years ago

@Talismanic ... perfect. See the intro. section now. Try to understand how I changed the intro. and let me know if you have questions.

Talismanic commented 3 years ago

> @Talismanic ... perfect. See the intro. section now. Try to understand how I changed the intro. and let me know if you have questions.

Bhaiya, I read the intro and understand how the story is built. I also fixed some typos and tweaked some sentences.

akondrahman commented 3 years ago

@Talismanic good! I would like you to take a stab at the abstract. Think it over for 1-2 days, write the abstract, and paste it here. The new intro will help you write the abstract. However, unlike the intro, you need to present the most interesting results of the paper.

Talismanic commented 3 years ago

> I would like you to take a stab at the abstract.

Ok Bhaiya. On it. Will start to think about it from tonight.

Talismanic commented 3 years ago

@akondrahman Bhai, 2 queries:

  1. Did Greg participate in the survey and agree with any of our findings?
  2. Did we discard the plan of submitting bug reports?

Talismanic commented 3 years ago

@akondrahman Bhai, First draft of the abstract:

Infrastructure as Code (IaC) is the modern DevOps practice of managing software infrastructure and packages automatically with tools like Ansible, Chef, etc. With the "as Code" suffix, this practice inherits some imperative parts of software development, like automated testing and continuous integration. To test IaC code automatically, practitioners need to prepare test scripts, and just like in other software testing disciplines, IaC test scripts can be susceptible to smells or anti-patterns. Testing anti-patterns are test code constructs that can lead to maintainability problems and hurt the quality of the production code. The goal of our research is to help practitioners identify testing anti-patterns in IaC test scripts automatically. We analyze 233 Ansible test scripts qualitatively and derive five anti-pattern categories: Assertion Confusion, Mystery Guest, Local-only Testing, Leftover Installation, and Disabled Lint Checking. Among these, the last three are unique to IaC. We then build a tool, Test Linter for Infrastructure as Code (TELIC), analyze 5019 test scripts from 378 OSS repositories, and detect 1662 anti-patterns. We apply closed coding to check the performance of TELIC, conduct a survey to check practitioners' agreement with the anti-pattern categories, and submit bug reports on the GitHub pages of the XX OSS repositories to check practitioners' perception. We observe that the recall of TELIC is more than 0.95 on our oracle dataset across all the categories. From practitioners' responses, we also get strong agreement on the identified categories.

akondrahman commented 3 years ago

> Did Greg participate in the survey and agree with any of our findings?

Greg didn't respond to my reply.

> Did we discard the plan of submitting bug reports?

Yes. We will send out surveys.

akondrahman commented 3 years ago

@Talismanic when is a good time for you to participate in the reading group? I will try my best to accommodate you.

Talismanic commented 3 years ago

> Did we discard the plan of submitting bug reports?
>
> Yes. We will send out surveys.

Bhaiya, if it is about time, I can spend some time on the bug reporting and obtain the results. You can share with me one sample bug report you did for the 7-Sins paper.

Talismanic commented 3 years ago

> @Talismanic when is a good time for you to participate in the reading group? I will try my best to accommodate you.

Bhaiya, here are some proposed times:

14th-Jan-2020: 11 pm Dhaka time == 11 am CST
15th-Jan-2020: 11 pm Dhaka time == 11 am CST
18th-Jan-2020: 11 pm Dhaka time == 11 am CST (same for the rest of the week)

akondrahman commented 3 years ago

Give me some dates in Feb. Classes don't start till Jan 19.

akondrahman commented 3 years ago

@Talismanic

Here are some example bug reports:

https://github.com/voxpupuli/puppet-unbound/issues/183
https://github.com/deric/puppet-mesos/issues/88
https://github.com/cookbooks/ic-mongodb/pull/1

Talismanic commented 3 years ago

> Give me some dates in Feb. Classes don't start till Jan 19.

Bhaiya, I should be available at similar times in February (11 pm Dhaka == 11 am CST).

Talismanic commented 3 years ago

> @Talismanic
>
> Here are some example bug reports:
>
> voxpupuli/puppet-unbound#183
> deric/puppet-mesos#88
> cookbooks/ic-mongodb#1

I will update you about this by tonight.

Talismanic commented 3 years ago

@akondrahman Bhaiya, I have submitted one bug on Local Only Testing here: https://github.com/rocknsm/rock/issues/553. If the format is ok, I will gradually submit more bugs.

akondrahman commented 3 years ago

> I am a infrastructure as code researcher

Say you are a developer/engineer. Devs tend to ignore researchers.

akondrahman commented 3 years ago

@Talismanic ... out of curiosity, do you use, or have you used, model-based testing at Grameen Phone? Based on your experience, do you think model-based testing is a good fit for Ansible tests?

Talismanic commented 3 years ago

> @Talismanic ... out of curiosity, do you use, or have you used, model-based testing at Grameen Phone? Based on your experience, do you think model-based testing is a good fit for Ansible tests?

No Bhaiya.

Honestly Bhaiya, I have not done hands-on IaC work since Mar-20, as our DevOps team is now mature enough to handle it themselves. I only consult for them from time to time when they are stuck.

akondrahman commented 3 years ago

What are you working on?

Talismanic commented 3 years ago

> What are you working on?

Bhaiya, currently my work mainly revolves around:
i) Supporting two development teams to run regular sprints as tech lead
ii) Supporting one development team to decompose a mammoth monolith into microservices
iii) Time-to-time consultation support to the common Platforms Team

akondrahman commented 3 years ago

> Supporting one development team to decompose a mammoth monolith into microservices

We should talk about this sometime. I am thinking of novel testing techniques for IaC, and that is why I asked about model-based testing. I might bounce some of my ideas off you some time later.

akondrahman commented 3 years ago

@Talismanic In your experience have you come across performance-related bugs for Ansible playbooks?

Talismanic commented 3 years ago

> @Talismanic In your experience have you come across performance-related bugs for Ansible playbooks?

Bhaiya, I am afraid I need more clarification about performance-related bugs. Did you mean bugs that create performance issues in the production software? Or did you mean bugs that create performance issues in the Ansible scripts themselves (e.g. more resource/time consumption)?