Data for TAMI #17

Closed Talismanic closed 3 years ago

1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab. I will do the filtering of Python and TOX files myself and plug in the results.
2. The commit count, the test-related commit count (i.e. the count of commits that modified YAML test files), total Ansible scripts, total test scripts, and duration of all repos for the three datasets: Openstack, GitHub, and GitLab.
3. Only the YAML scripts from the three datasets: Openstack, GitHub, and GitLab. Please preserve the directory structure, otherwise I can't map them to your CSV results.
4. The examples that I asked for before: a. I want a lot of setup and no cleanup.
5. 200 scripts for submitting GitHub issues.

@Talismanic

I need the Openstack and GitHub YAML ZIP again. It seems you have added more repos. I am expecting a ZIP file of 495 scripts for Openstack and 4942 scripts for GitHub, preserving the whole directory structure. Without this I can't plug in the smell density values and count per play values. Here is the structure:

    ZIP
    |- Openstack
    |------subdir1
    |----------subdir1/subsubdir1
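A minimal sketch of how such a ZIP could be produced while keeping the directory structure, assuming the scripts live under a local folder such as `Openstack/`. The function name `zip_yaml_tree` and the choice to keep only `.yml`/`.yaml` files are assumptions for illustration, not something taken from the TAMI tooling.

```python
import zipfile
from pathlib import Path


def zip_yaml_tree(root, zip_path):
    """Zip every .yml/.yaml file under root, keeping paths relative to root's parent.

    Hypothetical helper: returns the number of scripts written so the count
    can be compared against the expected totals.
    """
    root = Path(root)
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.suffix.lower() in {".yml", ".yaml"}:
                # arcname keeps entries like Openstack/subdir1/subsubdir1/x.yml
                zf.write(path, arcname=path.relative_to(root.parent).as_posix())
                count += 1
    return count
```

The returned count gives a quick check against the expected number of scripts before the archive is shared.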

Just completing the RQ2 is taking 1 week! Hope this loop will close soon.

Bhaiya, I rechecked the Openstack and GitHub repos, rearranged the folders, and shared them as a part-by-part ZIP. I am extremely sorry for the back and forth. :(

@Talismanic

No need to apologize. I understand that you have other commitments.

Thanks for sharing all that data. We are still missing TAMI's accuracy evaluation. For that I want you to run TAMI on the scripts that you gave Brinto. From the output file I will calculate precision and recall for TAMI and plug the results in. You will send me the CSV and the folder in which the scripts that you gave Brinto reside.
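The precision and recall calculation mentioned above could look roughly like the sketch below. It assumes the TAMI output and the manual labels have already been reduced to dictionaries mapping a script path to a boolean (anti-pattern reported or not). That layout is an assumption, since the actual CSV columns are not described in this thread.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall over the scripts present in `actual`.

    predicted, actual: dicts mapping script path -> bool (hypothetical layout).
    Scripts missing from `predicted` are treated as "no anti-pattern reported".
    """
    tp = sum(1 for k in actual if predicted.get(k, False) and actual[k])
    fp = sum(1 for k in actual if predicted.get(k, False) and not actual[k])
    fn = sum(1 for k in actual if not predicted.get(k, False) and actual[k])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In practice this would be run once per anti-pattern category, with one pair of dictionaries per category.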

@akondrahman Bhai, one observation: when Brinto labeled the data, we had Python scripts in our dataset, but now we only consider yml scripts. We only have 18 yml scripts there. Should that be a concern?

If it is a concern, one proposal is to take some more yml scripts from our latest Openstack dataset.

Also Bhaiya, I am a bit confused about how you will use these ZIP files. If you want to match the TAMI raw output CSV with these ZIP folders, it will be tough, because the output CSV was produced by running TAMI on plain directories (without GitHub/GitLab etc. subdirectories). I also have the flat folder structure as a ZIP. If you want, I can share that, and from there you will be able to match the exact file locations from the TAMI output.

@Talismanic

You are right about the Oracle dataset. 18 is not enough. I would like you to create an oracle dataset with 100 scripts: 18 from the old one, 7 scripts from the new Openstack repos, and 75 from the GitHub repos. Run TAMI, send me the CSV output and the scripts as a ZIP file. I will assign a graduate student at Tenn. Tech to do the labeling.
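One way to assemble the 18 + 7 + 75 composition is sketched below, assuming simple random sampling from plain Python lists of script paths. The function name `build_oracle`, its parameters, and the fixed seed are hypothetical and only illustrate the arithmetic.

```python
import random


def build_oracle(old_scripts, openstack_scripts, github_scripts,
                 n_openstack=7, n_github=75, seed=0):
    """Return all old scripts plus random samples from the two newer pools.

    Hypothetical helper: with 18 old scripts and the default sample sizes,
    the result holds 18 + 7 + 75 = 100 scripts.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    picked = list(old_scripts)
    picked += rng.sample(sorted(openstack_scripts), n_openstack)
    picked += rng.sample(sorted(github_scripts), n_github)
    return picked
```

Sorting the pools before sampling keeps the selection stable when the inputs arrive in a different order.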

Run TAMI, send me the CSV output and the scripts as a ZIP file

Bhaiya, in the attached ZIP file you will find a folder named \tests where all the files are aggregated. I ran TAMI on that folder, and the result is summarized in the file Oracle_Rating_by_TAMI.xlsx. A second file, Oracle_files.xlsx, lists the same file names without any rating. The rater can record the findings in that file. oracle.zip

@akondrahman Bhai, Please let me know what will be my next task. :)

Great job @Talismanic ... can you now please send me the sample of YAML scripts with GitHub repo links to submit bug reports. I need 200 of them from the GitHub dataset

@akondrahman Bhai, unfortunately we have only 108 unique GitHub repos where at least 1 anti-pattern is found. For GitLab and Openstack the counts are 20 and 28.

No worries. Just give me 200 instances of anti-patterns that belong to five categories. It is OK if we get 10-20 instances from one repo.

@akondrahman Bhaiya, I added one more condition while finding the repo. Last commit date should be within 2020. Otherwise those repos may be inactive.

I am having some difficulty matching exactly 200: my script somehow does not stop. So the total number of collected instances became 253, from 25 repositories. Please check whether the attached file format will do or whether I need to modify it.
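A collection loop that stops at exactly 200 instances could be sketched as follows. It assumes the candidate instances arrive as `(repo, script, category)` tuples and uses a per-repository cap of 20, following the "10-20 instances from one repo" remark earlier in the thread. The function and parameter names are hypothetical.

```python
def collect_instances(instances, limit=200, per_repo_cap=20):
    """Pick up to `limit` instances, taking at most `per_repo_cap` per repo.

    instances: iterable of (repo, script, category) tuples (assumed layout).
    """
    picked = []
    per_repo = {}
    for repo, script, category in instances:
        if len(picked) >= limit:
            break  # stop as soon as the target is reached
        if per_repo.get(repo, 0) >= per_repo_cap:
            continue  # this repo already contributed its share
        picked.append((repo, script, category))
        per_repo[repo] = per_repo.get(repo, 0) + 1
    return picked
```

Checking the limit at the top of the loop, before appending, is what keeps the result from overshooting the target.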

repo_list.txt

Looks good. Thanks!

@akondrahman Bhaiya, I have some questions and input from reading the paper:

1. TELIC identifies a test script to include a test play if a play within a script includes (i) one of the following keywords: ‘check’, ‘determine’, ‘ensure’, ‘test’, ‘validate’, and ‘verify’. Actually, TELIC classifies a script as a test script if it is under a "tests" directory and has a yml/yaml extension.
2. I have not determined the total test plays and LOC. Have you calculated those from the scripts, Bhaiya?
3. For selecting the oracle dataset, I used the RAND() function of MySQL to select 100 random scripts from our anti-pattern database.
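The directory-and-extension rule described above for recognizing test scripts can be sketched as below. This is only an illustration of the stated rule, not the tool's actual code. It assumes that any ancestor directory named exactly `tests` counts and that the extension check is case-insensitive, neither of which is specified in this thread.

```python
from pathlib import PurePosixPath


def is_test_script(path):
    """Return True if `path` is a yml/yaml file under a directory named 'tests'.

    Hypothetical helper: `path` is a POSIX-style relative path.
    """
    p = PurePosixPath(path)
    has_yaml_ext = p.suffix.lower() in {".yml", ".yaml"}
    under_tests = "tests" in p.parts[:-1]  # look only at directory components
    return has_yaml_ext and under_tests
```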

In Listing 6, our example of adding yum repositories from an external URL is actually the only way to add a new repository. But installing a package from an external repository would be an anti-pattern. For example, the following is a hypothetical anti-pattern:


```yaml
- name: Downloading nginx rpm
  get_url:
    url: http://nginx.org/packages/centos/{{ansible_distribution_major_version}}/noarch/RPMS/nginx-release-centos-{{ansible_distribution_major_version}}-0.el{{ansible_distribution_major_version}}.ngx.noarch.rpm
    dest: /tmp/ngx.noarch.rpm

- name: Install nginx
  yum:
    name: /tmp/ngx.noarch.rpm
    state: present
```

The right way to do this would have been to simply take full advantage of the yum module:

```yaml
- name: install nginx
  yum:
    name: nginx
    state: present
```

^ Put the above discussion in a new issue and assign it to me.

Done with this issue. Closing.

akondrahman / IaCTesting
