akondrahman / IaCTesting

Placeholder for the research study related to IaC testing anti-patterns

Data for TAMI #17

Closed Talismanic closed 3 years ago

Talismanic commented 3 years ago
  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab. I will do the filtering of Python and TOX files myself and plug in the results.
  2. The commit count, the test-related commit count (i.e., the count of commits that modified YAML test files), the total Ansible scripts, the total test scripts, and the duration of all repos for the three datasets: Openstack, GitHub, and GitLab.
  3. Only the YAML scripts from the three datasets: Openstack, GitHub, and GitLab. Please preserve the directory structure; otherwise I can't map them to your CSV results.
  4. The examples that I asked for before: a. I want a lot of setup and no cleanup.
  5. 200 scripts to submit GitHub issues.
Talismanic commented 3 years ago

@akondrahman Bhaiya,

For 1, which counts do we need? Are you referring to Tables 7, 8, 9, and 10?

akondrahman commented 3 years ago

No, only Table 9: the count of anti-patterns for each category. Before you give me the categories, TAMI needs to be adjusted for environment cleanup. Someone can clean up by using a dedicated task or role, so we need to check the tags, tasks, and roles and see if the keyword clean or teardown appears. For example, as done in this blog post: https://janikvonrotz.ch/2018/02/26/working-with-ansible-cleanup-tasks/
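A rough sketch of such a keyword check (assuming plays have already been parsed from YAML into dicts; the function names and structure are illustrative, not TAMI's actual implementation):

```python
# Hypothetical sketch: flag a play as having environment cleanup if any of
# its tasks, roles, or tags mentions "clean" or "teardown".
CLEANUP_KEYWORDS = ("clean", "teardown")

def mentions_cleanup(text):
    """Case-insensitive keyword check on a single string (None-safe)."""
    text = (text or "").lower()
    return any(kw in text for kw in CLEANUP_KEYWORDS)

def has_env_cleanup(play):
    """`play` is a dict parsed from an Ansible YAML play."""
    # Check task names and their tags.
    for task in play.get("tasks", []) + play.get("post_tasks", []):
        if mentions_cleanup(task.get("name")):
            return True
        tags = task.get("tags", [])
        if isinstance(tags, str):
            tags = [tags]
        if any(mentions_cleanup(t) for t in tags):
            return True
    # Check role names (roles may be listed as strings or dicts).
    for role in play.get("roles", []):
        name = role if isinstance(role, str) else role.get("role", "")
        if mentions_cleanup(name):
            return True
    return False
```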

Let me know if you have questions @Talismanic

Talismanic commented 3 years ago

We need to check the tags, tasks, and roles and see if the keyword clean or teardown appears.

Implementing this with priority.

Talismanic commented 3 years ago

@akondrahman Bhaiya, the changes have been accommodated. This change requires re-running TAMI on the entire dataset, so I am going to run TAMI in batch-runner mode and update the counts. This may take until tomorrow.

akondrahman commented 3 years ago

Thanks @Talismanic for all the hard work. Send me the CSVs when all results are ready. In the meantime, can you send me a ZIP file with all the YAML test scripts for the three datasets?

Talismanic commented 3 years ago

In the meantime, can you send me a ZIP file with all the YAML test scripts for the three datasets?

Bhaiya, do you need those files organized in any particular structure, or will a dump of all the files do?

akondrahman commented 3 years ago

I want it like this:

```
ZIP
|- Openstack
|------ subdir1
|---------- subdir1/subsubdir1
|- GitHub
|------ subdir1
|---------- subdir1/subsubdir1
|- GitLab
|------ subdir1
|---------- subdir1/subsubdir1
```

Please preserve the structure and paths so that I can map them to the CSV results file. I want the raw YAML files to gain further empirical insights, if any.
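One way to build such a ZIP while preserving the layout (a minimal sketch, assuming the dataset directories sit under one root; the function name is illustrative):

```python
import os
import zipfile

def zip_yaml_tree(root, out_zip):
    """Walk `root` and archive only .yml/.yaml files, storing each file
    under its path relative to `root` so the directory structure survives."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _subdirs, filenames in os.walk(root):
            for fname in filenames:
                if fname.endswith((".yml", ".yaml")):
                    full = os.path.join(dirpath, fname)
                    # arcname keeps e.g. "Openstack/subdir1/subsubdir1/x.yml"
                    zf.write(full, arcname=os.path.relpath(full, root))
```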

akondrahman commented 3 years ago

@Talismanic

I have handled this ... it needed some tweaks in the writing ... so there is no need to work on the following:

The examples that I asked before: a. I want a lot of setup and no cleanup.

akondrahman commented 3 years ago

@Talismanic

So far I have finished writing the Background and Related Work and RQ1 ... I need the following data, which I asked for earlier, to proceed further:

  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

  2. Only YAML scripts from the three datasets: Openstack, GitHub, and GitLab.

Can I expect them in the next 12 hours or so?

Talismanic commented 3 years ago

@Talismanic

So far I have finished writing the Background and Related Work and RQ1 ... I need the following data, which I asked for earlier, to proceed further:

  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

  2. Only YAML scripts from the three datasets: Openstack, GitHub, and GitLab.

Can I expect them in the next 12 hours or so?

I am working on these, Bhaiya. As I could not store any repos locally, the whole thing is taking some time, but hopefully it will be finished within the next 12 hours.

Talismanic commented 3 years ago

Dear @akondrahman Bhai, there is a new situation after I excluded the Python files and added the new logic for No Env Clean Up. In summary, our data counts have dropped significantly. This is the cumulative data for GitLab and GitHub; I have not yet finished separating them. Don't these numbers look very small?

| anti-pattern name | project count | file count | total count |
|---|---|---|---|
| Skip Ansible Lint | 6 | 6 | 22 |
| Local Only Test | 25 | 35 | 40 |
| Assertion Roulette | 2 | 2 | 2 |
| External Dependency | 8 | 26 | 45 |
| No Env Clean Up | 166 | 2214 | 2214 |
For comparison, the previous counts were:

| anti-pattern name | project count | file count | total count |
|---|---|---|---|
| Skip Ansible Lint | 6 | 6 | 22 |
| Local Only Test | 25 | 35 | 40 |
| Assertion Roulette | 123 | 4461 | 38629 |
| External Dependency | 92 | 1501 | 7763 |
| No Env Clean Up | 229 | 9784 | 9784 |
Talismanic commented 3 years ago

@akondrahman Bhaiya, the total count for each category of anti-patterns:

  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 22 | 0 | 18 |
| Local Only Test | 37 | 3 | 16 |
| Assertion Roulette | 2 | 0 | 3 |
| External Dependency | 45 | 0 | 18 |
| No Env Clean Up | 2164 | 50 | 96 |

I have one observation here: many of the Openstack repos are also available in the GitHub repo set.

akondrahman commented 3 years ago

I have one observation here: many of the Openstack repos are also available in the GitHub repo set.

To handle this, do not include the Openstack repos in the GitHub repo set, so that there is no Openstack data in the GitHub data.

As discussed in issue #18, you need to redo the analysis for GitLab, as you will be collecting more test.yml files.

Don't these numbers look very small?

Don't worry about the numbers now. Our job as researchers is to report accurate scientific results; we should not do anything to make the results look good.

@Talismanic ... when can I get the stuff that I needed? Today is Christmas day and my whole day is open to work on your paper :)

Talismanic commented 3 years ago

@Talismanic ... when can I get the stuff that I needed? Today is Christmas day and my whole day is open to work on your paper :)

Bhaiya, I could not automate the whole process of cleaning out the other files while keeping the structure the same as the original, so I am cherry-picking the repositories. So far I have finished clearing 59 repositories out of 166; I am attaching those here. I am working rigorously to get the rest done as early as possible.

mined_repos-Copy.zip

akondrahman commented 3 years ago

Thanks for the update. If it is easier on you, you can give me all the repos without filtering and I can do the filtering myself.

Talismanic commented 3 years ago

Thanks for the update. If it is easier on you, you can give me all the repos without filtering and I can do the filtering myself.

Bhaiya, I have some DB setup and some dirty PowerShell scripts for the cleanup; it would be a bit troublesome for you to start from scratch. Please allow me some time; I will finish it, inshallah. Also, when I cross the milestone of 100, I will share one more ZIP with you.

akondrahman commented 3 years ago

OK. I will wait. Thanks for all the hard work.

akondrahman commented 3 years ago

Also, when I cross the milestone of 100, I will share one more ZIP with you.

Send them by dataset: first Openstack, then GitLab, and then GitHub, if possible. I also still do not have the full anti-pattern count dataset for GitHub, GitLab, and Openstack.

Talismanic commented 3 years ago

@akondrahman Bhai, unfortunately I started with GitHub & GitLab first. The dataset for GitLab & GitHub is ready; I uploaded it at the link below.

https://drive.google.com/file/d/1QYKnLVzRV-taTm6k3PgQjKcNnV1t1PZ3/view?usp=sharing

I am working on OpenStack.

Also, the full anti-pattern count dataset is attached here. There are two files: one for GitHub + GitLab and the other for Openstack. antipatterns.zip

Talismanic commented 3 years ago

@akondrahman Bhaiya, I have made a mistake. While running TAMI on the Openstack data, I was on a branch where Python code was not excluded, so the data was erroneous. I have rectified it, and here is the updated Openstack anti-pattern data. I have also updated the counts in the comment above.

openstack_anti-pattern_data.zip

akondrahman commented 3 years ago

Thanks @Talismanic !

Two issues:

Talismanic commented 3 years ago

For the GitLab output, how do I separate GitHub and GitLab output? Using repo_type = 2?

  1. Yes, Bhaiya. repo_type = 1 means GitHub and repo_type = 2 means GitLab.
  2. Google Drive has blocked this ZIP, saying that it may contain data subject to a policy violation. Let me try a different sharing mechanism.
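The repo_type split described above could be sketched with a small helper (a hypothetical sketch; the CSV layout beyond the repo_type column is illustrative, not the actual TAMI output schema):

```python
import csv
import io

# Hypothetical sketch: split one combined TAMI output CSV into GitHub and
# GitLab rows using the repo_type column (1 = GitHub, 2 = GitLab).
def split_by_repo_type(csv_text):
    """Return (github_rows, gitlab_rows) from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    github = [r for r in rows if r["repo_type"] == "1"]
    gitlab = [r for r in rows if r["repo_type"] == "2"]
    return github, gitlab
```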
Talismanic commented 3 years ago

GitHub Part 1 Github-1.zip

Talismanic commented 3 years ago

Github Part 2 Github-2.zip

Talismanic commented 3 years ago

Github Part 3 Github-3.zip

Talismanic commented 3 years ago

Gitlab Gitlab.zip

akondrahman commented 3 years ago

Thanks a whole bunch ... I think I can start writing RQ2 of the paper.

Talismanic commented 3 years ago

Open Stack repos: open-stack-new-repos.zip

Talismanic commented 3 years ago

@akondrahman Bhai,

Now I still have 2 action points:

  1. The commit count, the test-related commit count (i.e., the count of commits that modified YAML test files), the total Ansible scripts, the total test scripts, and the duration of all repos for the three datasets: Openstack, GitHub, and GitLab.
  2. 200 scripts to submit GitHub issues.

I will start working on these tomorrow.

Talismanic commented 3 years ago

@akondrahman Bhai, I need some help with the queries below:

  1. If we count the total .yml files as the total Ansible script count, that might give an over-estimation. We can narrow the filtering with a folder filter: for example, each Ansible repo should have a `tasks` folder for holding tasks, so if we count the .yml/.yaml files in the `tasks` folder, we get the files which are actually Ansible files. However, if someone does not follow that standard, we will under-estimate.
  2. For the 200 scripts to post bugs, how many repositories should we aim for? A minimum of 80-100?
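The tasks-folder filter from point 1 could look roughly like this (a minimal sketch; the function name and layout assumptions are mine, not the actual mining code):

```python
import os

def count_ansible_scripts(repo_root):
    """Count .yml/.yaml files sitting directly inside a directory named
    'tasks' -- a conservative proxy for real Ansible task files that
    under-counts repos not following the standard role layout."""
    count = 0
    for dirpath, _subdirs, files in os.walk(repo_root):
        if os.path.basename(dirpath) == "tasks":
            count += sum(f.endswith((".yml", ".yaml")) for f in files)
    return count
```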
akondrahman commented 3 years ago
  1. I agree.
  2. I think something between 40-50 repos is fine; 50 is the max and 30 the minimum.
akondrahman commented 3 years ago

Before attempting 2, you need to address issue #19 ... this will change the anti-pattern counts. Will you be done with 1 in the next 2-3 hours, @Talismanic?

akondrahman commented 3 years ago

@Talismanic

I will have a good amount of time until Dec 31 to work on your paper. If you can send me the data that I requested within the next 24 hours, that would allow me to finish off the writing for RQ2, RQ3, and the Discussion. After Jan 01 I will be busy with other papers and university activities.

Talismanic commented 3 years ago

Before attempting 2, you need to address issue #19 ... this will change the anti-pattern counts. Will you be done with 1 in the next 2-3 hours, @Talismanic?

Sorry bhaiya, I was not available last night. The work can be done within 2-3 hours.

Talismanic commented 3 years ago

Point 2: WIP (mining ongoing)

| Metric | GitHub | GitLab | Openstack |
|---|---|---|---|
| commit count | 700 k | 8.2 k | 258 k |
| test-related commit count | 276 k | 6 k | 43.6 k |
| total Ansible scripts | 66.4 k | 2 k | 11.2 k |
| total test scripts | 5.2 k | 52 | 511 |
| avg duration of all repos (months) | 43 | 12 | 75 |
Talismanic commented 3 years ago

@akondrahman Bhaiya, for point 2 I am facing a dilemma in counting the test_related_commits:

  1. When I count the commits for ".yml" files, many commits come up which are not related to testing; even commits that only change the Travis or Circle CI integration are included. So this is a huge over-estimation.
  2. I tried another way: counting the commits where the changed file paths contain the "tests" substring. This gives a lower commit count, but the problem is that it also includes non-IaC commits, such as normal Python scripts. So this is also an over-estimation.
  3. Another approach just came to my mind: I can combine option 1 and option 2 with an AND condition, which might give a better estimate.

Which option should I follow?

akondrahman commented 3 years ago

I think it is better to ignore the test-related commit count. Just calculate the Ansible-related commits.

Talismanic commented 3 years ago

Just calculate the Ansible-related commits.

OK. For that, I think approach 1 (counting .yml files) is sufficient. Scripts are running to extract that, Bhaiya. The estimated time of completion is around 8 hours. :(

akondrahman commented 3 years ago

OK ... I will wait. In the meantime, please update the cleanup algorithm in TAMI, and give me the new CSVs for the three datasets when ready.

akondrahman commented 3 years ago

Once the results are ready, let me know @Talismanic

Talismanic commented 3 years ago

@akondrahman Bhai, here is the updated raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 3 | 0 | 19 |
| Local Only Test | 19 | 3 | 18 |
| Assertion Roulette | 1 | 0 | 1 |
| External Dependency | 25 | 0 | 20 |
| No Env Clean Up | 42 | 5 | 9 |

Attaching the raw count file; repo_type = 3 means Openstack.

iac_anti_patterns.zip

Calculating more data Bhaiya.

Talismanic commented 3 years ago

@akondrahman bhai, Rest of the metrics:

| Metric | GitHub | GitLab | Openstack |
|---|---|---|---|
| Total Repos | 324 | 91 | 54 |
| Total Projects | 347 | 92 | 49 |
| commit count | 700696 | 8219 | 258523 |
| ansible-related commit count | 276104 | 6090 | 43649 |
| total Ansible scripts | 66400 | 2065 | 11233 |
| total test scripts | 5198 | 52 | 511 |
| avg duration of all repos (months) | 43 | 12 | 75 |
Talismanic commented 3 years ago

@akondrahman Bhai, as I have taken all the Openstack repos out of the GitHub set, the Table 7 data will be updated. The updates are:

Data for Table 7:

| Type | Openstack | GitHub | GitLab |
|---|---|---|---|
| Initial Count | 1253 | 3405k | NA |
| Criteria-1 (Ansible script) | 96 | 6633 | 8194 |
| Criteria-2 (Not a fork) | 96 | 4147 | 7512 |
| Criteria-3 (Contributor count >= 3) | 94 | 856 | 546 |
| Criteria-4 (Commits/month >= 2) | 90 | 770 | 332 |
| Criteria-5 (Lifetime > 1 month) | 90 | 675 | 279 |
| Criteria-6 (10% IaC scripts) | 54 | 325 | 91 |
Talismanic commented 3 years ago

@akondrahman Bhai, the small count for External Dependency surprised me a bit, as I saw many external dependencies when I was manually sorting the .yml files for you. I reviewed the code and found that I had made a mistake while detecting URLs in the test scripts. After fixing that, I am seeing a sharp increase in this anti-pattern. The revised counts are:

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 3 | 0 | 19 |
| Local Only Test | 19 | 3 | 18 |
| Assertion Roulette | 1 | 0 | 1 |
| External Dependency | 765 | 8 | 125 |
| No Env Clean Up | 42 | 5 | 9 |

Attaching the raw count: iac_anti_patterns.zip

I sincerely apologize for this kind of mistake. I am also going to review the other methods for any remaining logic-level mistakes.

Talismanic commented 3 years ago

@akondrahman Bhaiya, I am done with the code review and logic checking. I also implemented a check for codebases where explicit roles are not used in the scripts. After those changes, I found that the Local Only Test and Assertion Roulette counts have increased significantly.

This is expected as the coverage of TAMI increased after handling the role-less scripts.

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 3 | 0 | 19 |
| Local Only Test | 245 | 30 | 19 |
| Assertion Roulette | 527 | 3 | 1 |
| External Dependency | 757 | 8 | 133 |
| No Env Clean Up | 42 | 5 | 9 |

Updated raw count.

iac_anti_patterns.zip

I think I am done with the counting and data.

akondrahman commented 3 years ago

Thanks for the hard work. I will plug in the results.

akondrahman commented 3 years ago

@Talismanic

I need the Openstack and GitHub YAML ZIPs again. It seems you have added more repos. I am expecting a ZIP file of 495 scripts for Openstack and 4942 scripts for GitHub, preserving the whole directory structure. Without this I can't plug in the smell density values and the count-per-play values. Here is the structure:

```
ZIP
|- Openstack
|------ subdir1
|---------- subdir1/subsubdir1
```

Completing just RQ2 is taking a week! I hope this loop will close soon.

akondrahman commented 3 years ago

@akondrahman bhai, Rest of the metrics:

| Metric | GitHub | GitLab | Openstack |
|---|---|---|---|
| Total Repos | 324 | 91 | 54 |
| Total Projects | 347 | 92 | 49 |
| commit count | 700 k | 8.2 k | 258 k |
| ansible-related commit count | 276 k | 6 k | 43.6 k |
| total Ansible scripts | 66.4 k | 2 k | 11.2 k |
| total test scripts | 5.2 k | 52 | 511 |
| avg duration of all repos (months) | 43 | 12 | 75 |

@Talismanic, I need the full and accurate numbers here: 8.2 k and 6 k will not work. Please update the table with full values, not abbreviations.

Talismanic commented 3 years ago

@Talismanic, I need the full and accurate numbers here: 8.2 k and 6 k will not work. Please update the table with full values, not abbreviations.

Done Bhaiya.

akondrahman commented 3 years ago

Thanks @Talismanic. I will wait on the YAML files ... I need them to calculate the anti-pattern density metric and the count-per-play metric. When will the YAML files be ready? All you need to do is dump all the YAML scripts while maintaining the directory structure, is that right?