akondrahman / IaCTesting

Placeholder for the research study related to IaC testing anti-patterns

Data for TAMI #17

Closed Talismanic closed 3 years ago

Talismanic commented 3 years ago
  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab. I will do the filtering of Python and TOX files myself and plug in the results.
  2. The commit count, the test-related commit count (i.e., the count of commits that modified YAML test files), the total Ansible scripts, the total test scripts, and the duration of all repos for the three datasets: Openstack, GitHub, and GitLab.
  3. Only the YAML scripts from the three datasets: Openstack, GitHub, and GitLab. Please preserve the directory structure; otherwise I can't map them to your CSV results.
  4. The examples that I asked for before: a. I want a lot of setup and no cleanup.
  5. 200 scripts to submit GitHub issues.
Talismanic commented 3 years ago

@akondrahman Bhaiya,

For 1, which counts do we need? Are you referring to Tables 7, 8, 9, and 10?

akondrahman commented 3 years ago

No, only Table 9: the count of anti-patterns for each category. Before you give me the categories, TAMI needs to be adjusted for environment cleanup. Someone can clean up by using a dedicated task or role, so we need to check the tags, tasks, and roles and see if the keyword clean or teardown appears. For example, as done in this blog post: https://janikvonrotz.ch/2018/02/26/working-with-ansible-cleanup-tasks/
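A rough sketch of such a keyword check (assuming plays have already been parsed from YAML into dicts; the function names and structure are illustrative, not TAMI's actual implementation):

```python
# Hypothetical sketch: flag a play as having environment cleanup if any of
# its tasks, roles, or tags mentions "clean" or "teardown".
CLEANUP_KEYWORDS = ("clean", "teardown")

def mentions_cleanup(text):
    """Case-insensitive keyword check on a single string (None-safe)."""
    text = (text or "").lower()
    return any(kw in text for kw in CLEANUP_KEYWORDS)

def has_env_cleanup(play):
    """`play` is a dict parsed from an Ansible YAML play."""
    # Check task names and their tags.
    for task in play.get("tasks", []) + play.get("post_tasks", []):
        if mentions_cleanup(task.get("name")):
            return True
        tags = task.get("tags", [])
        if isinstance(tags, str):
            tags = [tags]
        if any(mentions_cleanup(t) for t in tags):
            return True
    # Check role names (roles may be listed as strings or dicts).
    for role in play.get("roles", []):
        name = role if isinstance(role, str) else role.get("role", "")
        if mentions_cleanup(name):
            return True
    return False
```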

Let me know if you have questions @Talismanic

Talismanic commented 3 years ago

We need to check the tags, tasks, and roles and see if the keyword clean or teardown appears.

Implementing this with priority.

Talismanic commented 3 years ago

@akondrahman Bhaiya, the changes have been accommodated. This change requires re-running TAMI on the entire dataset, so I am going to run TAMI in batch-runner mode and update the counts. This may take until tomorrow.

akondrahman commented 3 years ago

Thanks @Talismanic for all the hard work. Send me the CSVs when all results are ready. In the meantime, can you send me a ZIP file with all the YAML test scripts for the three datasets?

Talismanic commented 3 years ago

In the meantime, can you send me a ZIP file with all the YAML test scripts for the three datasets?

Bhaiya, do you need those files organized in any particular structure, or will a dump of all the files do?

akondrahman commented 3 years ago

I want it like this:

```
ZIP
|- Openstack
|------ subdir1
|---------- subdir1/subsubdir1
|- GitHub
|------ subdir1
|---------- subdir1/subsubdir1
|- GitLab
|------ subdir1
|---------- subdir1/subsubdir1
```

Please preserve the structure and paths so that I can map them to the CSV results file. I want the raw YAML files to gain further empirical insights, if any.
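One way to build such a ZIP while preserving the layout (a minimal sketch, assuming the dataset directories sit under one root; the function name is illustrative):

```python
import os
import zipfile

def zip_yaml_tree(root, out_zip):
    """Walk `root` and archive only .yml/.yaml files, storing each file
    under its path relative to `root` so the directory structure survives."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _subdirs, filenames in os.walk(root):
            for fname in filenames:
                if fname.endswith((".yml", ".yaml")):
                    full = os.path.join(dirpath, fname)
                    # arcname keeps e.g. "Openstack/subdir1/subsubdir1/x.yml"
                    zf.write(full, arcname=os.path.relpath(full, root))
```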

akondrahman commented 3 years ago

@Talismanic

I have handled this ... it needed some tweaks in the writing ... so there is no need to work on the following:

The examples that I asked before: a. I want a lot of setup and no cleanup.

akondrahman commented 3 years ago

@Talismanic

So far I have finished writing the Background and Related Work and RQ1 ... I need the following data, which I asked for earlier, to proceed further:

  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

  2. Only YAML scripts from the three datasets: Openstack, GitHub, and GitLab.

Can I expect them in the next 12 hours or so?

Talismanic commented 3 years ago

@Talismanic

So far I have finished writing the Background and Related Work and RQ1 ... I need the following data, which I asked for earlier, to proceed further:

  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

  2. Only YAML scripts from the three datasets: Openstack, GitHub, and GitLab.

Can I expect them in the next 12 hours or so?

I am working on these, Bhaiya. As I could not store any repos locally, the whole thing is taking some time, but hopefully it will be finished within the next 12 hours.

Talismanic commented 3 years ago

Dear @akondrahman Bhai, there is a new situation after I excluded the Python files and added the new logic for No Env Clean Up. In summary, our data counts have dropped significantly. This is the cumulative data for GitLab and GitHub; I have not yet finished separating them. Don't these numbers look very small?

| anti-pattern name | project count | file count | total count |
|---|---|---|---|
| Skip Ansible Lint | 6 | 6 | 22 |
| Local Only Test | 25 | 35 | 40 |
| Assertion Roulette | 2 | 2 | 2 |
| External Dependency | 8 | 26 | 45 |
| No Env Clean Up | 166 | 2214 | 2214 |
For comparison, the previous counts were:

| anti-pattern name | project count | file count | total count |
|---|---|---|---|
| Skip Ansible Lint | 6 | 6 | 22 |
| Local Only Test | 25 | 35 | 40 |
| Assertion Roulette | 123 | 4461 | 38629 |
| External Dependency | 92 | 1501 | 7763 |
| No Env Clean Up | 229 | 9784 | 9784 |
Talismanic commented 3 years ago

@akondrahman Bhaiya, the total count for each category of anti-patterns:

  1. The raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 22 | 0 | 18 |
| Local Only Test | 37 | 3 | 16 |
| Assertion Roulette | 2 | 0 | 3 |
| External Dependency | 45 | 0 | 18 |
| No Env Clean Up | 2164 | 50 | 96 |

I have one observation here: many of the Openstack repos are also available in the GitHub repo set.

akondrahman commented 3 years ago

I have one observation here: many of the Openstack repos are also available in the GitHub repo set.

To handle this, do not include the Openstack repos in the GitHub repo set, so that there is no Openstack data in the GitHub data.

As discussed in issue #18, you need to redo the analysis for GitLab, as you will be collecting more test.yml files.

Don't these numbers look very small?

Don't worry about the numbers now. Our job as researchers is to report accurate scientific results; we should not do anything to make the results look good.

@Talismanic ... when can I get the stuff that I needed? Today is Christmas day and my whole day is open to work on your paper :)

Talismanic commented 3 years ago

@Talismanic ... when can I get the stuff that I needed? Today is Christmas day and my whole day is open to work on your paper :)

Bhaiya, I could not automate the whole process of cleaning out the other files while keeping the structure the same as the original, so I am cherry-picking the repositories. So far I have finished clearing 59 repositories out of 166; I am attaching those here. I am working rigorously to get the rest done as early as possible.

mined_repos-Copy.zip

akondrahman commented 3 years ago

Thanks for the update. If it is easier on you, you can give me all the repos without filtering and I can do the filtering myself.

Talismanic commented 3 years ago

Thanks for the update. If it is easier on you, you can give me all the repos without filtering and I can do the filtering myself.

Bhaiya, I have some DB setup and some dirty PowerShell scripts for the cleanup; it would be a bit troublesome for you to start from scratch. Please allow me some time; I will finish it, inshallah. Also, when I cross the milestone of 100, I will share one more ZIP with you.

akondrahman commented 3 years ago

OK. I will wait. Thanks for all the hard work.

akondrahman commented 3 years ago

Also, when I cross the milestone of 100, I will share one more ZIP with you.

Send them by dataset: first Openstack, then GitLab, and then GitHub, if possible. I also still do not have the full anti-pattern count dataset for GitHub, GitLab, and Openstack.

Talismanic commented 3 years ago

@akondrahman Bhai, unfortunately I started with GitHub & GitLab first. The dataset for GitLab & GitHub is ready; I uploaded it at the link below.

https://drive.google.com/file/d/1QYKnLVzRV-taTm6k3PgQjKcNnV1t1PZ3/view?usp=sharing

I am working on OpenStack.

Also, the full anti-pattern count dataset is attached here. There are two files: one for GitHub + GitLab and the other for Openstack. antipatterns.zip

Talismanic commented 3 years ago

@akondrahman Bhaiya, I have made a mistake. While running TAMI on the Openstack data, I was on a branch where Python code was not excluded, so the data was erroneous. I have rectified it, and here is the updated Openstack anti-pattern data. I have also updated the counts in the comment above.

openstack_anti-pattern_data.zip

akondrahman commented 3 years ago

Thanks @Talismanic !

Two issues:

Talismanic commented 3 years ago

For the GitLab output, how do I separate GitHub and GitLab output? Using repo_type = 2?

  1. Yes, Bhaiya. repo_type = 1 means GitHub and repo_type = 2 means GitLab.
  2. Google Drive has blocked this ZIP, saying that it may contain data subject to a policy violation. Let me try a different sharing mechanism.
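The repo_type split described above could be sketched with a small helper (a hypothetical sketch; the CSV layout beyond the repo_type column is illustrative, not the actual TAMI output schema):

```python
import csv
import io

# Hypothetical sketch: split one combined TAMI output CSV into GitHub and
# GitLab rows using the repo_type column (1 = GitHub, 2 = GitLab).
def split_by_repo_type(csv_text):
    """Return (github_rows, gitlab_rows) from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    github = [r for r in rows if r["repo_type"] == "1"]
    gitlab = [r for r in rows if r["repo_type"] == "2"]
    return github, gitlab
```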
Talismanic commented 3 years ago

GitHub Part 1 Github-1.zip

Talismanic commented 3 years ago

Github Part 2 Github-2.zip

Talismanic commented 3 years ago

Github Part 3 Github-3.zip

Talismanic commented 3 years ago

Gitlab Gitlab.zip

akondrahman commented 3 years ago

Thanks a whole bunch ... I think I can start writing RQ2 of the paper.

Talismanic commented 3 years ago

Open Stack repos: open-stack-new-repos.zip

Talismanic commented 3 years ago

@akondrahman Bhai,

Now I still have 2 action points:

  1. The commit count, the test-related commit count (i.e., the count of commits that modified YAML test files), the total Ansible scripts, the total test scripts, and the duration of all repos for the three datasets: Openstack, GitHub, and GitLab.
  2. 200 scripts to submit GitHub issues.

I will start working on these tomorrow.

Talismanic commented 3 years ago

@akondrahman Bhai, I need some help with the queries below:

  1. If we count the total .yml files as the total Ansible script count, that might give an over-estimation. We can narrow the filtering with a folder filter: for example, each Ansible repo should have a `tasks` folder for holding tasks, so if we count the .yml/.yaml files in the `tasks` folder, we get the files which are actually Ansible files. However, if someone does not follow that standard, we will under-estimate.
  2. For the 200 scripts to post bugs, how many repositories should we aim for? A minimum of 80-100?
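The tasks-folder filter from point 1 could look roughly like this (a minimal sketch; the function name and layout assumptions are mine, not the actual mining code):

```python
import os

def count_ansible_scripts(repo_root):
    """Count .yml/.yaml files sitting directly inside a directory named
    'tasks' -- a conservative proxy for real Ansible task files that
    under-counts repos not following the standard role layout."""
    count = 0
    for dirpath, _subdirs, files in os.walk(repo_root):
        if os.path.basename(dirpath) == "tasks":
            count += sum(f.endswith((".yml", ".yaml")) for f in files)
    return count
```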
akondrahman commented 3 years ago
  1. I agree.
  2. I think something between 40-50 repos is fine; 50 is the max and 30 the minimum.
akondrahman commented 3 years ago

Before attempting 2, you need to address issue #19 ... this will change the anti-pattern counts. Will you be done with 1 in the next 2-3 hours, @Talismanic?

akondrahman commented 3 years ago

@Talismanic

I will have a good amount of time until Dec 31 to work on your paper. If you can send me the data that I requested within the next 24 hours, that would allow me to finish off the writing for RQ2, RQ3, and the Discussion. After Jan 01 I will be busy with other papers and university activities.

Talismanic commented 3 years ago

Before attempting 2, you need to address issue #19 ... this will change the anti-pattern counts. Will you be done with 1 in the next 2-3 hours, @Talismanic?

Sorry bhaiya, I was not available last night. The work can be done within 2-3 hours.

Talismanic commented 3 years ago

Point 2: WIP (mining ongoing)

| Metric | GitHub | GitLab | Openstack |
|---|---|---|---|
| commit count | 700 k | 8.2 k | 258 k |
| test-related commit count | 276 k | 6 k | 43.6 k |
| total Ansible scripts | 66.4 k | 2 k | 11.2 k |
| total test scripts | 5.2 k | 52 | 511 |
| avg duration of all repos (months) | 43 | 12 | 75 |
Talismanic commented 3 years ago

@akondrahman Bhaiya, for point 2 I am facing a dilemma in counting the test_related_commits:

  1. When I count the commits for ".yml" files, many commits come up which are not related to testing; even commits that only change the Travis or Circle CI integration are included. So this is a huge over-estimation.
  2. I tried another way: counting the commits where the changed file paths contain the "tests" substring. This gives a lower commit count, but the problem is that it also includes non-IaC commits, such as normal Python scripts. So this is also an over-estimation.
  3. Another approach just came to my mind: I can combine option 1 and option 2 with an AND condition, which might give a better estimate.

Which option should I follow?

akondrahman commented 3 years ago

I think it is better to ignore the test-related commit count. Just calculate the Ansible-related commits.

Talismanic commented 3 years ago

Just calculate the Ansible-related commits.

OK. For that, I think approach 1 (counting .yml files) is sufficient. Scripts are running to extract that, Bhaiya. The estimated time of completion is around 8 hours. :(

akondrahman commented 3 years ago

OK ... I will wait. In the meantime, please update the cleanup algorithm in TAMI, and give me the new CSVs for the three datasets when ready.

akondrahman commented 3 years ago

Once the results are ready, let me know @Talismanic

Talismanic commented 3 years ago

@akondrahman Bhai, here is the updated raw count from TAMI for the three datasets: Openstack, GitHub, and GitLab.

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 3 | 0 | 19 |
| Local Only Test | 19 | 3 | 18 |
| Assertion Roulette | 1 | 0 | 1 |
| External Dependency | 25 | 0 | 20 |
| No Env Clean Up | 42 | 5 | 9 |

Attaching the raw count file; repo_type = 3 means Openstack.

iac_anti_patterns.zip

Calculating more data Bhaiya.

Talismanic commented 3 years ago

@akondrahman bhai, Rest of the metrics:

| Metric | GitHub | GitLab | Openstack |
|---|---|---|---|
| Total Repos | 324 | 91 | 54 |
| Total Projects | 347 | 92 | 49 |
| commit count | 700696 | 8219 | 258523 |
| ansible-related commit count | 276104 | 6090 | 43649 |
| total Ansible scripts | 66400 | 2065 | 11233 |
| total test scripts | 5198 | 52 | 511 |
| avg duration of all repos (months) | 43 | 12 | 75 |
Talismanic commented 3 years ago

@akondrahman Bhai, as I have taken all the Openstack repos out of the GitHub set, the Table 7 data will be updated. The updates are:

Data for Table 7:

| Type | Openstack | GitHub | GitLab |
|---|---|---|---|
| Initial Count | 1253 | 3405k | NA |
| Criteria-1 (Ansible script) | 96 | 6633 | 8194 |
| Criteria-2 (Not a fork) | 96 | 4147 | 7512 |
| Criteria-3 (Contributor count >= 3) | 94 | 856 | 546 |
| Criteria-4 (Commits/month >= 2) | 90 | 770 | 332 |
| Criteria-5 (Lifetime > 1 month) | 90 | 675 | 279 |
| Criteria-6 (10% IaC scripts) | 54 | 325 | 91 |
Talismanic commented 3 years ago

@akondrahman Bhai, the small count for External Dependency surprised me a bit, as I saw many external dependencies when I was manually sorting the .yml files for you. I reviewed the code and found that I had made a mistake while detecting URLs in the test scripts. After fixing that, I am seeing a sharp increase in this anti-pattern. The revised counts are:

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 3 | 0 | 19 |
| Local Only Test | 19 | 3 | 18 |
| Assertion Roulette | 1 | 0 | 1 |
| External Dependency | 765 | 8 | 125 |
| No Env Clean Up | 42 | 5 | 9 |

Attaching the raw count: iac_anti_patterns.zip

I sincerely apologize for this kind of mistake. I am also going to review the other methods for any remaining logic-level mistakes.

Talismanic commented 3 years ago

@akondrahman Bhaiya, I am done with the code review and logic checking. I also implemented a check for codebases where explicit roles are not used in the scripts. After those changes, I found that the Local Only Test and Assertion Roulette counts have increased significantly.

This is expected as the coverage of TAMI increased after handling the role-less scripts.

| Antipattern Name | GitHub | GitLab | Openstack |
|---|---|---|---|
| Skip Ansible Lint | 3 | 0 | 19 |
| Local Only Test | 245 | 30 | 19 |
| Assertion Roulette | 527 | 3 | 1 |
| External Dependency | 757 | 8 | 133 |
| No Env Clean Up | 42 | 5 | 9 |

Updated raw count.

iac_anti_patterns.zip

I think I am done with the counting and data.

akondrahman commented 3 years ago

Thanks for the hard work. I will plug in the results.

akondrahman commented 3 years ago

@Talismanic

I need the Openstack and GitHub YAML ZIPs again. It seems you have added more repos. I am expecting a ZIP file of 495 scripts for Openstack and 4942 scripts for GitHub, preserving the whole directory structure. Without this I can't plug in the smell density values and the count-per-play values. Here is the structure:

```
ZIP
|- Openstack
|------ subdir1
|---------- subdir1/subsubdir1
```

Completing just RQ2 is taking a week! I hope this loop will close soon.

akondrahman commented 3 years ago

@akondrahman bhai, Rest of the metrics:

| Metric | GitHub | GitLab | Openstack |
|---|---|---|---|
| Total Repos | 324 | 91 | 54 |
| Total Projects | 347 | 92 | 49 |
| commit count | 700 k | 8.2 k | 258 k |
| ansible-related commit count | 276 k | 6 k | 43.6 k |
| total Ansible scripts | 66.4 k | 2 k | 11.2 k |
| total test scripts | 5.2 k | 52 | 511 |
| avg duration of all repos (months) | 43 | 12 | 75 |

@Talismanic, I need the full and accurate numbers here: 8.2 k and 6 k will not work. Please update the table with full values, not abbreviations.

Talismanic commented 3 years ago

@Talismanic, I need the full and accurate numbers here: 8.2 k and 6 k will not work. Please update the table with full values, not abbreviations.

Done Bhaiya.

akondrahman commented 3 years ago

Thanks @Talismanic. I will wait on the YAML files ... I need them to calculate the anti-pattern density metric and the count-per-play metric. When will the YAML files be ready? All you need to do is dump all the YAML scripts while maintaining the directory structure, is that right?