Closed akondrahman closed 3 years ago
RQ1: How frequently are tests conducted for Ansible scripts?
Already completed.
RQ2: How frequently do bugs appear in test code for Ansible scripts?
Steps:
Bug-related keywords: error, bug, fix, issue, mistake, incorrect, fault, defect, and flaw
Headers: `repo_path`, `test_script`, `commit_hash`, `commit_message`, `bug_flag`. Create this file for both datasets: GitHub and OpenStack. @Talismanic ^
Reference for the keywords: https://web.cs.ucdavis.edu/~filkov/papers/lang_github.pdf
4. Create a file that includes the headers: `repo_path`, `test_script`, `commit_hash`, `commit_message`, `bug_flag`. Create this file for both datasets: GitHub and OpenStack
Bhaiya, I have prepared the dataset. Please find it in the attachment. Because commit_message contains ",", I had to use a "|"-separated CSV file to store the data.
Summary of the data:
GitHub & OpenStack difference:
Criteria | GitHub | OpenStack | Total |
---|---|---|---|
Repo count | 67 | 24 | 91 |
File count | 1291 | 244 | 1535 |
Commit count | 615 | 347 | 962 |
Project-wise file count (median) | 2 | 3 | |
Project-wise commit count (median) | 4 | 6.5 | |
File-wise commit count (median) | 1 | 2 | |
Project-wise file count (mean) | 16.87 | 10.2 | |
Project-wise commit count (mean) | 13.5 | 14.5 | |
File-wise commit count (mean) | 1.9 | 2.44 | |
project_wise_commit_count.zip project_wise_file_count.zip commit_messages.zip file_wise_commit_count.zip
Bhaiya, please suggest if we need to examine the data or find more metrics.
RQ3: How frequently are tests modified for Ansible scripts? What code elements are modified? Why are the code elements modified?
Proposed mechanism to detect it @akondrahman Bhai:
@Talismanic
I think RQ2 is not answered fully. First of all, the `commit_messages.csv` file is messed up and hard to parse. Please make it comma separated so that I can open it up and do some labeling. You can strip all special characters from the messages, like `, ; \n \r _` etc. That way the message content will remain without creating parsing issues. Second, in `commit_messages.csv` you also need to label whether a bug-related keyword appears. You can indicate that by assigning `1` or `0` in a separate column.
Please make these fixes before you proceed with RQ3. For questions please let me know.
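A minimal sketch of the clean-up and labeling requested above, assuming the keyword list from the start of this thread. The function names and the prefix-style keyword match (so that "fixes" or "bugs" also count) are illustrative assumptions, not the script that was actually used.

```python
import csv
import re

# Keyword list quoted at the top of this thread.
BUG_KEYWORDS = ("error", "bug", "fix", "issue", "mistake",
                "incorrect", "fault", "defect", "flaw")

# Assumption: a keyword counts when it starts a word, case-insensitively,
# so "Fixes" matches but "prefix" and "default" do not.
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(BUG_KEYWORDS) + r")", re.IGNORECASE)


def clean_message(message):
    """Replace commas, semicolons, underscores and line breaks with spaces."""
    cleaned = re.sub(r"[,;_\r\n]+", " ", message)
    return re.sub(r"\s+", " ", cleaned).strip()


def bug_flag(message):
    """Return 1 if the cleaned message contains a bug-related keyword, else 0."""
    return 1 if KEYWORD_RE.search(clean_message(message)) else 0


def write_rows(rows, out):
    """Write (repo_path, test_script, commit_hash, message) tuples as CSV."""
    writer = csv.writer(out)
    writer.writerow(["repo_path", "test_script", "commit_hash",
                     "commit_message", "bug_flag"])
    for repo_path, test_script, commit_hash, message in rows:
        writer.writerow([repo_path, test_script, commit_hash,
                         clean_message(message), bug_flag(message)])
```

Python's `csv.writer` would also quote fields that contain commas, so stripping is not strictly required for parsing; it is kept here because the request was to remove the characters outright.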
@akondrahman Bhai, Will this formatting work? commit_messages.zip
I have done this for 5 sample repos. If it is ok, I will run it for all repos.
Bugflag = 1 means bug-related keywords are found in the commit message.
@akondrahman Bhai, another query. We have two datasets. The first consists of 488 repositories (master data-1). The second is a subset of the master data with details for 156 repositories (sub-data-2). By details I mean the repo_name, the location of each test script, and the test smells identified in each file. At least 1 test smell is present in each file here.
For RQ2, should we scrape data-1 or data-2?
Scraping data-2 is easier because all the test files are already there, and it will take less time.
@Talismanic
Formatting of the CSV file looks OK. We need to scrape sub-data-2.
@akondrahman Bhai, after looking at the commit messages at a high level, my feeling is that not all of those commits are related to bugs in Ansible scripts. There are bug fixes for CI, the main software's code, the main software's configuration, etc. So will this data be sufficient to determine how frequently bugs appear in Ansible scripts?
@Talismanic
Don't worry about the amount of bug-related data at this moment. First separate bug-related commits that are used for Ansible scripts and for Ansible test scripts. Then, give me a % report, like how many bug-related commits are mapped to Ansible scripts, and how many are mapped to Ansible test scripts.
First separate bug-related commits that are used for Ansible scripts and for Ansible test scripts
@akondrahman Bhai, now we have all commits which have touched at least 1 Ansible test script and have at least 1 bug-related keyword in the commit message.
So my next steps should be:
Is my approach ok ?
@Talismanic
Thanks for the update.
Now we have all commits which have touched at least 1 Ansible test script and have at least 1 bug-related keyword
If you have this already, then no need to separate out Ansible development scripts. Instead, I would ask you to look at the diffs of these commits to see if the bug is for an Ansible development script or an Ansible test script. In either case keep a label like `D` for development and `T` for test, and generate a CSV file with that info.
@akondrahman Bhai, I have manually checked some diffs. The problem is that it's hard to tell whether a change in code is a normal change or a bug fix. Have you done something similar, or do you have any reference for such things?
@Talismanic
In my case I used my own judgement. You have to use your own judgement ... whatever you decide is final. Once you are done, please create a CSV with the following headers: COMMIT_HASH, MODIFIED_YAML, TIMESTAMP, LOC_ADDED, LOC_DELETED, BUG_RELATED, BUG_IN_DEV_OR_TEST, SMELL_COUNT_TYPEA_IN_FILE, SMELL_COUNT_TYPEB_IN_FILE, SMELL_COUNT_TYPEC_IN_FILE, SMELL_COUNT_TYPED_IN_FILE, SMELL_COUNT_TYPEE_IN_FILE
@akondrahman Bhaiya,
I have gathered the diffs of all `.yml` files from the commits that have at least 1 bug-related keyword in the commit message. The file is huge (3.3M lines). I am going to examine it hash by hash manually over the next couple of days and try to find a pattern. I will update you accordingly.
dump_commit_diffs.zip
Thanks. Take your time.
Note to self:
bug_script_type = 0; bug is not in iac dev script or test script
bug_script_type = 1; bug is in iac dev script
bug_script_type = 2; bug is in test script
bug_script_type = 3; bug is in both script
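The coding in this note can be read as two flags, 1 for "bug in dev script" and 2 for "bug in test script", added together. A small sketch under that reading; the class and function names are hypothetical, and the two boolean inputs stand for the reviewer's manual judgement on each diff.

```python
from enum import IntEnum


class BugScriptType(IntEnum):
    NONE = 0  # bug is not in IaC dev script or test script
    DEV = 1   # bug is in IaC dev script
    TEST = 2  # bug is in test script
    BOTH = 3  # bug is in both scripts


def classify(bug_in_dev, bug_in_test):
    """Map the two manual judgements to the 0..3 coding from the note."""
    return BugScriptType(int(bool(bug_in_dev)) + 2 * int(bool(bug_in_test)))
```

For example, `classify(True, True)` gives `BugScriptType.BOTH`, which is stored as 3 in the CSV.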
@akondrahman Bhai, in the last two days I have gone through 100k lines of review. I have the following observations:
SMELL_COUNT_TYPEA_IN_FILE, SMELL_COUNT_TYPEB_IN_FILE, SMELL_COUNT_TYPEC_IN_FILE, SMELL_COUNT_TYPED_IN_FILE, SMELL_COUNT_TYPEE_IN_FILE
these are not currently measurable w.r.t. our previously identified smells.
@Talismanic
The ICST deadline has moved up. We must complete all analysis by Aug 10.
@akondrahman Bhaiya, acknowledged.
@Talismanic
FYI: https://icst2022.vrain.upv.es/track/icst-2022-papers#Call-for-Papers
@akondrahman Bhai, the update so far: I started with 554 commits, and 329 still remain to be examined. I hope to complete this examination by 10 Jul 2021.
Thanks for the hard work @Talismanic !
@akondrahman Bhai, Some updates here:
Commit Type | Count |
---|---|
No Bug in IaC Dev or Test Script | 103 |
Bug in IaC Dev Script | 101 |
Bug in IaC Test Script | 199 |
Bug in both Script | 254 |
Could not parse | 1567 |
I am trying to figure out a way to parse the 1567 commits. I think some character encoding related issue is blocking those.
@Talismanic thanks ... keep me posted.
Dear @akondrahman Bhai, I have just finished the second round of review on the remaining 1567 commits. Out of those, I could not parse 846 commits. For the rest of the commits, the updated summary is below:
Commit Type | Count |
---|---|
No Bug in IaC Dev or Test Script | 523 |
Bug in IaC Dev Script | 145 |
Bug in IaC Test Script | 360 |
Bug in both Script | 350 |
Could not parse | 846 |
I think here I can pause this commit analysis. Kindly suggest your view.
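A short sketch of how the % report requested earlier could be derived from the counts in the table above. Leaving the unparsable commits out of the denominator is an assumption; the thread does not say how they should be treated, and the function name is illustrative.

```python
# Counts copied from the summary table above.
COUNTS = {
    "No Bug in IaC Dev or Test Script": 523,
    "Bug in IaC Dev Script": 145,
    "Bug in IaC Test Script": 360,
    "Bug in both Script": 350,
    "Could not parse": 846,
}


def percentage_report(counts, exclude=("Could not parse",)):
    """Return each category's share (in %) of the commits that were parsed."""
    parsed = {name: n for name, n in counts.items() if name not in exclude}
    total = sum(parsed.values())
    return {name: round(100 * n / total, 1) for name, n in parsed.items()}


if __name__ == "__main__":
    for name, pct in percentage_report(COUNTS).items():
        print(f"{name}: {pct}%")
```

Passing `exclude=()` reports shares of all 2224 commits instead, with "Could not parse" as its own category.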
@Talismanic
I think here I can pause this commit analysis
Sounds good. I think now we have the answers for How frequently do bugs appear in test code for Ansible scripts?
Please share the full CSV where you have everything: parsed commits and un-parsable commits.
As a gentle reminder, the CSV should have the following fields:
COMMIT_HASH, MODIFIED_YAML, TIMESTAMP, LOC_ADDED, LOC_DELETED, BUG_RELATED, BUG_IN_DEV_OR_TEST, SMELL_COUNT_TYPEA_IN_FILE, SMELL_COUNT_TYPEB_IN_FILE, SMELL_COUNT_TYPEC_IN_FILE, SMELL_COUNT_TYPED_IN_FILE, SMELL_COUNT_TYPEE_IN_FILE
One thing, @akondrahman Bhai: if I provide a CSV file, the loc added and loc deleted content sometimes becomes unreadable. I did the analysis on txt files to keep it readable. Should I provide CSV files or text files?
@Talismanic
You will give numbers for both `loc_added` and `loc_deleted`, not the text itself. So if 5 lines are added in a diff, then `loc_added` will be `5`.
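A minimal sketch of turning a diff into the two numbers described above, assuming the diffs are in git's unified format (`diff --git` headers followed by `@@` hunks). The function name is hypothetical. Lines are counted only inside hunks, so the `---`/`+++` file headers are not mistaken for changes, and a deleted YAML `---` marker (which appears as `----`) is still counted.

```python
from typing import Tuple


def count_loc(diff_text: str) -> Tuple[int, int]:
    """Return (loc_added, loc_deleted) for a git-style unified diff."""
    added = deleted = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False          # a new file section starts; skip its headers
        elif line.startswith("@@"):
            in_hunk = True           # hunk header; content lines follow
        elif in_hunk:
            if line.startswith("+"):
                added += 1
            elif line.startswith("-"):
                deleted += 1
    return added, deleted
```

To get per-file numbers, as in the requested CSV, the diff would be split on the `diff --git` lines first and each part passed to `count_loc` separately.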
Got it Bhaiya
@akondrahman Bhai, kindly find the attached data. For adding smell count type with this data, I need some more analysis.
Data Description
- Project name
- commit_hash
- file_name: YAML file which has been changed
- commit_date
- loc_added: number of lines added in this specific file in this commit
- loc_deleted
- bug_script_type: 0 => No bug in dev or test, 1 => bug in dev, 2 => bug in test, 3 => bug in both dev & test
@akondrahman Bhai, I have found that I had another table with a file-wise antipattern count. Just had to join these two. Please find the required data attached.
Data Description
- Project name
- commit_hash
- file_name: YAML file which has been changed
- commit_date
- loc_added: number of lines added in this specific file in this commit
- loc_deleted
- bug_script_type: 0 => No bug in dev or test, 1 => bug in dev, 2 => bug in test, 3 => bug in both dev & test, -1 => the diff could not be parsed
- SAL = Skip Ansible Lint
- LOT = Local Only Test
- AR = Assertion Roulette
- ED = External Dependency
- NEC = No Env Clean Up
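A sketch of the kind of join described above, using an in-memory SQLite database so it is self-contained. The table names, the join key (`project` plus `file_name`), and the two sample rows are made up for illustration; the actual tables in this work live in MySQL and may be keyed differently.

```python
import sqlite3

SCHEMA = """
CREATE TABLE commit_files (
    project TEXT, commit_hash TEXT, file_name TEXT,
    loc_added INTEGER, loc_deleted INTEGER, bug_script_type INTEGER
);
CREATE TABLE smell_counts (
    project TEXT, file_name TEXT,
    SAL INTEGER, LOT INTEGER, AR INTEGER, ED INTEGER, NEC INTEGER
);
"""

JOIN_SQL = """
SELECT c.project, c.commit_hash, c.file_name, c.loc_added, c.loc_deleted,
       c.bug_script_type, s.SAL, s.LOT, s.AR, s.ED, s.NEC
FROM commit_files AS c
LEFT JOIN smell_counts AS s
       ON s.project = c.project AND s.file_name = c.file_name
ORDER BY c.commit_hash, c.file_name
"""


def build_demo_db():
    """Create the two hypothetical tables with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO commit_files VALUES (?, ?, ?, ?, ?, ?)",
        [("demo", "aaa111", "tests/test.yml", 5, 2, 2),
         ("demo", "bbb222", "tasks/main.yml", 1, 1, 1)],
    )
    conn.execute(
        "INSERT INTO smell_counts VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("demo", "tests/test.yml", 1, 0, 2, 0, 1),
    )
    return conn


def joined_rows(conn):
    """Return one row per changed file, with smell counts where they exist."""
    return conn.execute(JOIN_SQL).fetchall()
```

The `LEFT JOIN` keeps every changed file and leaves the smell columns as `NULL` when no match is found, rather than filling in 0, so a key that fails to match stays visible in the output.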
@Talismanic why is `bugflag` = `1` for all commits? What does this field mean?
@Talismanic I am also noticing other problems:
- why is `AR` always 1? We did not identify assertion roulette in ALL scripts.
- why is `NEC` always 0? There must be some scripts with NEC instances.
Bhaiya, there was an issue in my update query. I was searching in MySQL with a WHERE clause on a TEXT field (file_name). That produced errors. I have fixed this issue. An updated LOC summary is attached.
Got it. @Talismanic why is bugflag = 1 for all commits? What does this field mean? Btw, happy birthday!
why is bugflag = 1 for all commits?
Sorry Bhaiya. I missed this question. I have only provided you the data where bugflag = 1. This means that I found bug-related keywords in the commit messages of these commits. Also, I analyzed diffs only for these commits (where bugflag = 1).
However, the full data also contains other commits in which we did not find bug-related keywords. I am attaching that as well. Those commits are marked as bugflag = 0.
Btw, happy birthday!
Thanks for the wish, Bhaiya, though I am not a birthday-celebrating person. Most probably this is the first time in my life I have cut a cake, to make the kiddo happy.
One more clarification @akondrahman Bhai. bug_script_type = -1 means the diff could not be parsed.
This settles it. Thanks!
New RQs: