ai-se / bellwether_community

Bellwether Community detection with JS projects using r2c
GNU General Public License v3.0

1385 defect prediction datasets #14

Open Suvodeep90 opened 5 years ago

Suvodeep90 commented 5 years ago

todo

expectations

The expectations from the results:

  1. adequacy of predictors (pd > 66, pf < 33)
  2. FSS is useful
  3. Hyperparameter optimization is useful
  4. it all scales
  5. stable conclusions across projects
  6. stable conclusions locally
Suvodeep90 commented 5 years ago

Baseline results: out of 1385 projects, I selected the ones with more than 50 rows in the dataset, which gave us 711 projects. Some projects had issues, so the final dataset has 633 projects.

We applied LR (9 parameters tuned using DE). DE optimized the F1 measure, and the chart shows the F1 measure for each project.

In this experiment each project was first divided into a train (67%) and test (33%) set; then the train data was further divided into train and tune sets with 5-fold cross-validation, so the model was trained on 80% of the train data and tuned on the remaining 20%. The best model was then evaluated on the held-out test set. (chart: 1385_baseline)
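For concreteness, a minimal sketch of that rig (assuming sklearn, a numeric feature matrix X and 0/1 labels y; the actual DE tuning of LR's 9 parameters is elided and a default LR is fit per fold):

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def baseline_rig(X, y, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    # 67% train / 33% test, stratified on the class label
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=seed)
    # 5-fold CV on the train data: each fold trains on 80% and
    # tunes on the remaining 20% (DE would search LR's parameters here)
    best_f1, best = -1.0, None
    for tr, tu in StratifiedKFold(5, shuffle=True, random_state=seed).split(X_tr, y_tr):
        model = LogisticRegression(max_iter=1000).fit(X_tr[tr], y_tr[tr])
        f1 = f1_score(y_tr[tu], model.predict(X_tr[tu]), zero_division=0)
        if f1 > best_f1:
            best_f1, best = f1, model
    # the best model (by tuning F1) is scored once on the held-out test set
    return f1_score(y_te, best.predict(X_te), zero_division=0)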

timm commented 5 years ago

looks very nice

Out of 1385 projects, I selected the ones with more than 50 rows in the dataset, which gave us 711 projects. Some projects had issues, so the final dataset has 633 projects.


  1. u are reporting F1. give me pf and recall as separate charts
  2. u r optimizing for F1, right? why not optimize for minimizing false alarm and maximizing recall?
  3. your experiment is a 3-way. can you show all 3 results? (sort the chart on the median value)
  4. can your charts include the results using default LR?
  5. you say 711 projects. there only seem to be 100 here, right? where are the other 600?
  6. no fss? that's cool but please clarify.
Suvodeep90 commented 5 years ago
  1. I calculated them all but, in the first run, forgot to store any results other than the goal measure.
  2. I will do that.
  3. I will do that
  4. I will include the results for default LR in the next result as well.
  5. The chart has points for all 633 projects. Due to its size, only a few labels are shown.
  6. The feature selector is CFS (default). I am still having issues integrating it with DE.
Suvodeep90 commented 5 years ago

results spreadsheet

Default LR on the dataset: the data was divided into train and test sets using 5-fold cross-validation with 5 repeats. There is no FSS, no SMOTE, and the default parameter settings for LR from sklearn are used.

The results shown here are the median scores of 25 runs on each project. The measures are precision, recall, F1-score, d2h, g-score, pci20, and ifa. https://docs.google.com/spreadsheets/d/1uwN4dI-kpkparZ5tj15yl7_jwwaK8lDDIU0-zc9qgyM/edit?usp=sharing
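A minimal sketch of that 5*5 rig (assuming sklearn; only the sklearn-provided measures are shown, with the median over the 25 runs reported per project):

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

def default_lr_medians(X, y, seed=0):
    # 5-fold cross-validation with 5 repeats = 25 train/test splits per project
    X, y = np.asarray(X), np.asarray(y)
    out = {"precision": [], "recall": [], "f1": []}
    rig = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=seed)
    for tr, te in rig.split(X, y):
        pred = LogisticRegression(max_iter=1000).fit(X[tr], y[tr]).predict(X[te])
        out["precision"].append(precision_score(y[te], pred, zero_division=0))
        out["recall"].append(recall_score(y[te], pred, zero_division=0))
        out["f1"].append(f1_score(y[te], pred, zero_division=0))
    # report the median score over the 25 runs
    return {k: float(np.median(v)) for k, v in out.items()}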

Median of median results: F1: 0.69, Prec: 0.69, Recall: 0.71, g-score: 0.46, d2h: 0.72, pci20: 95, ifa: 5

  1. F1 score (chart: F1 Score 1385 dataset)
  2. Recall score (chart: recall vs project)
  3. Precision score (chart: precision vs project)
Suvodeep90 commented 5 years ago

Results for 10 random projects tuned with DE for the F1 measure, with FSS as CFS, SMOTUNED (3 parameters tuned), and LR tuned for 10 hyperparameters, using 5*5 cross-validation. https://docs.google.com/spreadsheets/d/1hVMbGE4lUvWWlH5nFna5Dvyy0nLDF2nmzqHz4rOdS4Q/edit?usp=sharing

Spreadsheet columns: f1, DE_f1, precision, DE_precision, recall…

Here the hyperparameter optimizer is not helping. Investigating the reason.
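A minimal sketch of the DE tuning step (assuming scipy's differential_evolution; for brevity only two LR hyperparameters, C and tol, are tuned, and the SMOTUNED/CFS stages are elided):

import numpy as np
from scipy.optimize import differential_evolution
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def de_tune_lr(X_train, y_train, X_tune, y_tune, seed=0):
    # DE searches (log10 C, tol), maximizing F1 on the tuning split
    def loss(params):
        log_c, tol = params
        model = LogisticRegression(C=10 ** log_c, tol=tol, max_iter=1000)
        model.fit(X_train, y_train)
        return -f1_score(y_tune, model.predict(X_tune), zero_division=0)

    bounds = [(-3.0, 3.0),     # log10(C): C in [0.001, 1000]
              (1e-5, 1e-1)]    # tol
    best = differential_evolution(loss, bounds, maxiter=10, popsize=10, seed=seed)
    log_c, tol = best.x
    # refit on the train split with the best settings found by DE
    return LogisticRegression(C=10 ** log_c, tol=tol, max_iter=1000).fit(X_train, y_train)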

Suvodeep90 commented 5 years ago

Dataset description (columns):

'TLOC', 'TNF', 'TNC', 'TND', 'LOC', 'CL', 'NStmt', 'NFunc', 'RCC', 'MNL', 'avg_WMC', 'max_WMC', 'total_WMC', 'avg_DIT', 'max_DIT', 'total_DIT', 'avg_RFC', 'max_RFC', 'total_RFC', 'avg_NOC', 'max_NOC', 'total_NOC', 'avg_CBO', 'max_CBO', 'total_CBO', 'avg_DIT.1', 'max_DIT.1', 'total_DIT.1', 'avg_NIV', 'max_NIV', 'total_NIV', 'avg_NIM', 'max_NIM', 'total_NIM', 'avg_NOM', 'max_NOM', 'total_NOM', 'avg_NPBM', 'max_NPBM', 'total_NPBM', 'avg_NPM', 'max_NPM', 'total_NPM', 'avg_NPRM', 'max_NPRM', 'total_NPRM', 'avg_CC', 'max_CC', 'total_CC', 'avg_FANIN', 'max_FANIN', 'total_FANIN', 'avg_FANOUT', 'max_FANOUT', 'total_FANOUT', 'NRev', 'NFix', 'avg_AddedLOC', 'max_AddedLOC', 'total_AddedLOC', 'avg_DeletedLOC', 'max_DeletedLOC', 'total_DeletedLOC', 'avg_ModifiedLOC', 'max_ModifiedLOC', 'total_ModifiedLOC','Buggy'
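A minimal sketch of the row-count and class-balance filters mentioned earlier (an assumption on the storage format: one pandas DataFrame per project with the columns above, 'Buggy' being the label):

import pandas as pd

def keep_project(df: pd.DataFrame, min_rows: int = 50, n_folds: int = 5) -> bool:
    # keep a project only if it has more than 50 rows and enough rows of each
    # class that a stratified 5-fold split can put defects in every fold
    if len(df) <= min_rows:
        return False
    counts = df["Buggy"].value_counts()
    return len(counts) == 2 and int(counts.min()) >= n_folds

Projects failing either check are what reduced the 711 projects to 633.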

timm commented 5 years ago

Thirteen questions

i feel like we are closing in on a methodology, a checklist of sorts for large-scale data mining. e.g. if we started building a library of sanity-check demons, as per the first point, then we'd have a product to offer other people

please, for the 1385 dataset, can you clarify the following:

  1. what sanity checks on the data?
    • In the original paper, out of 235K projects, 1385 projects were selected based on:
      • Filtering Out Projects by Programming Languages (only object-oriented, i.e. *.c, *.cpp, *.cxx, *.cc, *.cs, *.java, and *.pas).
      • Filtering Out the Projects with a Small Number of Commits (filter out projects with fewer than 32 commits, which is the 25% quantile of the number of commits over all projects).
      • Filtering Out the Projects with Lifespan Less Than One Year.
      • Filtering Out the Projects with Limited Defect Data (i.e. count the number of fix-inducing and non-fixing commits from a one-year period; the 75% quantile of the number of fix-inducing (respectively non-fixing) commits is chosen as the threshold to filter out the projects with less defect data).
      • Filtering Out the Projects Without Fix-Inducing Commits.
    • I selected the ones with more than 50 rows in the dataset.
    • Some projects had issues (i.e. not enough buggy instances relative to non-buggy ones to create a stratified k=5-fold cross-validation).
      • This reduced the 711 projects to 633 projects (if any cross-validation fold had zero defects in train or test, those results were removed).
    • you need more. please code up the checks in https://bura.brunel.ac.uk/bitstream/2438/7926/2/TSE_NASADataQualNote_V26.pdf, plus anything else you can think of
      • We do need to apply our GitHub 6-point check (not done yet).
  2. how much data culled by the sanity checks?
    • original paper's sanity check culled 99% of projects.
    • The new sanity checks culled 54% of those remaining 1385 projects (down to 633).
  3. what attributes
    • Listed above. Mostly product measures, no personnel metrics. (Most are product metrics collected using the Understand tool and then converted to the attributes above; there are a few process metrics such as "Number of revisions" and "Number of revisions a file was involved in bug-fixing". There is raw data collected using Understand, and then there are the converted metrics, which I am using now, as the authors did.)
      • process metrics: Number of revisions, Number of revisions a file was involved in bug-fixing, Lines added, Lines deleted, Lines modified.
  4. how many rows?
    • Final dataset has 633 projects.
  5. what does each row represent?
    • each row is a class/file
  6. how is class labelling done?
    • 58% of the projects had an issue tracking system; bug reports were collected from there and used for labeling.
    • The other 42% of projects didn't have a tracking system; they were labeled using a keyword search on the commit messages (i.e. bug|fix|error|issue|crash|problem|fail|defect|patch).
  7. what FSS
    • CFS is being used for feature selection, except for the default results, which don't use any FSS.
  8. what learner
    • LR
    • why only LR? (I will be using other learners as well. LR was fast to train; I tried SVM and it was slow, but I have coded it up to use other learners too.)
  9. what pre-processing
    • none?
    • SMOTUNED is being used for class imbalance.
    • Planning to use SMOTE for the default models without hyperparameter optimization.
  10. what hyperparameter optimization on the pre-processor?
    • none?
    • DE
  11. what hyperparameter optimization on the learner?
    • differential evolution on LR
    • optimizing for F1
    • which means that pf can sometimes get very high,
    • can we run the optimizer again, this time for max pd, min pf? (Used pd for optimization; I have a few results but they are not formatted yet.)
    • Q: do u know how to do multi-objective optimization with DE? A: use the cdom score... do u know how to do that? (I have never done that before, but I will code it up.)
  12. what success criteria?
    • pf, ifa, d2h (see the metric sketch after this list)
    • recall, pd (you've got both; why are they different numbers?)
    • pci_20: is that the same as my popt20? where the learner divides modules into guess=defective|ok, then sorts the defective ones on LOC, then the ok ones on LOC, then runs over that sort, defectives before ok?
  13. what is the train/test rig?
    • train (67%) and test (33%) sets
    • 5-fold cross-validation on the train set; the best model (tuned on the held-out 20% of the train data) is then applied to the test set
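For point 12, a sketch of how those measures can be computed from a confusion matrix (pd is the same quantity as recall; the d2h and g-score formulas below are the usual ones in this line of work, and pci_20/ifa are omitted since they need a ranking of modules, e.g. by LOC):

import math

def scores(tp, fp, tn, fn):
    pd   = tp / (tp + fn) if tp + fn else 0.0          # recall / probability of detection
    pf   = fp / (fp + tn) if fp + tn else 0.0          # false alarm rate
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1   = 2 * prec * pd / (prec + pd) if prec + pd else 0.0
    d2h  = math.sqrt((1 - pd) ** 2 + pf ** 2) / math.sqrt(2)              # distance to heaven (pd=1, pf=0)
    g    = 2 * pd * (1 - pf) / (pd + (1 - pf)) if pd + (1 - pf) else 0.0  # g-score
    return dict(pd=pd, pf=pf, precision=prec, f1=f1, d2h=d2h, g=g)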

there are some wondrous things about the results:

timm commented 5 years ago

there's something wrong with your options.

lami and ambidata show the expected effect (DE improves things).


but the other data sets show no improvement.

so:

These new results are from the new dataset; I think the data quality is better than the Commit Guru dataset. The same rig is being applied to both datasets.

Suvodeep90 commented 5 years ago

I am not exactly sure why this is happening, but I am rechecking whether there is an issue. In the meantime I will train for pd. I have not tried to optimize DE with the cdom score before, but I know the concept and am trying to code it up.

Also I will update the answers to the last comment shortly.

timm commented 5 years ago

amazingly good

if it is essentially the same rig as used for CommitGuru150, then the source of this quirk might be in the data, not your algorithms. the performance here is amazingly good. did u read the paper this data came from? (*) all the careful human curation? this data set could be somewhat exceptional

(*) BTW, applying the rule of KNOW THY DATA, that is a requirement, not a question.

The original dataset was collected by Dr. Mockus from SourceForge and GoogleCode.


suvodeep comments that this data was severely pruned prior to analysis

the results here seem high w.r.t. other results we have seen on other data sets. this could either be selection bias here (but the sanity checks seem sane) or the result of poor pre-processing in the other data sets.

timm commented 5 years ago

could you edit https://github.com/ai-se/bellwether_community/issues/14#issuecomment-508883462 and make sure all the 13 points are addressed?

timm commented 5 years ago

Lessons Learned

need to assess learners via their tunability

And LR is fast! and polynomial SVM is slow

when you look at lots of data, you discard most of it, e.g. the 1385 → 633 cull above.

timm commented 5 years ago

params inside NB

local Lib    = require("lib")
local Burn   = require("burn")
local Learn  = require("learn")
local Object = require("object")

Nb = {}
Nb = Object:new{maybes=0, name="nb"}

--------- --------- --------- --------- --------- --------- 
function Nb:new()
  local n=Object.new(self)
  n.datas={}
  return n
end

-- log-likelihood that `cells` belongs to class `data1`, using an m-estimate
-- for symbolic columns and a Gaussian for numeric columns
function Nb:like(cells, data1, m,k, n,     inc)
  local prior= ( #data1.rows + k ) / ( n + k*self.maybes )
  local like = math.log(prior)
  for _,col in pairs(data1.x.cols) do 
    local x = cells[ col.pos ]
    if x ~= Burn.ignore then
      if col:nump() then
        inc = Lib.normal(x, col.mu, col:sd())
      else
        -- frequency of the symbol `x` in this column (assumed kept in col.counts[x])
        inc = ( (col.counts[x] or 0) + m * prior ) /
              ( #data1.rows + m )
      end
      like = like + math.log(inc) end end 
  return like
end

-- find (or create) the Data object that holds rows of this row's class
function Nb:what(cells,data0)
  local c = cells[ data0._class.pos ]
  if not self.datas[c] then 
    self.maybes = self.maybes + 1
    self.datas[c] = data0:clone()
  end
  return self.datas[c]
end

function Nb:inc(cells,data0)
  local data1 = self:what(cells,data0)
  data1:inc(cells)
  data0:inc(cells)
end

-- return the class whose Data gives the highest log-likelihood for `cells`
function Nb:predict(cells,data0)
  local max,got = -10^32,nil
  for h1,data1 in pairs(self.datas) do
    got = got or h1
    local l = self:like(cells, data1, Burn.nb.m, 
                        Burn.nb.k, #data0.rows)
    if l > max then  max,got = l,h1 end 
  end
  return got
end

function Nb:report(era,goal,log)
  Learn.report(era,log,self.name,goal)
end

return Nb
timm commented 5 years ago

generality experiment

feature extractor: is it the same model?

http://menzies.us/pdf/12localb.pdf


timm commented 5 years ago

Tuning heuristics

timm commented 5 years ago

Generation heuristics


Suvodeep90 commented 5 years ago

Charts: buggyness_hist, buggyness, pf, pd, log-ifa, popt20, d2h, g-score, precision, F1 score.

timm commented 5 years ago

hey... what are the new charts above?

These are the new charts with updated metric calculations. I just wanted to store them somewhere; they're not ready for reporting yet.

Suvodeep90 commented 5 years ago

Use of BIRCH algorithm in SE: https://scholar.google.com/scholar?hl=en&as_sdt=5%2C34&sciodt=0%2C34&cites=16182634092879736016&scipsc=1&q=source%3AIEEE+Transaction+on+Software+Engineering&btnG=

Suvodeep90 commented 5 years ago

Time:

160 projects: 73053 secs (about 20.3 hours)

Suvodeep90 commented 5 years ago

Hierarchical Clustering results:

[cluster_id=0] N_children: 9 N_samples: 697

[cluster_id=1] N_children: 9 N_samples: 87

[cluster_id=11] N_children: 2 N_samples: 4

[cluster_id=14] N_children: 10 N_samples: 103

[cluster_id=25] N_children: 0 N_samples: 2

[cluster_id=26] N_children: 0 N_samples: 4

[cluster_id=27] N_children: 0 N_samples: 1

[cluster_id=28] N_children: 6 N_samples: 64

[cluster_id=35] N_children: 19 N_samples: 222

[cluster_id=55] N_children: 20 N_samples: 210
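A minimal sketch of how such clusters can be produced (assuming sklearn's Birch, as referenced above, and a hypothetical project_stats matrix with one row of column statistics per project; this only shows flat leaf clusters, not the N_children hierarchy):

import numpy as np
from sklearn.cluster import Birch

def birch_leaves(project_stats, threshold=0.5, branching_factor=20):
    # n_clusters=None keeps the CF-tree leaves as the clusters (no global reclustering)
    model = Birch(threshold=threshold, branching_factor=branching_factor, n_clusters=None)
    labels = model.fit_predict(np.asarray(project_stats, dtype=float))
    for c in np.unique(labels):
        print(f"[cluster_id={c}] N_samples: {int((labels == c).sum())}")
    return model, labels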

timm commented 5 years ago
  1. rejig report
    • n>.66
    • n < 0.33
  2. add stats tests against the original and each other. color in black the winner and anyone similar to the winner
  3. think about: Not using old data (the self test problem)
  4. Scale up bellwether
    • don't do N^2 do 20*N
    • don't do similar projects. Collect column stats, and
      • Sort by difference and take the 20 projects most distant from you (see the sketch after this list)
    • cluster once with birch: build bellwether from most central project in each leaf
  5. Better learners
    • FFTrees: I predict u won't need FSS and precision will be better if you discretize for precision
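A minimal sketch of the 20*N idea in point 4 (hypothetical names; `stats` is a projects-by-features matrix of per-project column statistics):

import numpy as np

def farthest_projects(stats, me, k=20):
    # instead of all N^2 pairwise comparisons, each project only looks at
    # the k projects whose column statistics are most distant from its own
    stats = np.asarray(stats, dtype=float)
    dist = np.linalg.norm(stats - stats[me], axis=1)
    order = np.argsort(-dist)               # farthest first
    return [int(i) for i in order if i != me][:k]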
Suvodeep90 commented 5 years ago

Ideas:

To apply hyperparameter optimization: tune each project separately, then select the best model among them to use as the bellwether.

Suvodeep90 commented 5 years ago

Stats Test (compare with self test)

F1 score (chart)

Precision (chart)

Recall (chart)

g-score (chart)

pf (chart)

Suvodeep90 commented 5 years ago

in each community (cluster), can we find the bellwether using stats tests, where the models that rank best are the bellwethers?

timm commented 5 years ago

we can if we want to be slow. but let's see what happens if we try bellwether on steroids first

timm commented 5 years ago

any news on 600+?