Open jualvespereira opened 5 years ago
There are two radical approaches for clustering:
Of course the solution is neither completely black nor completely white... A hybrid approach is to find "generic" patterns (as @jualvespereira proposes). Another is to use the knowledge we gather while reviewing failures.
For instance, I have come across some patterns:
error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter __read_overflow2();
https://github.com/TuxML/ProjetIrma/issues/145 and I implemented an ad-hoc regex, something like:
for err in err_logs_configuration(cid).splitlines():
    if "read_overflow2" in err:
        print(err)
Maybe we can have pre-defined regexes for labelling failures... and fully automated techniques for the rest.
Final remark: we may have more than one cluster attached to a failure -- see this failure https://github.com/TuxML/compilation-analysis/issues/1#issuecomment-484949833
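To make the "pre-defined regex for labelling" idea concrete, here is a minimal sketch. The label names and the `LABELLED_PATTERNS` table are hypothetical; only the `__read_overflow2` pattern comes from the failure quoted above. It returns every matching label, so a single failure can carry more than one.

```python
import re

# Hypothetical label -> pattern table (assumption, not the project's actual list).
# Only the read_overflow2 entry is taken from the failure quoted in this thread.
LABELLED_PATTERNS = {
    "read_overflow2": re.compile(r"__read_overflow2"),
    "undefined_reference": re.compile(r"undefined reference"),
    "generic_error": re.compile(r"\berror:"),
}

def label_failure(log_text):
    """Return the sorted list of labels whose pattern matches any line of the log."""
    labels = set()
    for line in log_text.splitlines():
        for label, pattern in LABELLED_PATTERNS.items():
            if pattern.search(line):
                labels.add(label)
    return sorted(labels)
```

Logs that get no label at all would then be the ones handed over to the fully automated techniques.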
I extracted the four pieces of information above and then clustered using brute force. I should optimize the script, since I have too many clusters, which makes such an algorithm infeasible. Some ways that come to mind to make it feasible:
Interesting ideas, go ahead!
I used the data frame created in issue #5 to cluster the errors and got 32 clusters. I removed the search for 'make.: *.+' errors (which may not be very significant), and 166 cids were left unclassified (i.e., I couldn't isolate their relevant error information using just '. 1:.+', '. error:.+', '. undefined reference.+'). I'll try using TF-IDF and k-means to discover the top terms to cluster on.
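As a rough illustration of the "top terms" idea, here is a standard-library sketch of the TF-IDF half only. The k-means step is left out, and the tokenizer and the `top_terms` helper are assumptions for illustration, not the script used in this thread.

```python
import math
import re
from collections import Counter

def tokenize(line):
    # Lowercased tokens of letters/underscores, at least 3 characters long;
    # digits and punctuation are dropped (an arbitrary choice for this sketch).
    return re.findall(r"[a-z_]{3,}", line.lower())

def top_terms(lines, k=3):
    """Return the k highest-scoring TF-IDF terms for each error line."""
    docs = [Counter(tokenize(line)) for line in lines]
    n_docs = len(docs)
    # Document frequency: number of lines in which each term appears.
    df = Counter(term for doc in docs for term in doc)
    result = []
    for doc in docs:
        total = sum(doc.values()) or 1
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result
```

Terms that occur in every line (such as a ubiquitous "error") get a score of zero and drop to the bottom, which is what makes the remaining terms candidates for cluster labels.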
I'm now able to cover all error logs after using k-means to discover the top terms to cluster on. For the clustering, I used 4 terms ('. error:.+', 'undefined reference.+', '. 1:.+', '.*aicasm.+') and considered the error that comes first in each error message. We have a total of 16 clusters. I was able to classify 12 of them by looking at the issues (TuxML/ProjetIrma), a qualitative analysis of the bugs, and a decision tree. The attached file has further details. For each error log, we have:
Some relevant information that should be considered for clustering:
'make.: **.+' '. 1:.+' '. error:.+' '.* undefined reference.+'
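Clustering on the first matching error, as described in this thread, could look roughly like the sketch below. The two patterns are simplified stand-ins for the ones listed above (their exact wildcards are an assumption), and `logs` is a hypothetical mapping from cid to error-log text.

```python
import re
from collections import defaultdict

# Simplified stand-ins for the patterns listed above (assumption: the
# original expressions may have used different wildcards).
PATTERNS = [
    re.compile(r"error:.+"),
    re.compile(r"undefined reference.+"),
]

def first_error(log_text):
    """Return the first matched fragment in the log, or None if nothing matches."""
    for line in log_text.splitlines():
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                return match.group(0).strip()
    return None

def cluster_by_first_error(logs):
    """Group cids by the first error fragment found in their log."""
    clusters = defaultdict(list)
    for cid, text in logs.items():
        clusters[first_error(text) or "unclassified"].append(cid)
    return dict(clusters)
```

Because the key is the matched fragment rather than the whole line, file names and line numbers that precede "error:" do not split otherwise identical failures into separate clusters.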