TuxML / compilation-analysis

Analysis of 125K+ configurations of the Linux kernel (build/compilation phase)

What are the relevant data of the error logs? #4

Open jualvespereira opened 5 years ago

jualvespereira commented 5 years ago

Some relevant information that should be considered for clustering:

'make.: **.+', '.* 1:.+', '.* error:.+', '.* undefined reference.+'
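A minimal sketch of how such patterns could pull the relevant lines out of one error log. The patterns as quoted seem to have lost some `*` characters, so the forms below (in particular the `make` one) are assumed reconstructions; `extract_relevant`, the pattern names, and the sample log are hypothetical and not taken from the project's scripts.

```python
import re

# Assumed reconstructions of the four patterns (illustrative only).
PATTERNS = {
    "make": re.compile(r"make.*: \*\*\*.+"),
    "one_colon": re.compile(r".* 1:.+"),
    "error": re.compile(r".* error:.+"),
    "undefined_reference": re.compile(r".* undefined reference.+"),
}

def extract_relevant(log: str) -> dict[str, list[str]]:
    """Return, for each pattern name, the log lines that match it."""
    hits = {name: [] for name in PATTERNS}
    for line in log.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.match(line):
                hits[name].append(line.strip())
    return hits

# Hypothetical log excerpt, made up for the example.
sample_log = """\
drivers/foo.c:12:5: error: implicit declaration of function 'bar'
foo.o: In function `baz':
foo.c:(.text+0x10): undefined reference to `qux'
make[1]: *** [drivers/foo.o] Error 1
"""

relevant = extract_relevant(sample_log)
for name, lines in relevant.items():
    print(name, lines)
```

Keeping the hits per pattern (rather than one flat list) makes it easy to drop a pattern later if it turns out not to be informative.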

FAMILIAR-project commented 5 years ago

There are two radical approaches for clustering:

Of course the solution is neither completely black nor completely white... A hybrid approach is to find "generic" patterns (like @jualvespereira proposes). Another is to use some knowledge we gather throughout the review of failures.

For instance, I have come across some patterns:

and I implemented an ad-hoc regex, something like:

# Print every error-log line of configuration `cid` that mentions read_overflow2
for err in err_logs_configuration(cid).splitlines():
    if "read_overflow2" in err:
        print(err)

maybe we can have pre-defined regex for labelling failures... and fully automated techniques for the rest.

Final remark: we may have more than one cluster attached to a failure -- see this failure https://github.com/TuxML/compilation-analysis/issues/1#issuecomment-484949833
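A rough sketch of what pre-defined labelling rules could look like, allowing several labels per failure and leaving unmatched logs for an automated technique. Only the `read_overflow2` pattern comes from this thread; the second rule, the label names, and the helpers `label_failure` and `split_labelled` are hypothetical.

```python
import re

# Hypothetical rule table: (label, pattern). Only "read_overflow2" is taken
# from the discussion; the other rule is an illustrative assumption.
LABEL_RULES = [
    ("read_overflow2", re.compile(r"read_overflow2")),
    ("undefined_reference", re.compile(r"undefined reference")),
]

def label_failure(err_log: str) -> set[str]:
    """Return every label whose pattern occurs somewhere in the log."""
    return {label for label, pattern in LABEL_RULES if pattern.search(err_log)}

def split_labelled(logs: dict[int, str]):
    """Separate logs covered by a rule from those left for automated clustering."""
    labelled, unlabelled = {}, []
    for cid, log in logs.items():
        labels = label_failure(log)
        if labels:
            labelled[cid] = labels
        else:
            unlabelled.append(cid)
    return labelled, unlabelled

# Made-up logs keyed by configuration id.
logs = {
    1: "error: call to '__read_overflow2' declared with attribute error\n"
       "ld: undefined reference to `foo'",
    2: "some failure no rule knows about",
}
labelled, unlabelled = split_labelled(logs)
print(labelled)
print(unlabelled)
```

Returning a set of labels rather than a single one is what lets a failure belong to more than one cluster.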

jualvespereira commented 5 years ago

I extracted the four pieces of information above and then clustered them using brute force. I should optimize the script, since I have too many clusters, which makes such an algorithm infeasible. Some ways that come to mind to make it feasible:

FAMILIAR-project commented 5 years ago

Interesting ideas, go ahead!

jualvespereira commented 5 years ago

I used the data frame created in issue #5 to cluster the errors and got 32 clusters. I removed the search for 'make.: *.+' errors (which may not be that significant), and 166 cids were not classified (i.e., I couldn't isolate their relevant error information using just '.* 1:.+', '.* error:.+', and '.* undefined reference.+'). I'll try tf-idf and k-means to discover the top terms to cluster on.
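To illustrate the tf-idf half of that idea, here is a small standard-library sketch that ranks the terms of each log by tf-idf. It is not the script used here: in practice a library such as scikit-learn would likely be used, the k-means step is left out, and `tokenize`, `top_tfidf_terms`, and the sample logs are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and keep identifier-like tokens of at least 3 characters."""
    return re.findall(r"[a-z_][a-z0-9_]{2,}", text.lower())

def top_tfidf_terms(docs: list[str], k: int = 3) -> list[list[str]]:
    """For each document, return its k highest-scoring terms under tf-idf."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result

# Made-up error messages standing in for real logs.
docs = [
    "error: undefined reference to foo",
    "error: undefined reference to bar",
    "error: call to read_overflow2 declared with attribute error",
]
top = top_tfidf_terms(docs)
print(top)
```

A term that occurs in every log (here `error`) gets a score of zero, so the ranking surfaces the terms that distinguish one group of failures from another.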

jualvespereira commented 5 years ago

I'm able to cover all error logs after using k-means to discover the top terms to cluster on. For the clustering, I used 4 terms ('.* error:.+', 'undefined reference.+', '.* 1:.+', '.*aicasm.+') and considered the error that comes first in each error message. We have a total of 16 clusters. I was able to classify 12 of them by looking at the issues (TuxML/ProjetIrma), qualitative analysis of the bug, and a decision tree. You can find attached the file with further details. For each error log, we have:

logErr_detail.xlsx
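A sketch of the "first error wins" grouping described in the comment above. The four terms are the ones listed there (with the leading `.*` assumed where it seems to have been lost); the `signature` normalisation, the `<unclassified>` bucket, and the sample logs are hypothetical and only show one way the first error line could be turned into a cluster key.

```python
import re
from collections import defaultdict

# The four terms from the comment; the ".*" prefixes are assumed.
TERMS = [r".* error:.+", r"undefined reference.+", r".* 1:.+", r".*aicasm.+"]
FIRST_ERROR = re.compile("|".join(f"(?:{t})" for t in TERMS))

def first_error(log: str):
    """Return the first log line containing any of the terms, or None."""
    for line in log.splitlines():
        if FIRST_ERROR.search(line):
            return line.strip()
    return None

def signature(line: str) -> str:
    """Hypothetical normalisation: drop a leading file:line:col prefix and
    mask quoted identifiers and numbers, so similar errors share a key."""
    line = re.sub(r"^\S+?:\d+:\d+:\s*", "", line)
    line = re.sub(r"[`'][^`']*'", "<id>", line)
    return re.sub(r"\d+", "<n>", line)

def cluster_by_first_error(logs: dict[int, str]) -> dict[str, list[int]]:
    """Group configuration ids by the signature of their first error line."""
    clusters = defaultdict(list)
    for cid, log in logs.items():
        line = first_error(log)
        key = signature(line) if line is not None else "<unclassified>"
        clusters[key].append(cid)
    return dict(clusters)

# Made-up logs keyed by configuration id.
logs = {
    10: "CC foo.o\nfoo.c:12:5: error: implicit declaration of function 'bar'\nmake: *** Error 1",
    11: "CC baz.o\nbaz.c:40:9: error: implicit declaration of function 'qux'\n",
    12: "ld: main.o: undefined reference to `start'",
}
clusters = cluster_by_first_error(logs)
for key, cids in clusters.items():
    print(key, cids)
```

How aggressively the first error line is normalised directly controls the number of clusters, so that step is worth tuning against manually reviewed failures.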