What are the relevant data of the error logs?

jualvespereira commented 5 years ago

Some relevant information that should be considered for clustering:

'make.*: \*\*\*.+' '.* 1:.+' '.* error:.+' '.* undefined reference.+'

FAMILIAR-project commented 5 years ago

There are two radical approaches for clustering:

fully automated with the usual challenges (dealing with too much noise, with almost similar yet different content, with clusters that may "subsume" other clusters, etc.)
manual with pattern matching: yes, but you need to know what to search for

Of course the solution is neither completely black nor completely white... A hybrid approach is to find "generic" patterns (like @jualvespereira proposes). Another is to use the knowledge we gather while reviewing failures.

For instance, I have come across some patterns:

undefined reference to backlight https://github.com/TuxML/ProjetIrma/issues/148
error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter __read_overflow2(); https://github.com/TuxML/ProjetIrma/issues/145
undefined reference to `v4l2_ https://github.com/TuxML/ProjetIrma/issues/141
undefined references to `crc32_le' https://github.com/TuxML/ProjetIrma/issues/143

and I implemented some ad-hoc matching, something like:

for err in err_logs_configuration(cid).splitlines():
    if "read_overflow2" in err:  # plain substring test on each log line
        print(err)

Maybe we can have predefined regexes for labelling known failures, and fully automated techniques for the rest.

Final remark: we may have more than one cluster attached to a failure -- see this failure https://github.com/TuxML/compilation-analysis/issues/1#issuecomment-484949833
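To make the "predefined regex" idea concrete, here is a minimal sketch. The patterns are taken from the failures listed above; the label names, the `KNOWN_FAILURES` table and the `label_failure` function are hypothetical, made up for illustration. It returns a set so that a failure can carry more than one label.

```python
import re

# Hypothetical label -> pattern table. The "." before each symbol stands for
# the opening quote or backtick that the linker/compiler prints.
KNOWN_FAILURES = {
    "backlight": re.compile(r"undefined reference to .backlight"),
    "read_overflow2": re.compile(r"error: call to .__read_overflow2."),
    "v4l2": re.compile(r"undefined reference to .v4l2_"),
    "crc32_le": re.compile(r"undefined references? to .crc32_le"),
}


def label_failure(err_log):
    """Return the set of known labels whose pattern occurs in the error log."""
    return {
        label
        for label, pattern in KNOWN_FAILURES.items()
        if pattern.search(err_log)
    }
```

An empty set would mean the failure matches none of the known patterns and is left to the automated clustering.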

jualvespereira commented 5 years ago

I extracted the four pieces of information above and then clustered using brute force. I should optimize the script, since I get too many clusters, which makes such an algorithm infeasible. Some ways that come to mind to make it feasible:

Detect noise by inspecting automatically generated clusters from a sample of cids for each config option in the decision tree, and then ignore that information.
Ignore the error order.
Sort the errors first before grouping.
Compute clusters of a small sample of randomly chosen configs.
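As a rough sketch of the "ignore the order" and "sort first" ideas: assuming the extracted errors are available as a mapping from cid to a list of error strings (the name `errors_by_cid` and the function below are hypothetical), sorting the de-duplicated errors gives a canonical key, so two logs with the same errors in a different order land in the same group. Dropping repeated errors is an extra assumption of this sketch.

```python
from collections import defaultdict


def group_by_error_set(errors_by_cid):
    """Group cids whose logs contain the same errors, whatever their order."""
    clusters = defaultdict(list)
    for cid, errors in errors_by_cid.items():
        # De-duplicate, then sort: the tuple is a canonical, hashable key.
        key = tuple(sorted(set(errors)))
        clusters[key].append(cid)
    return dict(clusters)
```

Running this on a random subset of cids (for example one drawn with `random.sample`) would cover the last idea in the list as well.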

FAMILIAR-project commented 5 years ago

Interesting ideas, go ahead!

jualvespereira commented 5 years ago

I used the data frame created in issue #5 to cluster the errors and I got 32 clusters. I removed the search for 'make.*: \*\*\*.+' errors (which may not be so significant), and 166 cids were not classified (i.e., I couldn't isolate their relevant error information by using just '.* 1:.+', '.* error:.+', '.* undefined reference.+'). I'll try to use TF-IDF and k-means to discover the top terms to cluster.
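For readers unfamiliar with the TF-IDF part, here is a small standard-library sketch of how top terms can be scored. It only illustrates the weighting: the k-means step is not shown, the tokenizer is a simple assumption, and the function name `top_tfidf_terms` is hypothetical (in practice a library implementation would likely be used).

```python
import math
import re
from collections import Counter


def top_tfidf_terms(logs, k=3):
    """Return, for each log, its k highest-scoring terms under plain TF-IDF."""
    # Assumed tokenizer: lowercase identifiers of at least two characters.
    docs = [Counter(re.findall(r"[a-z_][a-z0-9_]+", log.lower())) for log in logs]
    n_docs = len(docs)
    # Document frequency: in how many logs does each term appear?
    doc_freq = Counter(term for doc in docs for term in doc)
    result = []
    for doc in docs:
        total = sum(doc.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result
```

Terms that occur in every log (such as "error") score zero, which is why the distinctive symbol names tend to surface as top terms.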

jualvespereira commented 5 years ago

I'm able to cover all error logs after using k-means to discover the top terms to cluster. For the clustering, I used four patterns ('.* error:.+', 'undefined reference.+', '.* 1:.+', '.*aicasm.+') and I considered the error that comes first in each error message. We have a total of 16 clusters. I was able to classify 12 of them by looking at the issues (TuxML/ProjetIrma), a qualitative analysis of the bug, and the decision tree. You can find the file with further details attached. For each error log, we have:
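One possible reading of "the error that comes first", sketched below: scan the log line by line and keep the first line matching any of the four patterns. The function names and the `logs_by_cid` mapping are hypothetical, and the extracted text is used as the cluster key without any normalization, which is an assumption and probably not what produced the 16 clusters.

```python
import re
from collections import defaultdict

# The four patterns quoted above.
PATTERNS = [
    re.compile(p)
    for p in (r".* error:.+", r"undefined reference.+", r".* 1:.+", r".*aicasm.+")
]


def first_error(err_log):
    """Return the first matching fragment in the log, or None if nothing matches."""
    for line in err_log.splitlines():
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                return match.group(0).strip()
    return None


def cluster_by_first_error(logs_by_cid):
    """Group cids by the first error found in their log (None = unclassified)."""
    clusters = defaultdict(list)
    for cid, err_log in logs_by_cid.items():
        clusters[first_error(err_log)].append(cid)
    return dict(clusters)
```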

configuration options responsible for the error
number of directly related errors
number of indirectly related errors
which errors dominate this one
cause of the error

logErr_detail.xlsx

TuxML / compilation-analysis

What are the relevant data of the error logs? #4