TuxML / size-analysis

Analysis of 125+ Linux configurations (this time for predicting/understanding kernel sizes)

Number of 'y' values in a configuration is a good feature #7

Open FAMILIAR-project opened 5 years ago

FAMILIAR-project commented 5 years ago

The intuition is that the more options are set to 'y', the larger the kernel is. I have an older implementation here: https://nbviewer.jupyter.org/github/TuxML/ProjetIrma/blob/dev/miscellaneous/csv_kernels/TUXML-basic.ipynb

I am porting it right now. Then we will have to experiment with and without this hand-crafted feature (does it pay off in terms of accuracy?).

@HugoJPMartin can we assume that all values are encoded in the same way (e.g. 'n' is 0, 'y' is 1, 'm' is 2)? What is the encoding?
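To make the question concrete, here is a minimal sketch of what a single shared encoding would look like. It assumes the raw values are the strings 'n', 'y' and 'm'; the mapping is the one proposed in the question, and the helper name `encode_tristate` and the option names are hypothetical.

```python
import pandas as pd

# Assumed encoding, taken from the question above (not confirmed by the dataset).
ENCODING = {"n": 0, "y": 1, "m": 2}

def encode_tristate(raw: pd.DataFrame) -> pd.DataFrame:
    """Map 'n'/'y'/'m' strings to integers using the same mapping for every column."""
    return raw.apply(lambda col: col.map(ENCODING))

# Toy data with hypothetical option names.
raw = pd.DataFrame({"OPT_A": ["y", "n", "m"], "OPT_B": ["n", "n", "y"]})
encoded = encode_tristate(raw)
print(encoded)
```

With one mapping applied to all columns, counting cells equal to 1 is the same as counting 'y' values. A per-column encoder would not give that guarantee.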

FAMILIAR-project commented 5 years ago

df.query("kernel_size == 7317008").apply(nbyes, axis=1) should give 240

with the following implementation

NO_ENCODED_VALUE = 0
YES_ENCODED_VALUE = 1
M_ENCODED_VALUE = 2

def nbyes(row):
    return sum(row == YES_ENCODED_VALUE)

def nbno(row):
    return sum(row == NO_ENCODED_VALUE)

def nbmodule(row):
    return sum(row == M_ENCODED_VALUE)

df['nbyes'] = df.apply(nbyes, axis=1)
df['nbno'] = df.apply(nbno, axis=1)
df['nbmodule'] = df.apply(nbmodule, axis=1)
df['nbyesmodule'] = df['nbyes'] + df['nbmodule']

it gives

kernel_size nbno nbyes nbmodule nbyesmodule
7304656 6273 6340 0 6340

(Only nbmodule seems correct.) So either my code is incorrect, or the encoding of 'y' and 'n' is specific to each column.
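One way to narrow this down is sketched below on a toy frame, with hypothetical option names. It counts over the option columns only (so `kernel_size` cannot contribute to a count) and lists columns whose values fall outside {0, 1, 2}. It cannot detect a column where 0 and 1 are simply swapped; that would need the raw 'y'/'n' values.

```python
import pandas as pd

NO_ENCODED_VALUE = 0
YES_ENCODED_VALUE = 1
M_ENCODED_VALUE = 2

# Toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "kernel_size": [100, 200],
    "OPT_A": [1, 0],
    "OPT_B": [0, 0],
    "OPT_C": [2, 1],
})

# Count over option columns only, not over the target column.
options = df.drop(columns=["kernel_size"])

counts = pd.DataFrame({
    "nbyes": (options == YES_ENCODED_VALUE).sum(axis=1),
    "nbno": (options == NO_ENCODED_VALUE).sum(axis=1),
    "nbmodule": (options == M_ENCODED_VALUE).sum(axis=1),
})

# Columns containing values outside the expected set hint at a different encoding.
allowed = {NO_ENCODED_VALUE, YES_ENCODED_VALUE, M_ENCODED_VALUE}
suspicious = [c for c in options.columns if not set(options[c].unique()) <= allowed]

print(counts)
print(suspicious)
```

The boolean-mask form `(options == value).sum(axis=1)` computes the same counts as the row-wise `apply`, but in one vectorized pass.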

arnobl commented 5 years ago

Another question: the more options are set to 'n', the smaller the kernel is?

FAMILIAR-project commented 5 years ago

That's something to verify, but intuitively yes: 'n' is the opposite of 'y' ;) (We have to be careful with 'm', i.e. modules, since they are not directly part of the kernel binary.)

HugoJPMartin commented 5 years ago

Updated the dataset. Everywhere, 0 means no, 1 means yes, and 2 means module.

We now find the tiny kernel with 240 options set to yes.

The Pearson correlation coefficient between nbyes and size is low (<0.2); same for nbno.
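For reference, a minimal sketch of how such a coefficient can be computed with pandas. The numbers below are made up and perfectly linear, so they do not reproduce the result above; the column names follow the earlier snippet.

```python
import pandas as pd

# Toy data (invented values, perfectly linear on purpose).
df = pd.DataFrame({
    "nbyes": [1, 2, 3, 4],
    "kernel_size": [10, 20, 30, 40],
})

# Pearson correlation between the hand-crafted feature and the target.
r = df["nbyes"].corr(df["kernel_size"], method="pearson")
print(r)
```

Note that Pearson only measures linear association, so a low value does not by itself rule out a non-linear relationship between nbyes and size.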