Open FAMILIAR-project opened 5 years ago
df.query("kernel_size == 7317008").apply(nbyes, axis=1)
should give 240
with the following implementation
NO_ENCODED_VALUE = 0
YES_ENCODED_VALUE = 1
M_ENCODED_VALUE = 2
def nbyes(row):
    return sum(row == YES_ENCODED_VALUE)
def nbno(row):
    return sum(row == NO_ENCODED_VALUE)
def nbmodule(row):
    return sum(row == M_ENCODED_VALUE)
df['nbyes'] = df.apply(nbyes, axis=1)
df['nbno'] = df.apply(nbno, axis=1)
df['nbmodule'] = df.apply(nbmodule, axis=1)
df['nbyesmodule'] = df['nbyes'] + df['nbmodule']
it gives
kernel_size | nbno | nbyes | nbmodule | nbyesmodule |
---|---|---|---|---|
7304656 | 6273 | 6340 | 0 | 6340 |
(Only nbmodule seems correct.) So either my code is incorrect, or the encoding of 'y' and 'n' is specific to each column.
Another question: the more options are set to 'n', the smaller the kernel size?
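Apart from the encoding, one thing that might be worth ruling out: each `df.apply(..., axis=1)` call above scans every column of `df`, including `kernel_size` and any count column that was already added. A minimal sketch that restricts the counts to the option columns, on toy data (the `OPT_*` names and all values are hypothetical, not taken from the dataset):

```python
import pandas as pd

NO_ENCODED_VALUE = 0
YES_ENCODED_VALUE = 1
M_ENCODED_VALUE = 2

# Toy data: OPT_A..OPT_D are hypothetical option names, values are made up.
df = pd.DataFrame({
    "OPT_A": [1, 0, 2],
    "OPT_B": [1, 1, 0],
    "OPT_C": [0, 2, 0],
    "OPT_D": [1, 0, 0],
    "kernel_size": [300, 200, 100],
})

# Select the option columns once, before any count column is added,
# so that kernel_size and the new columns never enter the counts.
option_cols = [c for c in df.columns if c != "kernel_size"]
options = df[option_cols]

# Vectorized row-wise counts: compare the whole frame, then sum per row.
df["nbyes"] = (options == YES_ENCODED_VALUE).sum(axis=1)
df["nbno"] = (options == NO_ENCODED_VALUE).sum(axis=1)
df["nbmodule"] = (options == M_ENCODED_VALUE).sum(axis=1)
df["nbyesmodule"] = df["nbyes"] + df["nbmodule"]

# Sanity check: if every option is 0, 1 or 2, the counts add up
# to the number of option columns.
assert (df["nbyes"] + df["nbno"] + df["nbmodule"] == len(option_cols)).all()
```

If the final assertion fails on the real data, some column presumably holds a value outside 0/1/2, which would point back to the encoding question.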
That's something to verify but intuitively yes. 'n' is the opposite of 'y' ;) (we have to be careful with 'm' aka modules since they are not directly part of the kernel binary)
Updated the dataset. In every column, 0 means no, 1 means yes, and 2 means module.
We now find the tiny kernel with 240 options set to yes.
The Pearson correlation coefficient between nbyes and size is low (<0.2); the same holds for nbno.
The intuition is that the more options are set to 'y', the larger the kernel size. I have an older implementation here: https://nbviewer.jupyter.org/github/TuxML/ProjetIrma/blob/dev/miscellaneous/csv_kernels/TUXML-basic.ipynb
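For reference, a minimal sketch of how such a coefficient can be computed with pandas. The numbers are toy values chosen for illustration only, and the column names simply mirror the ones used in this thread:

```python
import pandas as pd

# Toy values, not taken from the dataset.
df = pd.DataFrame({
    "nbyes": [1, 2, 3, 4, 5],
    "nbno": [9, 8, 7, 6, 5],
    "kernel_size": [10, 20, 15, 40, 30],
})

# Series.corr computes the Pearson coefficient by default;
# method="pearson" is spelled out here for clarity.
r_yes = df["nbyes"].corr(df["kernel_size"], method="pearson")
r_no = df["nbno"].corr(df["kernel_size"], method="pearson")

print(f"nbyes vs kernel_size: {r_yes:.3f}")
print(f"nbno  vs kernel_size: {r_no:.3f}")
```

Pearson only captures linear association, so a low value does not by itself rule out a monotone but non-linear relation; `method="spearman"` could be tried as a complement.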
I am porting it right now. Then we will have to experiment with and without this hand-crafted feature (does it pay off wrt accuracy?)
@HugoJPMartin can we assume that all values are encoded in the same way (e.g. 'n' is 0, 'y' is 1, 'm' is 2)? What's the encoding?
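A quick way to test that assumption directly on the data is to list the columns holding anything other than 0, 1 or 2. This is only a sketch on toy data; the `OPT_*` column names are hypothetical, and it assumes `kernel_size` is the only non-option column:

```python
import pandas as pd

# Toy data: OPT_C deliberately contains a value outside {0, 1, 2}.
df = pd.DataFrame({
    "OPT_A": [0, 1, 2],
    "OPT_B": [1, 1, 0],
    "OPT_C": [0, 5, 1],
    "kernel_size": [100, 200, 300],
})

EXPECTED_VALUES = [0, 1, 2]  # assumed encoding: no, yes, module

option_cols = [c for c in df.columns if c != "kernel_size"]

# Columns where at least one cell falls outside the assumed encoding.
bad_cols = [c for c in option_cols if not df[c].isin(EXPECTED_VALUES).all()]

print(bad_cols)
```

An empty list would be consistent with a uniform encoding across columns; any column that shows up needs a closer look.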