FAMILIAR-project opened this issue 5 years ago
I've put some instructions in the README: use `tuxml.py` to load a pre-encoded dataset (it returns a pandas dataframe):

```python
import tuxml
df = tuxml.load_dataset()
```
An example is given in `size-analysis-fast.ipynb`.
Note: the dataset is loaded from `../tuxml-size-analysis-datasets/all_size_withyes.pkl`, so be careful about relative paths and where your git repos are located.
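To make the relative-path caveat concrete, here is a minimal sketch (the helper name is hypothetical, not part of `tuxml.py`) that resolves the dataset path against a known base directory instead of the current working directory, so notebooks can be run from any folder:

```python
import os

# Hypothetical helper: resolve the sibling-repo dataset path relative to a
# given base directory (e.g. the directory containing tuxml.py), not the
# current working directory. The relative layout is the one described above.
def dataset_path(base_dir, rel="../tuxml-size-analysis-datasets/all_size_withyes.pkl"):
    return os.path.normpath(os.path.join(base_dir, rel))
```

With this, `pd.read_pickle(dataset_path(base_dir))` works regardless of where the notebook was launched from.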
Hum... I realize that `all_size_withyes.pkl` contains options that take a single unique value across all configurations... We can safely remove them: constant columns can cause multicollinearity problems, and dropping them helps ML algorithms scale better.
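The cleanup described above can be done in a couple of lines of pandas; a minimal sketch with illustrative column names (not the real dataset's options):

```python
import pandas as pd

# Toy dataframe standing in for the dataset; column names are illustrative.
df = pd.DataFrame({
    "OPT_A": ["y", "n", "y"],   # varies across configurations -> keep
    "OPT_B": ["y", "y", "y"],   # constant: a single unique value -> drop
    "vmlinux": [7, 8, 9],       # size measurement column
})

# Drop options that take a single unique value over the whole dataset;
# they carry no information and only inflate the feature space.
constant = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant)
```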
I will try to release a new dataset ASAP...
I've released a new version of the dataset. Please update your git (git pull) https://gitlab.com/FAMILIAR-project/tuxml-size-analysis-datasets/
Don't be surprised that 'nbyes' differs: it is now computed over the features that remain in the dataset (so it's the old nbyes minus the number of features having a unique value).
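For clarity, the new 'nbyes' can be recomputed as the per-configuration count of options set to 'y' among the remaining feature columns; a sketch with illustrative names:

```python
import pandas as pd

# Toy dataframe after constant options have been removed; names illustrative.
df = pd.DataFrame({
    "OPT_A": ["y", "n"],
    "OPT_C": ["y", "y"],
})

# nbyes = number of options set to 'y' in each row, counted only over the
# feature columns that remain in the dataset.
df["nbyes"] = (df[["OPT_A", "OPT_C"]] == "y").sum(axis=1)
```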
Right now, we all have an ad-hoc method for loading the dataset. We need to unify the process. So here is the plan:
- use the `.pkl` file to speed up the processing
- `nbyes` features by default
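A unified loader along these lines could look like the following sketch; the function body is an assumption about how `tuxml.load_dataset()` might be implemented (the path and filename are the ones mentioned earlier in the thread):

```python
import os
import pandas as pd

# Hypothetical sketch of a unified load_dataset() for tuxml.py: one shared
# entry point that reads the pre-encoded .pkl file from the sibling
# tuxml-size-analysis-datasets repo.
def load_dataset(base_dir=".."):
    path = os.path.join(base_dir, "tuxml-size-analysis-datasets",
                        "all_size_withyes.pkl")
    return pd.read_pickle(path)
```

Everyone then calls `load_dataset()` instead of hand-rolling paths in each notebook.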