FAMILIAR-project opened this issue 5 years ago
I've put some instructions in the README: use `tuxml.py` to load a pre-encoded dataset (it returns a pandas dataframe):

```python
import tuxml
df = tuxml.load_dataset()
```
An example is given in `size-analysis-fast.ipynb`.
Note: the dataset is loaded from `../tuxml-size-analysis-datasets/all_size_withyes.pkl`, so be careful about relative paths and where your git repos are located.
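To make the relative-path caveat concrete, here is a minimal sketch (the helper name is hypothetical, not part of `tuxml.py`) that resolves the dataset path against a known base directory instead of the current working directory, so notebooks can be run from any folder:

```python
import os

# Hypothetical helper: resolve the sibling-repo dataset path relative to a
# given base directory (e.g. the directory containing tuxml.py), not the
# current working directory. The relative layout is the one described above.
def dataset_path(base_dir, rel="../tuxml-size-analysis-datasets/all_size_withyes.pkl"):
    return os.path.normpath(os.path.join(base_dir, rel))
```

With this, `pd.read_pickle(dataset_path(base_dir))` works regardless of where the notebook was launched from.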
Hum... I realize that `all_size_withyes.pkl` contains options that take a single unique value across all configurations... We can safely remove them: constant columns can cause multicollinearity problems, and dropping them helps ML algorithms scale better.
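The cleanup described above can be done in a couple of lines of pandas; a minimal sketch with illustrative column names (not the real dataset's options):

```python
import pandas as pd

# Toy dataframe standing in for the dataset; column names are illustrative.
df = pd.DataFrame({
    "OPT_A": ["y", "n", "y"],   # varies across configurations -> keep
    "OPT_B": ["y", "y", "y"],   # constant: a single unique value -> drop
    "vmlinux": [7, 8, 9],       # size measurement column
})

# Drop options that take a single unique value over the whole dataset;
# they carry no information and only inflate the feature space.
constant = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant)
```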
I will try to release a new dataset ASAP...
I've released a new version of the dataset. Please update your git (git pull) https://gitlab.com/FAMILIAR-project/tuxml-size-analysis-datasets/
Don't be surprised that 'nbyes' differs: it is now computed over the features that remain in the dataset (so it's the old nbyes minus the number of features having a unique value).
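For clarity, the new 'nbyes' can be recomputed as the per-configuration count of options set to 'y' among the remaining feature columns; a sketch with illustrative names:

```python
import pandas as pd

# Toy dataframe after constant options have been removed; names illustrative.
df = pd.DataFrame({
    "OPT_A": ["y", "n"],
    "OPT_C": ["y", "y"],
})

# nbyes = number of options set to 'y' in each row, counted only over the
# feature columns that remain in the dataset.
df["nbyes"] = (df[["OPT_A", "OPT_C"]] == "y").sum(axis=1)
```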
Right now, we all have an ad-hoc method for loading the dataset. We need to unify the process. So here is the plan:
- use the `.pkl` file to speed up the processing
- `nbyes` features by default
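A unified loader along these lines could look like the following sketch; the function body is an assumption about how `tuxml.load_dataset()` might be implemented (the path and filename are the ones mentioned earlier in the thread):

```python
import os
import pandas as pd

# Hypothetical sketch of a unified load_dataset() for tuxml.py: one shared
# entry point that reads the pre-encoded .pkl file from the sibling
# tuxml-size-analysis-datasets repo.
def load_dataset(base_dir=".."):
    path = os.path.join(base_dir, "tuxml-size-analysis-datasets",
                        "all_size_withyes.pkl")
    return pd.read_pickle(path)
```

Everyone then calls `load_dataset()` instead of hand-rolling paths in each notebook.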