Ameobea / orange3

Orange 3 data mining suite: http://orange.biolab.si

Daily Progress Updates #18

Closed ameo-unito-bot closed 8 years ago

ameo-unito-bot commented 8 years ago

┆Issue is synchronized with this Asana task

Ameobea commented 8 years ago

After corresponding with some members of the Orange project team, we've decided to use Python's __repr__ method to generate code for internal objects. @Pelonza, this solves the problem of preprocessor code generation quite elegantly and makes it possible to convert preprocessors, classifiers, and some other widget types to code without having to construct the actual widget object.
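As a rough illustration of the idea above, here is a minimal sketch in which __repr__ returns the constructor call that would recreate the object. The two classes are hypothetical stand-ins that borrow the names Discretize and EqualFreq for readability; they are not Orange's actual implementations.

```python
class EqualFreq:
    """Hypothetical stand-in for a discretization method."""

    def __init__(self, n=4):
        self.n = n

    def __repr__(self):
        # Return the constructor call that would rebuild this object.
        return f"EqualFreq(n={self.n!r})"


class Discretize:
    """Hypothetical stand-in for a preprocessor that wraps a method."""

    def __init__(self, method=None, remove_const=True):
        self.method = method
        self.remove_const = remove_const

    def __repr__(self):
        # Nested objects are rendered through their own __repr__.
        return (f"Discretize(method={self.method!r}, "
                f"remove_const={self.remove_const!r})")


original = Discretize(method=EqualFreq(n=4), remove_const=False)
code = repr(original)   # a string of Python source
rebuilt = eval(code)    # evaluating it yields an equivalent object

print(code)  # Discretize(method=EqualFreq(n=4), remove_const=False)
```

Because the string is valid Python, a code generator can paste it into an output script directly, without instantiating any widget.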

I've made a new branch, https://github.com/Ameobea/orange3/tree/repr, on which I will be creating the __repr__ functions for Orange internal objects. I will be submitting a PR to the Orange project separate from the main code generation code. After that, I will work on applying the new code to the code generation code I currently have, which should hopefully give a significant boost in code efficiency, output readability, and overall cleanness of the code generation process.

Ameobea commented 8 years ago

I've done a basic __repr__ generator for TreeLearner and generated the following:

TreeLearner(criterion=entropy, splitter=best, max_depth=100, min_samples_split=7, min_samples_leaf=2, preprocessors=[PreprocessorList([
    Discretize(method=EqualFreq(n=4), remove_const=False),
    ProjectPCA(n_components=10),
    Orange.widgets.data.owpreprocess._Randomize(rand_type=RandomizeClasses),
    Orange.widgets.data.owpreprocess._Scaling(center=Orange.widgets.data.owpreprocess.mean, scale=Orange.widgets.data.owpreprocess.std),
    Orange.preprocess.fss.SelectRandomFeatures(k=10),
    Orange.preprocess.fss.SelectBestFeatures(method=Orange.preprocess.score.InfoGain, k=10, threshold=None, decreasing=True),
    Continuize(zero_based=True, multinomial_treatment=Indicators),
    Impute(method=Orange.preprocess.impute.Average),
]), RemoveNaNClasses(), Continuize(zero_based=True, multinomial_treatment=Indicators), Orange.preprocess.fss.RemoveNaNColumns(threshold=None), SklImpute(strategy='mean')], )

I added code to leave out attributes that are still at their default values, to make the result as readable and clear as possible. This one in particular is still very large because the input preprocessor widget contains every available preprocessor for testing purposes.

In any case, I think this is a good sign for the feasibility of this method, and I will continue applying it to other learners, preprocessors, etc.
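One way to skip default-valued attributes, sketched here under the assumption that every __init__ parameter is stored on an attribute of the same name, is to compare each value against the constructor signature. The helper repr_without_defaults and the TreeLearner stand-in are hypothetical and are not the code on the branch.

```python
import inspect


def repr_without_defaults(obj):
    """Constructor-style repr that omits arguments left at their defaults.

    Assumption: each __init__ parameter is stored on an attribute
    of the same name.
    """
    signature = inspect.signature(type(obj).__init__)
    parts = []
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            continue
        value = getattr(obj, name)
        has_default = param.default is not inspect.Parameter.empty
        if has_default and value == param.default:
            continue  # unchanged from the default, so leave it out
        parts.append(f"{name}={value!r}")
    return f"{type(obj).__name__}({', '.join(parts)})"


class TreeLearner:
    """Hypothetical stand-in; the default values are illustrative only."""

    def __init__(self, criterion="gini", splitter="best",
                 max_depth=None, min_samples_split=2):
        self.criterion = criterion
        self.splitter = splitter
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split


learner = TreeLearner(criterion="entropy", max_depth=100)
print(repr_without_defaults(learner))
# TreeLearner(criterion='entropy', max_depth=100)
```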

Ameobea commented 8 years ago

It turns out that __repr__s already exist for every learner: they are generated in three lines of Python witchcraft, so there is no need to write a separate __repr__ for each learner individually. The only thing missing was the preprocessors attribute, which wasn't part of params. The existing method also doesn't leave out default attributes, but I don't think the size and scope of the code needed to add that is worth it.
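For readers unfamiliar with the pattern, a generic __repr__ on a base class might look roughly like the sketch below. It is not Orange's actual three-line implementation; the base class is a hypothetical stand-in that only reuses the attribute names params and preprocessors mentioned above.

```python
class Learner:
    """Hypothetical base class; not Orange's real Learner."""

    def __init__(self, preprocessors=None, **params):
        self.params = params  # constructor keyword arguments
        self.preprocessors = list(preprocessors or [])

    def __repr__(self):
        # One generic __repr__ serves every subclass.
        args = [f"{key}={value!r}" for key, value in self.params.items()]
        if self.preprocessors:
            # preprocessors lives outside params, so add it separately.
            args.append(f"preprocessors={self.preprocessors!r}")
        return f"{type(self).__name__}({', '.join(args)})"


class TreeLearner(Learner):
    """Subclasses inherit the generic __repr__ unchanged."""


class Impute:
    """Hypothetical preprocessor stand-in."""

    def __repr__(self):
        return "Impute()"


learner = TreeLearner(max_depth=100, preprocessors=[Impute()])
print(repr(learner))
# TreeLearner(max_depth=100, preprocessors=[Impute()])
```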

I added this in on a new branch (https://github.com/Ameobea/orange3/tree/repr2) and will continue my work there instead of the old repr branch.

Ameobea commented 8 years ago

I merged the repr2 branch (which was cloned from the most recent master of the main project) into the code generation branch after adding some small commits that add reprs for some preprocessors and add the preprocessors attribute to learners. All `repr`-based changes are in the `repr` branch, which contains no code generation commits.

For the code generator, I improved a couple of the already implemented code generators with the updated repr code and am already impressed at how well it improves both the readability of the output code and the simplicity of the code gen init code. I'm very optimistic for future progress and think that I can have most of the widgets working with repr-based code generators by the end of the week.

I also tried out using one code generator for all learner-based widgets by creating a code generator in owlearnerwidget. Since all of the code in the individual learner widgets seems to be based around the learner, it's possible to create one unified code generator for all learners.
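The unified generator described above could take a shape like the following sketch. All class and method names here (OWLearnerSketch, generate_code, LEARNER) are hypothetical and chosen for illustration; they are not the names used in owlearnerwidget or on the branch.

```python
class Learner:
    """Hypothetical learner base with a generic, constructor-style repr."""

    def __init__(self, **params):
        self.params = params

    def __repr__(self):
        args = ", ".join(f"{k}={v!r}" for k, v in self.params.items())
        return f"{type(self).__name__}({args})"


class TreeLearner(Learner):
    pass


class KNNLearner(Learner):
    pass


class OWLearnerSketch:
    """Hypothetical base widget: one code generator shared by all learners."""

    LEARNER = None  # each concrete widget names its learner class

    def __init__(self, **settings):
        self.learner = self.LEARNER(**settings)

    def generate_code(self, var_name="learner"):
        # Relies only on the learner's repr, so subclasses need no
        # generator of their own.
        return f"{var_name} = {self.learner!r}"


class OWTreeSketch(OWLearnerSketch):
    LEARNER = TreeLearner


class OWKNNSketch(OWLearnerSketch):
    LEARNER = KNNLearner


print(OWTreeSketch(max_depth=5).generate_code())
# learner = TreeLearner(max_depth=5)
print(OWKNNSketch(n_neighbors=3).generate_code("knn"))
# knn = KNNLearner(n_neighbors=3)
```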

One issue I see right now is dependencies and telling the code generator how to include them, but I don't think it will be too difficult to implement. Worst-case scenario, if it's impossible to figure that out dynamically, I could add a static list of dependencies to every output script regardless of which widgets are in it.
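One possible dynamic approach, sketched here as an assumption and not as what the branch does, is to walk an object and read the module of each class it contains. The helper collect_imports is hypothetical, and standard-library objects stand in for Orange objects so the example runs on its own.

```python
from decimal import Decimal
from fractions import Fraction
from types import SimpleNamespace


def collect_imports(obj, found=None, seen=None):
    """Gather 'from module import Class' lines for obj and what it contains.

    Assumption: every class is importable by name from its __module__,
    which does not hold for nested or locally defined classes.
    """
    found = set() if found is None else found
    seen = set() if seen is None else seen
    if id(obj) in seen:  # guard against reference cycles
        return found
    seen.add(id(obj))

    if isinstance(obj, (list, tuple, set)):
        for item in obj:
            collect_imports(item, found, seen)
    elif isinstance(obj, dict):
        for item in obj.values():
            collect_imports(item, found, seen)
    else:
        cls = type(obj)
        if cls.__module__ != "builtins":  # builtins need no import
            found.add(f"from {cls.__module__} import {cls.__name__}")
            # Recurse into instance attributes, if the object has any.
            for item in getattr(obj, "__dict__", {}).values():
                collect_imports(item, found, seen)
    return found


# Stand-in object graph: a container holding values from other modules.
pipeline = SimpleNamespace(
    scale=Fraction(1, 3),
    steps=[Decimal("0.5"), SimpleNamespace(k=10)],
)
imports = sorted(collect_imports(pipeline))
print("\n".join(imports))
# from decimal import Decimal
# from fractions import Fraction
# from types import SimpleNamespace
```

Since the reprs shown earlier in this thread already print some names fully qualified (for example Orange.preprocess.fss.SelectRandomFeatures), a simpler alternative would be a single top-level package import, at the cost of longer lines in the output script.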

TL;DR Lots of improvements with __repr__ generation, improvements to code generators using the `repr` changes, and reduction in code clutter in the code gen inits and the output script.