Daily Progress Updates + Main Discussion Thread

Ameobea / orange3

Orange 3 data mining suite: http://orange.biolab.si

Other

1 stars 0 forks source link

Daily Progress Updates + Main Discussion Thread #8

Closed ameo-unito-bot closed 8 years ago

ameo-unito-bot commented 8 years ago

┆Issue is synchronized with this Asana task

Ameobea commented 8 years ago

Today, I created code gen inits for other widgets and got code generation for a complete workflow (image included below) working from start to finish. The only manual modifications I had to make were due to an input/output linker bug that I will fix soon. (Generated code is also linked below)

@kernc I read over your examples and proposed generation strategy a few times and think I have a bit more to say about my code generation model.

Instead, I propose there be no separate string-generating function in every widget, but that widgets define their "main", exportable functional behaviour in a separate method

I do not use a string generator function in each widget, nor do I use a singular "main" function that performs the widget's actions. Instead, I opted for a mix between the two that initializes a code generator object with data from the widget (including something that could be thought of as a "main" function") and provides ways to tweak and modify that information to produce clean and efficient code down the line.

Here's an example of the current code generator for owclassificationtreegraph, the visualizer for classification trees: (I've added in comments explaining what each part does)

    def init_code_gen(self):
        def pre():
            qapp = QApplication([])

        def pre2():
            # Import code can't recognize constants, so import manually
            from sklearn.tree._tree import TREE_LEAF

        # "Main" function of the widget.  This code is run during widget exection in the output script
        def run():
            # Construct a new classification tree graph
            ow = OWClassificationTreeGraph()
            ow.handleNewSignals()
            # Add the input classifier from the parent widget
            # This is linked by the code generator automatically to the variable `input_classifier`
            ow.ctree(input_classifier)
            # Display the visualization
            ow.show()
            qapp.exec()
            update_selection(ow)
            ow.handleNewSignals()
            # Close the widget and proceed with execution
            ow.saveSettings()
            ow.onDeleteWidget()

        gen = self.code_gen()
        # Give the code generator a copy of the widget
        gen.set_widget(self)
        # Generate import statements for dependencies
        gen.add_import([QApplication, OWClassificationTreeGraph, numpy])
        # Insert the contents of the preamble functions at the beginning of the script if they're not already there
        gen.add_preamble(pre)
        gen.add_preamble(pre2)
        # Set `run` as the main function
        gen.set_main(run)
        # Copy non-class functions from widget code into the output
        gen.add_extern(self.update_selection)
        gen.add_extern(_leaf_indices)
        gen.add_extern(_subnode_range)
        # Modify some lines to suit the classless execution environment
        gen.null_ln("send(")
        gen.add_repl_map(("self", "ow"))
        # Link the output so it can be read by other widgets
        gen.add_output("data", "data", iscode=True)
        return gen

Here is the current version of the generated code produced by this workflow:

https://ameo.link/u/bin/2lb

I will admit that this is almost certainly not the most efficient way to accomplish the task. However, I'm still not (anywhere near) 100% with my knowledge of Orange3 and I will continue to optimize and improve these as I continue development.

As I mentioned in the comments, linking of input and output channels is handled automatically by the code generator. The code generator and the code generator initializer functions are only called once during the code generation process don't interfere at all with other widget functionality.

Don't be afraid to modify widgets or to introduce/enforce conventions that you need to simplify this use case. The more declarative, the nicer.

Moving forward, I will consider more aggressive code generation techniques that could produce more efficient code by modifying the widget itself.

kernc commented 8 years ago

So given above generated code, how does one tinker with preprocessors (add or remove a preprocessing step, or tweak a parameter) in the code to see how it influences further results? Easy tinkering is, or should be, one of the main goals, I suppose.

Ameobea commented 8 years ago

@Kernc I've added setting loading for most of the widgets that I've implemented so far to allow widgets to both be automatically configured from within the script as well as to prevent saved settings from being loaded and overriding widget defaults.

Here's a new look at the preprocessor widget showing the changes:

#
#preprocess1
#
preprocs = [None] * 2
preprocs[0] = ('orange.preprocess.randomize', {'rand_type': 1})
preprocs[1] = ('orange.preprocess.impute', {})
input_data = file0_data
ow = OWPreprocess()
ow.storedsettings = {"preprocessors": preprocs, "name": ""}
ow._initialize()
ow.set_data(input_data)
preprocessor = ow.buildpreproc()
if input_data is not None:
    try:
        data = preprocessor(input_data)
    except ValueError as e:
        print("Error preprocessing data.")
else:
    data = None
ow.set_data(None)
ow.handleNewSignals()
ow.saveSettings()
ow.onDeleteWidget()
preprocess1_preprocessor = preprocessor
preprocess1_preprocessed_data = data

The list of preprocessors are read out of the widget settings at code generation time and inserted into the output. They are then loaded into the preprocessor that is created during the script execution, overriding whatever may be persisted from other runs.

I really do understand importance of making the generated scripts tweakable and extensible (why else would one want to create one?) and I will work towards making every widget's generated code as close to that as possible. However, I really want to stick with the code generate approach rather than coding in a main function for the widget. I feel that it would be extremely difficult to create a function that could be plugged into the output script as-is with inspect due to the differences in code that exist between the two environments. I still hold the general idea of creating a main function that does the work of the widget, but so far I've found it best to pass it through the code generator to add stuff such as dynamic declarations and conditional function calls, etc.

kernc commented 8 years ago

If there must be a separate, duplicate code-generating function, why not have that code template be as optimized and as readable as possible? Why wouldn't OWPreprocess widget rather export to something like:

from Orange.preprocess import PreprocessorList
from Orange.preprocess import Impute, Normalize

input_data = file0_data
preprocess1_preprocessor = PreprocessorList([
    Impute(),
    Normalize(),
])
preprocess1_preprocessed_data  = preprocess1_preprocessor(input_data)

This is something a simplish

def as_code_string(self):
    return '''some {templated} string'''.format(
        all sorts of parameters gathered from around the self)

widget method could do. And the benefits of the latter generated example are quite obvious, I hope.

Ameobea commented 8 years ago

I first want to say that I am sorry if it seems that I am going against what you have in mind for the project and using methods that are suboptimal. The reason for this is probably more due to my personal weak points as a programmer and in understanding the Orange project as a whole than anything else.

My goal is not to create a string generator function for each widget that returns a pre-formatted and self-contained string to be concatenated to the main output script. Rather I want to provide a set of rules and data to send to a code generator that modifies a base function, adds things like declarations, links the inputs/outputs, and handles things like transferring from running code from inside a widget class with self to dealing with the widget as an external object or, as in your example, without constructing a widget at all.

I wholeheartedly plan to optimise and trim down both the generated code and the code responsible for generating it going forward. Creating readable, maintainable, and non-bloated code is my first priority and I will in no case try to merge any code that you or other members of the project find unacceptable or could be done better.

That being said, I ask that you take the versions of the code generation process presented in these updates as very WIP. I've been focusing on getting the code generator and the script output process as a whole in line before focusing on optimizing individual widgets. I sincerely believe that given more time I will be able to present a version that both fulfils all of Prof. Schmitt's requirements and meets your expectations for design and quality. Thank you for your patience with me on this, and thank you for taking the time to give me such valuable input!

Ameobea commented 8 years ago

I've improved the preprocessor code per your reccomendations @kernc:

#
#preprocess1
#
ow = OWPreprocess()
plist = [None] * 3
plist[0] = ow._qname2ppdef["orange.preprocess.randomize"].viewclass.createinstance(params={'rand_type': 0})
plist[1] = ow._qname2ppdef["orange.preprocess.impute"].viewclass.createinstance(params={'method': 5})
plist[2] = ow._qname2ppdef["orange.preprocess.scale"].viewclass.createinstance(params={'center': 1, 'scale': 0})

input_data = file0_data
preprocessor = PreprocessorList(plist)
data = preprocessor(input_data)
preprocess1_preprocessed_data = data
preprocess1_preprocessor = preprocessor

I still had to make an instance of the widget so I could get access to the _qname2ppdef function that was, as far as I could see, the only way to actually get the preprocessor function. There are probably ways to do this better, but after >1 hour of searching through the codebase this is the best I could find.

As always, I'm open to suggestions and feedback on what I've been doing. If the current code generator concept as a whole proves to not be acceptable, I'll do whatever is necessary to make it work.

Pelonza commented 8 years ago

@Ameobea we (I) understand it's all a WIP... we're just continuing to try and get it into both our visions of usefulness and clarity. I'm not sure what @kernc 's expectations were, but you are doing just fine by mine!

See: https://github.com/biolab/orange3/blob/master/Orange/preprocess/preprocess.py

For the preprocessing... If you are able to extract that you need to do "impute" and the params, why can't you just create a preprocessor directly with that? Seems like you should be able to make a preprocessor with: plist[0] = orange.preprocess("impute",data=None, params={'method':5}

I might have the actual python syntax a bit wrong for making an instance ... but the code is certainly in preprocess.py to directly make a preprocessor without resorting to the qname2ppdef.

Note that "Scale" is actually "Normalize" in the preprocess.py code...sort of.

@kernc The fact that the preprocess widget actually does directly a "scale" preprocess suggests there ought to be a direct analog in the pure orange library. And I don't see one when I'm looking through the preprocess folder. Neither normalize by Span or Normalize by SD is exactly the same. These are really "shifts" and "scaling"

It has a impute and randomize ability.

Ameobea commented 8 years ago

The Orange.preprocess constructor takes a class as its first argument, not a string. The data the widget supplies is a qname for each of the selected preprocessors, orange.preprocess.pca for example. What _qname2ppdef does is look up the preprocessor in an internal list of preprocessors to get the editor function that will return a create it. This editor contains a function called createinstance that actually creates the instance of the preprocessor, in the most minimal of cases calling Impute() or another imported preprocessing function.

However, sometimes the createinstance function is much more complex. Here's an example of it for Impute:

    def createinstance(params):
        params = dict(params)
        method = params.pop("method", ImputeEditor.Average)
        if method == ImputeEditor.NoImputation:
            return None
        elif method == ImputeEditor.Average:
            return preprocess.Impute()
        elif method == ImputeEditor.Model:
            return preprocess.Impute(method=preprocess.impute.Model())
        elif method == ImputeEditor.DropRows:
            return _RemoveNaNRows()
        elif method == ImputeEditor.DropColumns:
            return preprocess.RemoveNaNColumns()
        else:
            method, defaults = ImputeEditor.Imputers[method]
            defaults = dict(defaults)
            defaults.update(params)
            return preprocess.Impute(method=method)

As you can see, the actual process by which Orange creates and executes these preprocessors is rather complicated and would be very difficult to inline into the output script. If you wanted to drop the advanced preprocessor generation functionality and just use the base case of preprocess.Impute(method=method), there's still the problem of converting orange.preprocess.impute into Impute(), orange.preprocess.pca into Orange.projection.PCA(), etc. As I said before, there may be a way to do this that I don't know about, but from what I know using _qname2ppdef() is the best option.

Pelonza commented 8 years ago

Can you unwrap what gets sent into the widget's "preprocessor" variable? or actually call (in your hidden generator) some of the widget's construction functions.

I guess what I'd like to see in the final output code is more like:

plist[0] = preprocess.RemoveNaNColumns()
plist[1] = ProjectPCA(n_components)
...
preprocessor=PreProcessorList(plist)

etc....

I can see why it's a lot harder to unwrap those specific calls though. Because they are wrapped up in a "createinstance" function inside the specific class, which is accessible from the viewclass.

Could you actually parse through the stack-calls made by the widget? As I read the code, it seems like the call order/stack is:

User hits "Apply"--> Call OWPreProcess.apply(self) Call OWPreProcess.buildpreproc(self) (for each preprocessormodel): plist.append(create(params) ) --> This is calling OWPreprocess..createinstance Call OWPreprocess..createinstance

(interrupt here, and trace logic to pull out what gets returned?)

You want the guts of the "createinstance" where you basically output to the generated code what gets returned from any given instance of "createinstance" but with all the parameters actually input explicitly.

Ameobea commented 8 years ago

Yeah, I do understand what you're envisioning with this and agree that it would be an improvement over what I currently have. Tracing through the execution stack and/or parsing and extracting information from the source code by traversing the AST were both methods I considered during the development of the code generator. However, I passed them up due to the great complexity and amount of code they would require.

Although I'm sure it would be possible to implement such advanced features, it would require a lot of lines of code to make it work and a correspondingly large amount of time to write it. Simpler methods of doing this also exist, but those run into problems with duplicating code and creating difficult to maintain generator functions.

In addition, this is only one widget in a project with many dozens. If you feel that this is an important feature that you'd like me to focus on, of course I'll do so. However, I think it's important to keep in mind the project at large and perhaps prioritize such enhancements behind other tasks such as getting code generation working for a wider range of widgets.

Ameobea commented 8 years ago

I also created a WIP Pull Request on the main Orange project with the purpose of getting some more feedback on my code generator design: https://github.com/biolab/orange3/pull/1449

I just wanted to see if the current method I have of generating code is acceptable and if it would be considered a valid candidate for merging into the main project.

Pelonza commented 8 years ago

Ok. Sounds good! I do agree. What you've got now is at least readable/understandable. And modifiable.

Let's see what the rest of the Orange group thinks in regards to it. I'd definitely rather have a "complete" working output to python than a few perfectly implemented things.

That said.... sometimes it's harder to go back and make bigger sweeping changes if needed.

kernc commented 8 years ago

Sorry, been away a few days preparing my thesis defense. It went well.

there may be a way to do this that I don't know about

It's a question of separation of concerns (that Orange admittedly doesn't see enough). Is the code generation logic really the best place to convert preprocessor instances into their respective Orange.preprocess.Something(params) calls? We've discussed it a bit here, and we feel like our basic structures like learners and preprocessors should define their __repr__() methods to emit a clear representation of themselves. Compare:

>>> from Orange.preprocess import Impute
>>> Impute()
<Orange.preprocess.preprocess.Impute at 0x7f6853583550>

>>> from sklearn.naive_bayes import BernoulliNB
>>> BernoulliNB()
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

This way, you only need to call repr(preprocessor_instance) to serialize a readable, reconstructible, modifiable, perfect representation of that preprocessor.

See __repr__() docs:

If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment).

So there's some work there, but it's fairly easy work. :smiley:

Ameobea commented 8 years ago

Glad to hear that your thesis defence preparations went well.

That looks like a great solution; I'll prioritize work on that since it seems to be have an integral role in the construction of the rest of the code generator. I agree that this will fit very well with the code generation process.

In the meantime, I've been working to optimize the code generator and reduce code clutter wherever possible. Some of the ideas have turned out to make it more complicated than less and I've scrapped them. I want to be sure that what I'm doing will work in the long-term before devoting significant effort to something subpar.

Thanks again for your continued support and guidance!