dmlc / minpy

NumPy interface with mixed backend execution
https://minpy.readthedocs.io/en/latest/

Possible improvements to the wrapper #160

Open wddabc opened 7 years ago

wddabc commented 7 years ago

While minpy is a great tool for building dynamic nets, I found the wrapper (Solver, ModelBase, ...) a little hard to use, and I had to do a lot of hacks to make it work. Here are some thoughts on possible improvements at the current stage:

  1. Many things should be passed into Solver.train rather than the constructor, such as the data iterator, the optimizer, etc. In fact, Solver.train should support even more, such as whether to do cross-validation and callback functions. I think a good template is the .fit function in Keras.
  2. As pointed out in #147, users should have the flexibility to write their own training loop, so I think a public step method is necessary.
  3. The definition of the final training loss is inflexible.

    def loss_func(*params):  # pylint: disable=unused-argument
        """Loss function that calculates the loss."""
        # It seems that params are not used in the forward function. But since we
        # pass model.params as arguments, we are ok here.
        predict = self.model.forward_batch(batch, mode='train')
        return self.model.loss_batch(batch, predict)

    Assuming the data is x and the response is y, it is not necessarily the case that the final loss can be decomposed into loss(model.predict(x), y). In some cases it is just a single loss model.loss(x, y), which means no prediction is made at training time. This is a common case in structured prediction problems, such as conditional random fields (a minimal sketch of such a case follows).
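To illustrate the point, here is a rough, hypothetical sketch (plain NumPy, made-up class and parameter names) of a CRF-style tagger whose training loss is a sequence negative log-likelihood computed directly from (x, y); there is no intermediate prediction to feed into a separate criterion:

    import numpy as np
    from scipy.special import logsumexp

    class CRFTagger(object):
        """Hypothetical model: the loss couples x and y through the partition
        function, so it cannot be written as loss(predict(x), y)."""

        def __init__(self, num_feats, num_tags):
            self.W = 0.01 * np.random.randn(num_feats, num_tags)  # emission weights
            self.T = 0.01 * np.random.randn(num_tags, num_tags)   # transition weights

        def loss_batch(self, batch):
            x, y = batch                      # x: (seq_len, num_feats), y: (seq_len,) int tags
            scores = x.dot(self.W)            # per-position emission scores
            # Score of the gold path: emissions plus transitions along y.
            gold = scores[np.arange(len(y)), y].sum() + self.T[y[:-1], y[1:]].sum()
            # Log partition function via the forward algorithm.
            alpha = scores[0]
            for t in range(1, len(y)):
                alpha = scores[t] + logsumexp(alpha[:, None] + self.T, axis=0)
            return logsumexp(alpha) - gold    # negative log-likelihood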

sneakerkg commented 7 years ago

@wddabc Thanks a lot for the suggestion. We are going to refine the nn package to make it more convenient for users. For the loss_func part, what do you think about pre-defining some loss or criterion templates for users?

wddabc commented 7 years ago

I think directly returning self.model.loss_batch(batch) should be fine. Something like:

    def loss_func(*params):
        return self.model.loss_batch(batch)

Users can define whatever they like as the implementation of model.loss_batch(batch), for example loss(model.predict(x), y). model.predict is optional; users can define it or not depending on whether they want to make predictions. For example, users might just use the model to pretrain some word embeddings, in which case model.predict is not necessary. (A sketch of such a model is below.)
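As a rough illustration of that contract (hypothetical class and parameter names, plain NumPy so the sketch is self-contained), a model used only to pretrain embeddings could implement loss_batch and simply never define predict:

    import numpy as np  # a real minpy model would use minpy.numpy instead

    class EmbeddingPretrainer(object):
        """Hypothetical model that only implements loss_batch; predict() is
        never defined because the model only pretrains word embeddings."""

        def __init__(self, vocab_size, dim):
            # In a real minpy model this would be declared via ModelBase.add_param.
            self.params = {'emb': 0.01 * np.random.randn(vocab_size, dim)}

        def loss_batch(self, batch):
            center, context = batch                    # integer index arrays
            v = self.params['emb'][center]
            u = self.params['emb'][context]
            score = np.sum(v * u, axis=-1)
            # Skip-gram-style logistic loss; training never calls predict().
            return np.mean(np.log1p(np.exp(-score)))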

One thing to be careful about: a wrapper on top of loss_batch is still necessary to control whether the dropout/regularizer is switched on. It might be defined inside loss_func? I'm not quite sure.

The dropout/regularizer should be declared by the user as arguments of .add_param() and applied before calling self.params[name]. In fact, I'd suggest changing self.params[name] to self.params(name), where self.params() decides whether to apply dropout based on whether it is in the training stage. (A sketch of such an accessor is below.)
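A minimal sketch of what such an accessor might look like, assuming a made-up ParamStore class and add_param signature (not minpy's actual API):

    import numpy as np

    class ParamStore(object):
        """Hypothetical self.params(name) accessor: raw weights live in a dict,
        and the call decides whether to apply dropout to the weights depending
        on the training stage."""

        def __init__(self):
            self._params = {}    # name -> ndarray
            self._dropout = {}   # name -> drop probability declared via add_param
            self.training = True

        def add_param(self, name, value, dropout=0.0):
            self._params[name] = value
            self._dropout[name] = dropout
            return self

        def __call__(self, name):
            w = self._params[name]
            p = self._dropout[name]
            if self.training and p > 0.0:
                # Inverted dropout applied to the parameter itself.
                mask = (np.random.rand(*w.shape) >= p) / (1.0 - p)
                return w * mask
            return w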

Here are some other issues that should be considered for the new design (I think I'm basically proposing something Keras-like):

  1. ModelBase looks like a standalone module that doesn't support calling other models. That is, you have to define one huge graph within a single ModelBase.

  2. Also, jointly training different models with shared parameters is not supported in the current design. For this case, I think the right design is for the parameters of each sub-module to be globally visible to one another.

sneakerkg commented 7 years ago

@wddabc Thanks a lot for the suggestions. We totally agree that a flag like "isTrain" is needed when calling loss_batch. And:

  1. For the ModelBase problem, we now have the concept of a Module in https://github.com/dmlc/minpy/blob/master/minpy/nn/model_builder.py. We can have multiple sub-modules in a whole model.
  2. Thanks for pointing out the shared-parameter issue. Could I bother you to give us some suggestions on the interface for shared-parameter functionality?
  3. I don't quite understand why we need to apply dropout to the params. Is it a common trick in NLP tasks? For vision and audio tasks, dropout is usually applied to the activations.

wddabc commented 7 years ago

On Thu, Mar 16, 2017 at 7:49 PM, Tianjun Xiao notifications@github.com wrote:

@wddabc Thanks a lot for the suggestions. We totally agree on the thought that flags like "isTrain" is needed when calling loss_batch; And:

  1. For modelBase problem, now we have the concept Module in https://github.com/dmlc/minpy/blob/master/minpy/nn/model_builder.py . We can have multiple sub-module in a whole model;

It looks nice. I found there is a module called Parallel which hasn't been implemented yet. Is this a version where you support appending multiple inputs? I'm actually thinking about making this more general by building a graph rather than a list (I mean Sequential); that is more flexible, I think. Also, although it is a minor issue, I might consider renaming forward to call, as a layer is basically a function of its input. This design is very close to the Functional API in Keras (https://keras.io/getting-started/functional-api-guide/), which I like a lot. One major difference between minpy and Keras is what call does: in Keras, call builds a static graph and the weights are initialized after shape inference. Minpy is different, as we are doing real computation! One idea is to initialize the weights lazily at the first call, when the input shape is known, then set a flag (say built) to true and skip this initialization on the following passes (a sketch is below). This sounds a little bit hacky, as the initialization shouldn't depend on the first input; for example, what if I just want to build the model and save it before passing any data?
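A minimal sketch of the lazy-initialization idea, using a made-up Dense layer and plain NumPy rather than any actual minpy class:

    import numpy as np

    class Dense(object):
        """Hypothetical layer: the weight shape is only known once the first
        input arrives; after that, `built` is set and initialization is skipped."""

        def __init__(self, out_features):
            self.out_features = out_features
            self.built = False

        def __call__(self, x):
            if not self.built:
                in_features = x.shape[-1]   # inferred from the first input
                self.weight = 0.01 * np.random.randn(in_features, self.out_features)
                self.bias = np.zeros(self.out_features)
                self.built = True
            return x.dot(self.weight) + self.bias

The drawback mentioned above shows up immediately: before the first call there is nothing to save.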

A less hacky way, though I think still hacky, is adding a build function and asking the user to call it before running. The build function does nothing but a dry run through the graph. The idea is to use a global private flag to indicate whether it is a dry run; if true, use the size of the input to initialize the weights, ignore the computation, and just output a random value with the same size as the declared output (a sketch is below). The reason we want to ignore the computation is: 1) it might introduce a startup cost; 2) more seriously, the user might define the graph on specific types of data, and our random value might crash the program (for example, the user might define a division, and we pass a zero-valued dummy matrix).
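Again a rough sketch of that dry-run idea, with a hypothetical module-level flag and build helper (none of these names exist in minpy):

    import numpy as np

    _DRY_RUN = False  # hypothetical module-level flag toggled by build()

    class Dense(object):
        def __init__(self, out_features):
            self.out_features = out_features
            self.weight = None

        def __call__(self, x):
            if _DRY_RUN:
                if self.weight is None:
                    # Only the input size is used, to allocate the weights.
                    self.weight = 0.01 * np.random.randn(x.shape[-1], self.out_features)
                    self.bias = np.zeros(self.out_features)
                # Skip the real computation; emit random values of the declared
                # output shape so downstream layers can infer their own shapes.
                return np.random.rand(x.shape[0], self.out_features)
            return x.dot(self.weight) + self.bias

    def build(model, dummy_input):
        """Dry run: shapes flow through the graph, no real math is done."""
        global _DRY_RUN
        _DRY_RUN = True
        model(dummy_input)
        _DRY_RUN = False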

  2. Thanks for pointing out shared parameter issue. Could I bother you give us some suggestion on the interface for shared parameter functionality?

I think all the parameters are declared in the submodules but actually stored in a global dictionary. There are two ways to access these parameters: the first is using local names by calling a local method; the other is directly looking up the global dictionary using global names. The global name is something like a concatenation of the module name and the local name. If users would like to involve shared parameters, they should explicitly look them up by global name. I think this is flexible enough. (A sketch is below.)
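A tiny sketch of that idea, with made-up names (ParamServer, declare, get) just to show the module-name/local-name convention:

    class ParamServer(object):
        """Hypothetical global store: every parameter is declared inside a
        sub-module but lives here under '<module_name>:<local_name>'."""

        def __init__(self):
            self.params = {}

        def declare(self, module_name, local_name, value):
            self.params[module_name + ':' + local_name] = value

        def get(self, global_name):
            return self.params[global_name]

    # store.declare('convolution0', 'W', w0)   # declared locally, stored globally
    # shared = store.get('convolution0:W')     # another module shares it explicitly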

  3. I'm not quite understand why we need to apply dropout on the params. Is it a common trick in NLP tasks? For vision and audio task, dropout is usually applied on the activations.

This is called DropConnect (http://www.matthewzeiler.com/pubs/icml2013/icml2013.pdf): instead of dropping out the hidden outputs, you drop some of the connections (weights). Well, it is not that frequent in NLP either, but I remember I once applied dropout to entire word embeddings to address an out-of-vocabulary issue. (See the illustration below.)
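For concreteness, a small NumPy illustration of the difference (the shapes and drop rate here are made up):

    import numpy as np

    p = 0.5
    W = np.random.randn(256, 128)   # weights
    h = np.random.randn(32, 256)    # activations from the previous layer

    # Standard dropout: mask the activations.
    h_drop = h * (np.random.rand(*h.shape) >= p) / (1 - p)

    # DropConnect: mask the connections (individual weight entries) instead.
    W_drop = W * (np.random.rand(*W.shape) >= p) / (1 - p)
    out = h.dot(W_drop)

    # Dropping entire word-embedding rows (the out-of-vocabulary trick above).
    emb = np.random.randn(10000, 300)
    emb_drop = emb * (np.random.rand(emb.shape[0], 1) >= p) / (1 - p)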


wddabc commented 7 years ago

I think there is still a flaw in the dry-run idea: I forgot that users might want to build the graph themselves, and they wouldn't implement this. Well, this looks like a hard problem, as it needs static analysis of a dynamic graph. But the lazy-init approach still seems viable. Or we just let it go and let users take care of the shapes.

GaiYu0 commented 7 years ago

@wddabc Thank you very much for your suggestions! (I should have participated in the discussion earlier.)

The module Parallel has not been implemented because, in the framework of model_builder, it is difficult to provide a flexible interface for users to specify an operation on the multiple outputs of Parallel. The difficulty is that a graph created by model_builder is essentially a "list", as you said. I agree that model_builder should support more general graph composition, and once general graph composition is implemented it will be simple for users to manipulate the outputs of Parallel.

GaiYu0 commented 7 years ago

Resolving parameter sharing might be easier. I am implementing an initializer that enables users to index initialization configuration by parameter. For example, initializer[convolution0.weight].std = 0.1 specifies that the weight of convolution layer 0 should be sampled from a distribution with a standard deviation of 0.1. On this basis it is possible to implement an interface of the form initializer.share(parameters), which binds those parameters during initialization. No other modification is required for forward functions. What do you think? (A sketch of the idea is below.)
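A minimal sketch of how such an initializer could be structured (the class and attribute names are assumptions, not the actual implementation):

    class ParamConfig(object):
        """Per-parameter initialization settings, e.g. .std and .mean."""
        def __init__(self):
            self.std = 0.01
            self.mean = 0.0

    class Initializer(object):
        """Hypothetical initializer: configuration is indexed by parameter, and
        share() binds parameters so they are initialized to the same array."""

        def __init__(self):
            self._configs = {}   # parameter -> ParamConfig
            self._groups = []    # groups of parameters sharing one array

        def __getitem__(self, parameter):
            return self._configs.setdefault(parameter, ParamConfig())

        def share(self, parameters):
            self._groups.append(list(parameters))

    # initializer = Initializer()
    # initializer[convolution0.weight].std = 0.1
    # initializer.share([convolution0.weight, convolution1.weight])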

GaiYu0 commented 7 years ago

Also, model_builder only supports static graph composition for the time being, i.e. statically declare a graph and use a function to convert the graph to a model. Dry-running is going to be an issue when model_builder supports dynamic graph composition in the future, as PyTorch does. PyTorch forces users to declare, in advance, all layers that are going to be used in dynamic graph composition, so that PyTorch can register related info, e.g. parameters. I am not sure whether PyTorch figures out parameter shapes dynamically. It appears that the information provided in a PyTorch layer declaration is sufficient for PyTorch to figure out the shapes in advance. For example, the fully-connected layer in PyTorch is:

[screenshot of the PyTorch fully-connected (Linear) layer declaration]

The shape of the weight is (in_features, out_features), which can be inferred from the layer declaration alone. model_builder only requires out_features, so the shape of the weight must be inferred from data. Perhaps PyTorch-style dynamic graph composition is more feasible in terms of implementation difficulty?
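For reference, a minimal PyTorch snippet showing that both dimensions of the weight are fixed at declaration time, before any data is seen:

    import torch.nn as nn

    # in_features and out_features are both given in the declaration, so the
    # weight can be allocated immediately, without looking at the data.
    fc = nn.Linear(in_features=128, out_features=64)
    print(fc.weight.shape)  # both dimensions come from the declaration alone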

wddabc commented 7 years ago

I agree, as I said in my last email. Maybe it is ok to just let users take care of the shapes, just as PyTorch does. The dynamic graph by itself is easy enough.

Regarding the shared parameters, I also agree that this is easy to fix. For the initialization, did you mean you want a separate initializer to take care of it? I think specifying the initialization method as an argument of the parameter declaration is good enough, just as the current implementation does. As a user, I'm thinking about something like this:

class RNNNet(ModelBase):
    def __init__(self,
                 batch_size=100,
                 input_size=2,
                 hidden_size=64,
                 num_classes=1):
        super(RNNNet, self).__init__()
        (self
         # When using a random init, the shape should be given.
         .add_param(name='Wx', shape=(input_size, hidden_size), init='uniform')
         # When an actual np ndarray is given, the shape can be omitted.
         .add_param(name='Wh', init=np.zeros((hidden_size, hidden_size)))
         # Shared parameter: the shape can also be omitted. 'b' becomes a local
         # alias of the global name 'convolution0:W'.
         # Note: share= and global.parameter() are proposed interfaces, not existing minpy API.
         .add_param(name='b', share=global.parameter('convolution0:W')))

This is just from a user's perspective.

zzhang-cn commented 7 years ago

Hi all,

So far, we have focused on building a sequential network, but adding per-layer init and optimization rules, each of which can be dynamically changed. This adds flexibility and simplifies the logic.

I suggest we build upon these principles and study Keras (https://keras.io/getting-started/functional-api-guide/) to complete the functionality incrementally. Things like:

There's no rush to implement them right away; we can put all the prototypes/signatures in a Google doc and hash them out. What do you all think?

-zz


wddabc commented 7 years ago

Agree. I think one important thing is figuring out what functionality a Module should have (for example, load(), save(), fit(), predict(), ...) and how these could be implemented in the context of a dynamic graph. Starting from a sequential network is a good way to begin. (A rough interface sketch is below.)
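To make the list concrete, a bare-bones sketch of such an interface (method names taken from the discussion above; the signatures are assumptions, not an agreed design):

    class Module(object):
        """Hypothetical interface consolidating the functionality discussed."""

        def forward(self, *inputs):
            raise NotImplementedError

        def loss_batch(self, batch):
            raise NotImplementedError

        def fit(self, data_iter, optimizer, epochs=1, callbacks=None):
            """Training loop; would call a public step(), as discussed in #147."""
            raise NotImplementedError

        def predict(self, data_iter):
            raise NotImplementedError

        def save(self, path):
            raise NotImplementedError

        def load(self, path):
            raise NotImplementedError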