PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the『飞桨』core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

API Simplification #10248

Closed wangkuiyi closed 6 years ago

wangkuiyi commented 6 years ago

Our current implementation of Fluid is incomplete and exposes too many details. A consequence is that Fluid applications are lengthy and hard to comprehend.

Let us aim for a cleanup and simplification.

wangkuiyi commented 6 years ago

The Current Problems

  1. Concepts like Executor should be hidden.
  2. Most of the complicated stuff is in the train-loop; we should encapsulate that loop in a train function.

For more details, let us take a look at the current example program fluid/test_fit_a_line.py, which has the following structure (a condensed sketch follows the list):

  1. Define the forward pass

    https://github.com/PaddlePaddle/Paddle/blob/83b1a8f6bf295fefcf44949d17b538de47eb522e/python/paddle/fluid/tests/book/test_fit_a_line.py#L26-L33

  2. Generate the backward pass

    https://github.com/PaddlePaddle/Paddle/blob/83b1a8f6bf295fefcf44949d17b538de47eb522e/python/paddle/fluid/tests/book/test_fit_a_line.py#L35-L36

  3. Create the reader

    https://github.com/PaddlePaddle/Paddle/blob/83b1a8f6bf295fefcf44949d17b538de47eb522e/python/paddle/fluid/tests/book/test_fit_a_line.py#L38-L43

  4. Run the startup program

    https://github.com/PaddlePaddle/Paddle/blob/83b1a8f6bf295fefcf44949d17b538de47eb522e/python/paddle/fluid/tests/book/test_fit_a_line.py#L50

  5. Run the Python train-loop, which calls the main program

    https://github.com/PaddlePaddle/Paddle/blob/83b1a8f6bf295fefcf44949d17b538de47eb522e/python/paddle/fluid/tests/book/test_fit_a_line.py#L52-L57
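
For reference, here is a condensed sketch of those five steps (paraphrasing the linked file, not quoting it; the layer and dataset names follow the fluid API of the time):

import paddle
import paddle.fluid as fluid

# 1. forward pass
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(cost)

# 2. backward pass, generated by the optimizer
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)

# 3. reader
train_reader = paddle.batch(paddle.dataset.uci_housing.train(), batch_size=20)

# 4. run the startup program once to initialize parameters
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# 5. the Python train-loop, which runs the main program
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
for pass_id in range(100):
    for data in train_reader():
        avg_loss, = exe.run(fluid.default_main_program(),
                            feed=feeder.feed(data),
                            fetch_list=[avg_cost])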

wangkuiyi commented 6 years ago

A Proposal

This idea came from @emailweixu. Here is a brief description with example code.

  1. Let us encapsulate the forward pass into a Python function:

    def F():
        x = fluid.layers.data(...)
        ...
        avg_cost = fluid.layers.mean(...)
  2. Let us invent a standard Fluid function fluid.train, which encapsulates the creation of the reader, the train-loop, and the generation of the backward pass:

    def train(F, ...):
        F()  # fills in startup_program and main_program
        exe = fluid.Executor(...)
        exe.run(startup_program)
        for iter in xrange(1000):
            exe.run(main_program)
  3. So users could rewrite test_fit_a_line.py as follows (a filled-in sketch appears after this list):

    def F():
        x = ...
        ...
        avg_cost = ...

    train(F, ...)
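
Filled in for fit-a-line, the rewritten script might look like this (a sketch only: the proposed train function does not exist yet, and whether it takes the optimizer as shown, or discovers the loss rather than receiving it from F, are open details of the proposal; the layer calls follow the existing fluid API):

def F():
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)
    return avg_cost

train(F, optimizer=fluid.optimizer.SGD(learning_rate=0.001))
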
reyoung commented 6 years ago

Following the proposal, an example train script could be

import paddle.fluid as fluid
import paddle.v2.dataset as dataset

def conv_network():
    image = fluid.layers.data(name='image', shape=[1, 28, 28])
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    hidden = fluid.layers.simple_img_conv_pool(image, 
            num_filters=32, filter_size=3,
            pool_size=3, pool_stride=1, act='relu')
    hidden = fluid.layers.dropout(hidden, 0.1)
    hidden = fluid.layers.batch_norm(hidden)

    prediction = fluid.layers.fc(hidden, size=10, act='softmax')
    loss = fluid.layers.cross_entropy(prediction, label)

    return loss

def main():
    trainer = fluid.Trainer(conv_network, optimizer=fluid.optimizer.SGD())

    def event_handler(event):
        if isinstance(event, fluid.EndIteration):
            print event.metrics
        elif isinstance(event, fluid.EndPass):
            test_metrics = trainer.test(reader=dataset.mnist.test())
            print test_metrics

    trainer.train(reader=dataset.mnist.train(), num_pass=100, event_handler=event_handler)

if __name__ == '__main__':
    main()
panyx0718 commented 6 years ago

For common models, this skeleton looks good. (It still needs more polish and thought.)

Overall, I think we need to have two levels of APIs: high-level and low-level.

The high-level API simplifies network construction for common models, such as ResNet and LSTM, and is built on top of the low-level APIs.

However, we need to be sure that users still have the flexibility to build complex models with our low-level (more fine-grained) APIs.

One last point: when our design is more stable, we need to ask our modeling team members (qingqing, yaming, yibing, etc.) for advice. We need to make sure our API has good coverage of current and future models.

JiayiFeng commented 6 years ago

I think the key problem that makes current Fluid hard to use is that users can hardly understand our 'program'. Furthermore, in Fluid most features require more than one program. For example, if a user needs to run inference on test data every 10 training batches, he has to build and maintain two programs: one for training and another for testing. Most users know neither why there should be two programs nor how to correctly build them.

In my view, the most exciting point of this issue's proposal is to wrap the user's net config in a function and then pass that function to some other object. Based on this idea, maybe we can introduce the concept of a ProgramBuilder. A ProgramBuilder takes a forward net config function defined by the user (F() in the demo) and adds complementary ops (optimizers, gradient ops, ...) to generate specific programs (a training program, a testing program, and so on). Programs are built and maintained by the ProgramBuilder automatically. A trainer can take a ProgramBuilder and execute the corresponding program.

In this method, users no longer need to understand programs, for they will not directly use them anymore.
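
A minimal sketch of that idea, with ProgramBuilder as a hypothetical name, built on the existing fluid.Program / program_guard machinery:

import paddle.fluid as fluid

class ProgramBuilder(object):
    # Hypothetical class: owns every program so users never touch one directly.
    def __init__(self, net_fn, optimizer):
        self.startup_program = fluid.Program()
        self.train_program = fluid.Program()
        with fluid.program_guard(self.train_program, self.startup_program):
            self.loss = net_fn()  # the user's forward config, F() in the demo
            # clone before appending optimize ops, so the test program
            # contains only the forward pass
            self.test_program = self.train_program.clone(for_test=True)
            optimizer.minimize(self.loss)  # appends gradient and update ops

A trainer could then hold a ProgramBuilder and pick train_program or test_program as needed.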

By the way, in the proposed design, how would we support GAN?

wangkuiyi commented 6 years ago

@JiayiFeng It seems that we need to allow users to write the train-loop themselves. (I was taking the PyTorch version as a reference.) I am afraid this simplified API cannot support that, and we might want it in the next milestone. What do you think?

emailweixu commented 6 years ago

Clearly, this high-level API cannot satisfy all needs (e.g., reinforcement learning, GAN). The current V2 API cannot either. It might be possible to tweak it a little (say, combining the model and optimizer into one object to pass to the trainer) to make GAN possible. We need to think about to what level we can clean up the low-level API to support user-written train-loops in Python.

abhinavarora commented 6 years ago

@reyoung Do you have any suggestions on how inference will work with the paradigm you have shared? I am not sure this API style will be compatible with the inference engine work done in Q1.

helinwang commented 6 years ago

How about this? I think it can support GAN:

import paddle.fluid as fluid
import paddle.v2.dataset as dataset

def conv_network():
    image = fluid.layers.data(name='image', shape=[1, 28, 28])
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    hidden = fluid.layers.simple_img_conv_pool(image,
                                               num_filters=32, filter_size=3,
                                               pool_size=3, pool_stride=1, act='relu')
    hidden = fluid.layers.dropout(hidden, 0.1)
    hidden = fluid.layers.batch_norm(hidden)

    prediction = fluid.layers.fc(hidden, size=10, act='softmax')
    loss = fluid.layers.cross_entropy(prediction, label)

    return loss

def train_conv_network():
    loss = conv_network()
    sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
    sgd_optimizer.minimize(loss)
    return loss

def main():
    # `fluid.Compile` creates a program,
    # the program owns the program desc, and a single scope.
    # Because the scope is shared by different methods (`conv_network`, `train_conv_network`),
    # GAN should be supported.
    program = fluid.Compile(conv_network, train_conv_network)
    for i in range(0, 100):
        for train_data in dataset.mnist.train():
            loss = program.run("train_conv_network", {"image": train_data[0], "label": train_data[1]})
            print("train loss", loss)

        for test_data in dataset.mnist.test():
            loss = program.run("conv_network", {"image": test_data[0], "label": test_data[1]})
            print("test loss", loss)

if __name__ == '__main__':
    main()
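
Under this proposal, a GAN might be written as two train functions over shared weights (purely hypothetical: fluid.Compile is only being proposed here, and the function names below are made up):

import paddle.fluid as fluid
import paddle.v2.dataset as dataset

def train_discriminator():
    # would build D(real) and D(G(z)) losses plus D's optimizer step;
    # D and G weights live in the single scope owned by `fluid.Compile`
    pass

def train_generator():
    # would build the generator loss through a frozen D plus G's
    # optimizer step, reading the same shared scope
    pass

program = fluid.Compile(train_discriminator, train_generator)
for pass_id in range(100):
    for batch in dataset.mnist.train():
        d_loss = program.run("train_discriminator", {"image": batch[0]})
        g_loss = program.run("train_generator", {})
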
emailweixu commented 6 years ago

@helinwang How does your proposal handle distributed training?

helinwang commented 6 years ago

@emailweixu For the trainer, fluid.Compile can check environment variables to determine whether it's distributed training and produce the correct compiled program:

TRAINING_ROLE=TRAINER PSERVERS=127.0.0.1:8000 python train.py

For pserver, the user can do something like:

TRAINING_ROLE=PSERVER paddle run --file train.py --main train_conv_network

The key is that the entry point is no longer Python; instead, it's the paddle binary, which parses the train_conv_network function into a pserver program and runs it.

helinwang commented 6 years ago

@emailweixu maybe a simpler way to start pserver is:

TRAINING_ROLE=PSERVER python train.py

And now fluid.Compile detects the PSERVER environment variable and produces a program such that program.run runs the pserver operators.
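
Internally, that detection could look roughly like the following (a sketch only; _compile_for_role is a made-up helper, and the DistributeTranspiler calls assume the fluid API of that era):

import os
import paddle.fluid as fluid

def _compile_for_role(main_program):
    # hypothetical internals of fluid.Compile
    pservers = os.environ.get('PSERVERS')
    if not pservers:
        return main_program  # local, single-process training
    t = fluid.DistributeTranspiler()
    t.transpile(trainer_id=0, program=main_program,
                pservers=pservers, trainers=1)
    if os.environ.get('TRAINING_ROLE') == 'PSERVER':
        # the pserver program loops on RPC requests, not on a data reader
        return t.get_pserver_program(pservers.split(',')[0])
    return t.get_trainer_program()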

emailweixu commented 6 years ago

@helinwang The problem is that the pserver does not have a loop over a data generator. With your design, user code has to do different things depending on whether it is a pserver or a trainer.

pkuyym commented 6 years ago

Awesome discussion. I have some naive thoughts: for complicated networks, we should handle naming carefully. For example, fc layers may appear anywhere, and I think an auto-naming mechanism is not enough, even though we can pass a specified parameter name. I think we can design something better:

with net.module('generator') as generator:
    data1
    ...
    with net.name_scope('scope') as sub_scope:
        fc1 = fluid.layers.fc(...)
    ...

with net.module('discriminator') as discriminator:
    data1
    data2
    ...
    fc2 = fluid.layers.fc(input=generator.sub_scope.fc1)
    ...

net.module holds a complete logic block. We may analyze the dependencies to decide whether to compile one ProgramDesc or more than one. We can require that all computation logic within a module or name_scope share a naming space.

reyoung commented 6 years ago

@pkuyym Actually, fluid supports name scopes right now: fluid.unique_name.guard(). It is basically the same API as you proposed.

pkuyym commented 6 years ago

@reyoung Thanks for the reminder. Here is a snippet:

with fluid.unique_name.guard():
    train_file_obj = fluid.layers.open_files(
        filenames=TRAIN_FILES,
        pass_num=pass_num,
        shapes=[[-1, 1], [-1, 1]],
        lod_levels=[1, 0],
        dtypes=['int64', 'int64'],
        thread_num=1)

I think it would make the API friendlier to extend the current unique_name.guard to support:

# add a prefix to make debugging easier
with fluid.unique_name.guard('prefix_1') as scope_1:
    fc = fluid.layers.fc(...)

with fluid.unique_name.guard('prefix_2') as scope_2:
    fc = fluid.layers.fc(input=scope_1.fc)  # very convenient to refer to fc in scope_1
JiayiFeng commented 6 years ago

@wangkuiyi In my opinion, even in GANs the multiple nets have a fixed running order. So maybe we can allow the trainer to take more than one net config (in the form of a list), generate multiple sets of programs, and use a for-loop inside the trainer to execute them in turn.

This idea is similar to @helinwang's proposal. However, @helinwang proposes to compile all nets into a single program, while I tend to assign each net an independent program. A sketch follows.
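
A sketch of that variant, with MultiNetTrainer and make_optimizer as made-up names (each net config gets its own startup/main program pair):

import paddle.fluid as fluid

class MultiNetTrainer(object):
    # Hypothetical: one independent program per net config.
    def __init__(self, net_fns, make_optimizer):
        self.programs = []
        for net_fn in net_fns:
            startup, main = fluid.Program(), fluid.Program()
            with fluid.program_guard(main, startup):
                make_optimizer().minimize(net_fn())
            self.programs.append((startup, main))

    def train(self, exe, feed_fns, num_passes):
        for startup, _ in self.programs:
            exe.run(startup)
        for _ in range(num_passes):
            # execute the nets in their fixed order, e.g. D then G for a GAN
            for (_, main), feed_fn in zip(self.programs, feed_fns):
                exe.run(main, feed=feed_fn())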

helinwang commented 6 years ago

The problem is that the pserver does not have a loop over a data generator. With your design, user code has to do different things depending on whether it is a pserver or a trainer.

@emailweixu Thanks for pointing that out; that is correct. Another possibility is to do it in fluid.Compile: when running as a pserver, fluid.Compile will compile the pserver program and run it immediately.

Still, it's somewhat unsatisfactory, because the user may have done something in the Python code before fluid.Compile under the assumption that it is used for training, not for running the pserver. I think reusing fluid.train as the entry point for running pserver operators arguably has the same issue.

The cleanest way, I think, is to "extract" the Fluid program definition code from the Python glue code, and run only the Fluid program definition code. Following this logic, one way would be:

# assuming train.py is in the same folder
paddle run_pserver --main train.train_conv_network

Internally paddle run_pserver does something like:

import os
import paddle.fluid as fluid
import train

os.environ['TRAINING_ROLE'] = "PSERVER"
program = fluid.Compile(train.train_conv_network) # transpile happens inside
program.run()
cs2be commented 6 years ago

All, we did some thinking about how inference can be done. Please review our proposal:

import paddle.fluid as fluid
import paddle.v2.dataset as dataset

def inference_network():
    image = fluid.layers.data(name='image', shape=[1, 28, 28])

    hidden = fluid.layers.simple_img_conv_pool(image, 
            num_filters=32, filter_size=3,
            pool_size=3, pool_stride=1, act='relu')
    hidden = fluid.layers.dropout(hidden, 0.1)
    hidden = fluid.layers.batch_norm(hidden)

    prediction = fluid.layers.fc(hidden, size=10, act='softmax')
    return prediction

def train_network():
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    prediction = inference_network()
    loss = fluid.layers.cross_entropy(prediction, label)
    return loss

def main():
    params = fluid.Params('./params')
    # If params is not None it will be loaded to Trainer
    trainer = fluid.Trainer(train_network, optimizer=fluid.optimizer.SGD(), params=params)

    def event_handler(event):
        if isinstance(event, fluid.EndIteration):
            print event.metrics
        elif isinstance(event, fluid.EndPass):
            test_metrics = trainer.test(reader=dataset.mnist.test())
            print test_metrics

    # Train over 100 epochs
    trainer.train(reader=dataset.mnist.train(), num_pass=100, event_handler=event_handler)

    inferencer = fluid.Inferencer(inference_network, trainer.params)
    prediction = inferencer.infer({'image': <IMAGE_DATA>})

if __name__ == '__main__':
    main()
jetfuel commented 6 years ago

When we were trying to implement the Params class, we realized it was pretty ugly to implement with a shared scope. Therefore we updated the syntax to the following.

import paddle.fluid as fluid
import paddle.v2.dataset as dataset

def inference_program():
    image = fluid.layers.data(name='image', shape=[1, 28, 28])

    hidden = fluid.layers.simple_img_conv_pool(image,
                                               num_filters=32, filter_size=3,
                                               pool_size=3, pool_stride=1, act='relu')
    hidden = fluid.layers.dropout(hidden, 0.1)
    hidden = fluid.layers.batch_norm(hidden)

    prediction = fluid.layers.fc(hidden, size=10, act='softmax')
    return prediction

def train_program():
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    prediction = inference_program()
    cost = fluid.layers.cross_entropy(prediction, label)
    avg_cost = fluid.layers.mean(cost)
    return avg_cost

def main():
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    trainer = fluid.Trainer(program_func=train_program,
                            optimizer=fluid.optimizer.SGD(),
                            param_path="image.model",
                            place=place)

    def event_handler(event):
        if isinstance(event, fluid.EndEpochEvent):
            pass
        elif isinstance(event, fluid.EndStepEvent):
            test_metrics = trainer.test(reader=test_reader)

    trainer.train(num_epochs=1,
                  event_handler=event_handler,
                  reader=train_reader,
                  feed_order=['image', 'label'])

    inferencer = fluid.Inferencer(inference_program, param_path="image.model", place=place)
    prediction = inferencer.infer({'image': <IMAGE_DATA>})

if __name__ == '__main__':
    main()

I also noticed there is another design. The change is to have the Trainer handle the inference program via infer_func. Is this version later than the one above?

def main():
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    trainer = fluid.Trainer(program_func=train_program,
                            infer_func=inference_program,
                            optimizer=fluid.optimizer.SGD(),
                            param_path="image.model",
                            place=place)

    def event_handler(event):
        if isinstance(event, fluid.EndEpochEvent):
            pass
        elif isinstance(event, fluid.EndStepEvent):
            test_metrics = trainer.test(reader=test_reader)

    trainer.train(num_epochs=1,
                  event_handler=event_handler,
                  reader=train_reader,
                  feed_order=['image', 'label'])

    inferencer = fluid.Inferencer(param_path="image.model", place=place)
    prediction = inferencer.infer({'image': <IMAGE_DATA>})
jetfuel commented 6 years ago

Latest Syntax

import paddle.fluid as fluid
import paddle.v2.dataset as dataset

def inference_program():
    image = fluid.layers.data(name='image', shape=[1, 28, 28])

    hidden = fluid.layers.simple_img_conv_pool(image,
                                               num_filters=32, filter_size=3,
                                               pool_size=3, pool_stride=1, act='relu')
    hidden = fluid.layers.dropout(hidden, 0.1)
    hidden = fluid.layers.batch_norm(hidden)

    prediction = fluid.layers.fc(hidden, size=10, act='softmax')
    return prediction

def train_program():
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    prediction = inference_program()
    cost = fluid.layers.cross_entropy(prediction, label)
    avg_cost = fluid.layers.mean(cost)
    return avg_cost

def main():
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    trainer = fluid.Trainer(program_func=train_program,
                            optimizer=fluid.optimizer.SGD(),
                            param_path="image.model",
                            place=place)

    def event_handler(event):
        if isinstance(event, fluid.EndEpochEvent):
            trainer.save_inference_model("image.model")
        elif isinstance(event, fluid.EndStepEvent):
            test_metrics = trainer.test(reader=test_reader)

    trainer.train(num_epochs=1,
                  event_handler=event_handler,
                  reader=train_reader,
                  feed_order=['image', 'label'])

    inferencer = fluid.Inferencer(infer_func=inference_program,
                                  param_path="image.model", place=place)
    prediction = inferencer.infer({'image': <IMAGE_DATA>})

if __name__ == '__main__':
    main()
daming-lu commented 6 years ago

Based on the discussion here, we should follow the pattern here

shanyi15 commented 6 years ago

Hello, this issue has not been updated in the past month, so we will close it within the day for the sake of other users' experience. If you still need to follow up after it is closed, feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!