Errors for UNet3D application on distconv LBANN

JBae2 commented 2 years ago

Hello, I am trying to run the supported UNet3D application code from the LBANN GitHub repository, but it fails.

In the distconv environment and its related source code, it looks like input with the "labels" data_field is not supported yet. The source code also mentions that Distconv currently only supports CosmoFlow data.
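
The mismatch described above can be illustrated with a minimal sketch of a pre-launch check that compares the data fields a script requests against the fields a backend is assumed to accept. Everything in it is hypothetical: `DISTCONV_FIELDS_ASSUMED`, `find_unsupported_fields`, and the contents of the field set are placeholders for illustration and are not taken from the LBANN source, so the real set of supported fields would need to be confirmed there.

```python
# Hypothetical pre-launch check; not part of LBANN.
# ASSUMPTION: the distconv path accepts only these data fields. The set is a
# placeholder motivated by the report above and must be confirmed in the source.
DISTCONV_FIELDS_ASSUMED = frozenset({"samples", "responses"})


def find_unsupported_fields(requested, supported=DISTCONV_FIELDS_ASSUMED):
    """Return requested data fields missing from `supported`, in order, without duplicates."""
    unsupported = []
    for field in requested:
        if field not in supported and field not in unsupported:
            unsupported.append(field)
    return unsupported


# The script in this report creates one input layer per data field.
requested_fields = ["samples", "labels"]
problems = find_unsupported_fields(requested_fields)
if problems:
    print("Data fields not in the assumed supported set:", ", ".join(problems))
```

A check like this only reports the mismatch early; it does not make an unsupported field work.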

Is it possible to run the UNet3D application on LBANN, or am I missing something? If you know, please advise.

This is the main function of my source code, which I modified from the example unet3d. The omitted functions are the same as in the original. Thank you.

if __name__ == '__main__':
    desc = ('Construct and run the 3D U-Net on a 3D segmentation dataset. '
            'Running the experiment is only supported on LC systems.')
    parser = argparse.ArgumentParser(description=desc)
    lbann.contrib.args.add_scheduler_arguments(parser)

    # (parser.add_argument section omitted)

    lbann.contrib.args.add_optimizer_arguments(
        parser,
        default_optimizer="adam",
        default_learning_rate=0.001,
    )

    args = parser.parse_args()
    args.procs_per_node = 4

    parallel_strategy = get_parallel_strategy_args(
        sample_groups=args.mini_batch_size,
        depth_groups=args.depth_groups)

    # Construct layer graph
    volume = lbann.Input(data_field='samples')
    segmentation = lbann.Input(data_field='labels')

    output = UNet3D()(volume)

    ce = lbann.CrossEntropy([output, segmentation])
    layers = list(lbann.traverse_layer_graph([volume, segmentation]))

    obj = lbann.ObjectiveFunction([ce])

    for l in layers:
        l.parallel_strategy = parallel_strategy

    # Setup model
    metrics = [lbann.Metric(ce, name='CE', unit='')]
    callbacks = [lbann.CallbackPrint(),
        lbann.CallbackTimer(),
        lbann.CallbackGPUMemoryUsage(),
        lbann.CallbackProfiler(skip_init=True),
    ]
    # # TODO: Use polynomial learning rate decay (https://github.com/LLNL/lbann/issues/1581)
    # callbacks.append(
    #     lbann.CallbackPolyLearningRate(
    #         power=1.0,
    #         num_epochs=100,
    #         end_lr=1e-5))
    model = lbann.Model(epochs=args.num_epochs,
        layers=layers,
        objective_function=obj,
        callbacks=callbacks,
    )

    # Setup optimizer
    optimizer = lbann.contrib.args.create_optimizer(args)

    # Setup data reader
    data_reader = create_unet3d_data_reader(
        train_dir=args.train_dir,
        test_dir=args.test_dir)

    # Setup trainer
    trainer = lbann.Trainer(mini_batch_size=args.mini_batch_size)

    # Runtime parameters/arguments
    environment = lbann.contrib.args.get_distconv_environment(
        num_io_partitions=args.depth_groups)
    if args.dynamically_reclaim_error_signals:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 0
    else:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 1
    lbann_args = ['--use_data_store']

    # Run experiment
    kwargs = lbann.contrib.args.get_scheduler_kwargs(args)
    lbann.contrib.launcher.run(
        trainer, model, data_reader, optimizer,
        job_name=args.job_name,
        environment=environment,
        lbann_args=lbann_args,
        batch_job=args.batch_job,
        **kwargs)

bvanessen commented 2 years ago

@JBae2 There is a bug in the current UNet3D model: the Python representation of the model has drifted from some of the internal changes that have occurred in LBANN. This issue is currently being worked on in PR #2151, but that work is not yet complete.

benson31 commented 1 year ago

@bvanessen Can this be closed as #2151 is now merged?

LLNL / lbann, issue #2156