Oneflow-Inc / OneFlow-Benchmark

OneFlow models for benchmarking.

Unable to complete model training #141

Closed by fengyuchao97 3 years ago

fengyuchao97 commented 3 years ago

With the code updated today, the following error is reported regardless of which model or dataset I use:

Traceback (most recent call last):
  File "of_cnn_train_val.py", line 74, in <module>
    @flow.global_function("train", get_train_config(args))
  File "/home/qwe/oneflow/OneFlow-Benchmark-master/Classification/cnns/job_function_util.py", line 33, in get_train_config
    train_config = _default_config(args)
  File "/home/qwe/oneflow/OneFlow-Benchmark-master/Classification/cnns/job_function_util.py", line 28, in _default_config
    config.enable_fuse_add_to_output(True)
  File "/home/qwe/.local/lib/python3.5/site-packages/oneflow/python/framework/function_util.py", line 54, in __getattr__
    assert attr_name in name2default
AssertionError

train.sh:

rm -rf core.*
rm -rf ./output/snapshots/*

DATA_ROOT=data/mini-imagenet/ofrecord

# training with mini-imagenet
DATA_ROOT=data/mini-imagenet/ofrecord
python3 of_cnn_train_val.py \
    --train_data_dir=$DATA_ROOT/train \
    --num_examples=50 \
    --train_data_part_num=1 \
    --val_data_dir=$DATA_ROOT/validation \
    --num_val_examples=50 \
    --val_data_part_num=1 \
    --num_nodes=1 \
    --gpu_num_per_node=1 \
    --optimizer="sgd" \
    --momentum=0.875 \
    --learning_rate=0.001 \
    --loss_print_every_n_iter=1 \
    --batch_size_per_device=16 \
    --val_batch_size_per_device=10 \
    --num_epoch=10 \
    --model="resnet50"

job_function_util.py:

import oneflow as flow

def _default_config(args):
    config = flow.function_config()
    config.default_logical_view(flow.scope.consistent_view())
    config.default_data_type(flow.float)
    if args.use_fp16:
        config.enable_auto_mixed_precision(True)
    if args.use_xla:
        config.use_xla_jit(True)

    config.enable_fuse_add_to_output(True)

    return config

def get_train_config(args):
    train_config = _default_config(args)
    train_config.cudnn_conv_heuristic_search_algo(False)

    train_config.prune_parallel_cast_ops(True)
    train_config.enable_inplace(True)
    train_config.enable_fuse_model_update_ops(True)
    return train_config

def get_val_config(args):
    return _default_config(args)
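
The last frame of the traceback (`assert attr_name in name2default` inside `__getattr__`) suggests that the config object resolves option names dynamically and rejects any name it does not know. The following is only a stand-in to illustrate that pattern, not OneFlow's actual code; the class `FakeFunctionConfig` and its internals are hypothetical.

```python
class FakeFunctionConfig:
    """Hypothetical stand-in: a config object that accepts only known option names."""

    def __init__(self, known_options):
        # Maps each supported option name to its default value.
        self._name2default = dict(known_options)
        self.values = {}

    def __getattr__(self, attr_name):
        # __getattr__ runs only when normal attribute lookup fails.
        if attr_name.startswith("_"):
            # Avoid recursion if internal attributes are not set yet.
            raise AttributeError(attr_name)
        # Unknown option names fail here, before the setter is ever called.
        assert attr_name in self._name2default

        def setter(value=True):
            self.values[attr_name] = value

        return setter


# A build that predates the new option knows only "enable_inplace".
old_build = FakeFunctionConfig({"enable_inplace": False})
old_build.enable_inplace(True)

try:
    old_build.enable_fuse_add_to_output(True)
    rejected = False
except AssertionError:
    rejected = True
```

Under this assumption, the error means the installed package does not know the option name at all, rather than that the option was given a bad value.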

chengtbf commented 3 years ago

enable_fuse_add_to_output is already supported on the latest master. Perhaps you are not running the latest master? You can comment out this line:

# config.enable_fuse_add_to_output(True)
fengyuchao97 commented 3 years ago

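> enable_fuse_add_to_output is already supported on the latest master. Perhaps you are not running the latest master? You can comment out this line:
>
> # config.enable_fuse_add_to_output(True)

Thank you. In addition to commenting out enable_fuse_add_to_output, I also commented out train_config.enable_fuse_model_update_ops(True), and that resolves the problem above.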
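
As an alternative to editing the file by hand, the optional flags could be applied through a small guard that skips names the installed build rejects. This is a minimal sketch, assuming unknown options fail with an AssertionError at attribute access as in the traceback above; `set_if_supported` and `_OldBuildConfig` are hypothetical names, and the stub only imitates an older build so the example runs without OneFlow installed.

```python
def set_if_supported(config, option_name, *args):
    """Call config.<option_name>(*args); return False if the name is rejected."""
    try:
        setter = getattr(config, option_name)
    except (AssertionError, AttributeError):
        # The installed build does not know this option, so skip it.
        return False
    setter(*args)
    return True


class _OldBuildConfig:
    """Hypothetical stub imitating a build that knows only enable_inplace."""

    def __init__(self):
        self.applied = []

    def enable_inplace(self, value):
        self.applied.append(("enable_inplace", value))

    def __getattr__(self, name):
        # Reached only for names that normal lookup cannot find.
        raise AssertionError(name)


cfg = _OldBuildConfig()
optional_flags = [
    "enable_fuse_add_to_output",
    "enable_inplace",
    "enable_fuse_model_update_ops",
]
# Collect the flags that this build did not accept.
skipped = [name for name in optional_flags if not set_if_supported(cfg, name, True)]
```

In `job_function_util.py` the same helper would be called with the real config object, for example `set_if_supported(config, "enable_fuse_add_to_output", True)`. Skipped flags mean the corresponding optimizations stay off on that build.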

chengtbf commented 3 years ago

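> Thank you. In addition to commenting out enable_fuse_add_to_output, I also commented out train_config.enable_fuse_model_update_ops(True), and that resolves the problem above.

Right. OneFlow has recently added many optimization options on the latest master, but a new version has not been released yet (it will be released after the National Day holiday on October 1), so there are some minor incompatibilities between versions. Once the new version (0.2.0) is released, these optimization options will be usable, and performance will be better than in previous versions.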

jackalcooper commented 3 years ago

https://oneflow-static.oss-cn-beijing.aliyuncs.com/staging/master/2020.09.27-22.49.25-e210c8cb7/oneflow_cu102-0.2b1-cp35-cp35m-manylinux2014_x86_64.whl

https://oneflow-static.oss-cn-beijing.aliyuncs.com/staging/master/2020.09.27-22.49.25-e210c8cb7/oneflow_cu102-0.2b1-cp36-cp36m-manylinux2014_x86_64.whl

https://oneflow-static.oss-cn-beijing.aliyuncs.com/staging/master/2020.09.27-22.49.25-e210c8cb7/oneflow_cu102-0.2b1-cp37-cp37m-manylinux2014_x86_64.whl

https://oneflow-static.oss-cn-beijing.aliyuncs.com/staging/master/2020.09.27-22.49.25-e210c8cb7/oneflow_cu102-0.2b1-cp38-cp38-manylinux2014_x86_64.whl

jackalcooper commented 3 years ago

Please try these packages, which are built from the latest master.
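
The four wheels above differ only in their CPython tag (cp35 to cp38), so the one to install depends on the interpreter in use; the traceback in the original report shows Python 3.5. A small sketch for picking the matching URL follows. The helper name `wheel_url` is hypothetical, and the URLs are exactly the ones listed above.

```python
import sys

# Common prefix of the four wheel URLs posted above.
BASE = (
    "https://oneflow-static.oss-cn-beijing.aliyuncs.com/staging/master/"
    "2020.09.27-22.49.25-e210c8cb7/"
)
SUPPORTED = [(3, 5), (3, 6), (3, 7), (3, 8)]


def wheel_url(version_info=sys.version_info):
    """Return the wheel URL matching the given (major, minor) Python version."""
    major, minor = version_info[0], version_info[1]
    if (major, minor) not in SUPPORTED:
        raise ValueError("no wheel listed for Python {}.{}".format(major, minor))
    tag = "cp{}{}".format(major, minor)
    # The listed wheels use an "m" ABI suffix up to 3.7 and none for 3.8.
    abi = tag if (major, minor) >= (3, 8) else tag + "m"
    return "{}oneflow_cu102-0.2b1-{}-{}-manylinux2014_x86_64.whl".format(BASE, tag, abi)
```

The returned URL can then be passed to `python3 -m pip install`. The code uses `str.format` rather than f-strings so that it also runs on Python 3.5.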