apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

When using fcn-xs (the example provided by @tornadomeet) for image segmentation, there is an error. Check failed: e == cudaSuccess CUDA: invalid device ordinal. #1051

Closed tybxiaobao closed 8 years ago

tybxiaobao commented 8 years ago

After following the installation guide for fcn-xs (https://github.com/tornadomeet/mxnet/tree/seg/example/fcn-xs), I successfully used the code for training. But when I use the pre-trained model to test image segmentation, the following error is reported: “./dmlc-core/include/dmlc/logging.h:208: [12:04:59] ./mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: invalid device ordinal”. mshadow has been updated to the latest version. The test is run on a PC with one GPU card with 12 GB of memory. Does anyone know what's going on?

tornadomeet commented 8 years ago

Please make sure the GPU id at line https://github.com/tornadomeet/mxnet/blob/seg/example/fcn-xs/image_segmentaion.py#L31 is correct. You can print something before and after that line to see whether it breaks there.
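
One way to do the before/after printing suggested here is a small context manager. This is only a sketch: `trace_step` is a hypothetical helper, not part of MXNet or the fcn-xs example, and the wrapped statement is a placeholder for whichever line you suspect.

```python
import sys
from contextlib import contextmanager


@contextmanager
def trace_step(label, stream=sys.stderr):
    """Write a marker before and after a block so a crash can be localized."""
    stream.write("[trace] entering: %s\n" % label)
    stream.flush()  # flush so the marker is visible even if native code aborts
    yield
    # Reached only if the wrapped block finished without raising.
    stream.write("[trace] finished: %s\n" % label)
    stream.flush()


# Wrap the suspect statement; the assignment below is only a placeholder.
with trace_step("choose context"):
    device_id = 0
```

If only the "entering" marker shows up for a given step, the failure happened inside that step.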

tybxiaobao commented 8 years ago

@tornadomeet It breaks at line https://github.com/tornadomeet/mxnet/blob/seg/example/fcn-xs/image_segmentaion.py#L46. L31 gets gpu(0); since I have only one GPU, gpu(0) is the correct id.

tornadomeet commented 8 years ago

I cannot reproduce the error here. From your description, it breaks when loading the model. Which model do you use: your own trained model or the one I provided?

tybxiaobao commented 8 years ago

@tornadomeet I use the model you provided.

tornadomeet commented 8 years ago

@tybxiaobao The error message points to ./mshadow/mshadow/./tensor_gpu-inl.h:35, and the code at tensor_gpu-inl.h:35 is:

template<>
inline void SetDevice<gpu>(int devid) {
  MSHADOW_CUDA_CALL(cudaSetDevice(devid));
}

It says that your device number does not exist, which is what "invalid device ordinal" means, so please check your code and other settings carefully. One way to test is to use the CPU, but segmentation will then take about one minute.
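
For context, a device ordinal is valid only if it lies between 0 and the number of visible devices minus one, so on a one-GPU machine any id other than 0 is rejected. The sketch below mirrors that rule in plain Python. `check_device_ordinal` and its `device_count` argument are hypothetical names used for illustration; the function does not call CUDA or MXNet.

```python
def check_device_ordinal(devid, device_count):
    """Return devid if it is a valid ordinal for device_count devices, else raise."""
    if not 0 <= devid < device_count:
        raise ValueError(
            "invalid device ordinal: %d (%d device(s) visible)"
            % (devid, device_count))
    return devid


# On a one-GPU machine only ordinal 0 passes the check.
check_device_ordinal(0, device_count=1)
```

Under this rule, a request for device 1 on a one-GPU machine fails in the same way as the reported error, even though gpu(0) itself is fine.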

tybxiaobao commented 8 years ago

@tornadomeet Thanks. It works when I use my own trained model. So maybe the pre-trained model was corrupted when I downloaded it from the link you provided.
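
A damaged download can be ruled in or out by comparing file checksums. The sketch below uses only the Python standard library. `sha256_of` is a hypothetical helper, and it assumes a reference digest is available to compare against; the thread does not publish one, so it would have to be computed on a machine where the file loads correctly.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at end of file, when read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running `sha256_of` on the downloaded .params file on both machines and comparing the two strings shows whether the bytes differ.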

zmonoid commented 8 years ago

Hi, by debugging, I found that the problem comes with this line:

... = mx.model.load_checkpoint(args.prefix, args.epoch)

The reason must be that the pretrained model FCN8s_VGG16-0019.params or VGG_FC_ILSVRC_16_layers-0074.params contains a GPU device number that does not match the client side. Please correct if necessary.
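
If that diagnosis is right, listing the device each saved array lives on would confirm it. The sketch below is an illustration under assumptions: `count_contexts` is a hypothetical helper that accepts any mapping whose values expose a `context` attribute, as MXNet `NDArray` objects do. `FakeArray` and the parameter names are stand-ins so the example runs without MXNet or a GPU. With MXNet, the mapping would come from `mx.nd.load(...)`, which has to succeed first, so the check would need to run on a machine where the saved device ids exist.

```python
from collections import Counter


def count_contexts(params):
    """Count how many arrays in a name -> array mapping sit on each device."""
    return Counter(str(arr.context) for arr in params.values())


class FakeArray(object):
    """Stand-in for an NDArray; only the `context` attribute is modelled."""

    def __init__(self, context):
        self.context = context


# Hypothetical parameter names and devices, for illustration only.
demo = {
    "arg:conv1_weight": FakeArray("gpu(1)"),
    "arg:conv1_bias": FakeArray("gpu(1)"),
    "aux:bn_moving_mean": FakeArray("cpu(0)"),
}
print(dict(count_contexts(demo)))
```

Any device other than the ones present on the client machine in that summary would point to the mismatch described above.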

tornadomeet commented 8 years ago

@zmonoid Thanks for pointing this out. I have updated FCN8s_VGG16-0019.params to use the cpu device, so you can test it.

zmonoid commented 8 years ago

Thanks very much for updating. May I ask whether there is any way to manually change the parameters in the .params file?

tornadomeet commented 8 years ago

@zmonoid As far as I know, there is no direct way to change a .params file. I just load the .params and save it with ctx=cpu using a Python script. I think saving the model with GPU device information is a bug.

zmonoid commented 8 years ago

@tornadomeet Thank you, could you share the script with me? I need to change the GPU device info to gpu(0).

Also, I think it would be more reasonable to remove the GPU device check from the load_checkpoint function, since that check directly causes this error.

tornadomeet commented 8 years ago

Just some simple code like this:

import argparse
import mxnet as mx
import symbol_fcnxs  # symbol definitions from example/fcn-xs

workspace = 1536
ctx = mx.cpu()  # target context: every parameter is re-created on the CPU

def load_checkpoint(prefix, epoch):
    # Load the saved parameter dict, then copy each array to ctx so that
    # the re-saved file no longer refers to a GPU device.
    save_dict = mx.nd.load('%s-%04d.params' % (prefix, epoch))
    arg_params = {}
    aux_params = {}
    for k, v in save_dict.items():
        # Keys look like 'arg:<name>' or 'aux:<name>'.
        tp, name = k.split(':', 1)
        if tp == 'arg':
            arg_params[name] = mx.nd.array(v.asnumpy(), ctx)
        elif tp == 'aux':
            aux_params[name] = mx.nd.array(v.asnumpy(), ctx)
    return (arg_params, aux_params)

def main(args):
    fcn8s = symbol_fcnxs.get_fcn8s_symbol(21, workspace)
    fcn8s_args, fcn8s_auxs = load_checkpoint(args.prefix, args.epoch)
    save_callback = mx.callback.do_checkpoint("FCN8s_VGG16-new")
    # do_checkpoint saves under (iteration + 1), so pass epoch - 1
    # to keep the original epoch number in the new file name.
    save_callback(args.epoch - 1, fcn8s, fcn8s_args, fcn8s_auxs)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Re-save an fcn-xs checkpoint with all parameters on the CPU.')
    # nargs='?' makes the positional arguments optional so the defaults apply.
    parser.add_argument('prefix', nargs='?', default='FCN8s_VGG16',
        help='The prefix (including path) of the model in mxnet format.')
    parser.add_argument('epoch', nargs='?', type=int, default=19,
        help='The epoch number of the checkpoint to load.')
    main(parser.parse_args())