Running "python ./main/train.py" on Windows 10 fails with "AttributeError: Can't pickle local object 'GeneratorEnqueuer.start.<locals>.data_generator_task'"

yuzy007 commented 5 years ago

Newcomer asking for help!

Environment:
OS: Windows 10
Python: 3.6
tf: tf-gpu

The error output is as follows:

(tf-gpu) C:\Users\yuzy0\Downloads\text-detection-ctpn-banjin-dev>python ./main/train.py
C:\ProgramData\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\ops\gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
WARNING:tensorflow:Variable Conv/weights missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable Conv/biases missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable BiLSTM/bidirectional_rnn/fw/lstm_cell/kernel missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable BiLSTM/bidirectional_rnn/fw/lstm_cell/bias missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable BiLSTM/bidirectional_rnn/bw/lstm_cell/kernel missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable BiLSTM/bidirectional_rnn/bw/lstm_cell/bias missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable BiLSTM/weights missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable BiLSTM/biases missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable bbox_pred/weights missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable bbox_pred/biases missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable cls_pred/weights missing in checkpoint data/vgg_16.ckpt
WARNING:tensorflow:Variable cls_pred/biases missing in checkpoint data/vgg_16.ckpt
2019-03-14 00:27:56.149701: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-14 00:27:57.205599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1070 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.2655
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB free
2019-03-14 00:27:57.218265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-14 00:27:57.685618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-14 00:27:57.689994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-14 00:27:57.693261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-14 00:27:57.696260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6553 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
continue training from previous checkpoint 50000
Traceback (most recent call last):
  File "./main/train.py", line 117, in <module>
    tf.app.run()
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "./main/train.py", line 93, in main
    data = next(data_generator)
  File "C:\Users\yuzy0\Downloads\text-detection-ctpn-banjin-dev\utils\dataset\data_provider.py", line 83, in get_batch
    enqueuer.start(max_queue_size=24, workers=num_workers)
  File "C:\Users\yuzy0\Downloads\text-detection-ctpn-banjin-dev\utils\dataset\data_util.py", line 60, in start
    thread.start()
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'GeneratorEnqueuer.start.<locals>.data_generator_task'

(tf-gpu) C:\Users\yuzy0\Downloads\text-detection-ctpn-banjin-dev>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\ProgramData\Anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)

bagpeng commented 5 years ago

Did you solve the problem?

zhangsn828 commented 5 years ago

Did you solve the problem?

yuzy007 commented 5 years ago

It seems that you cannot pass an iterator into the class named "GeneratorEnqueuer()" when it uses the "multiprocessing" module. You can split the class into several separate functions to avoid the problem, but the code then becomes much harder to read. Maybe someone has a better solution to this problem.
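One possible reading of "split the class into several functions" is sketched below. It is not the repository's code: `count_batches`, `start_worker`, and `iter_batches` are hypothetical names, and only `data_generator_task` echoes the name in the traceback. The idea is that on Windows `multiprocessing` starts children with the spawn method and pickles the process target, so the target should be a module-level function, and the generator is built inside the child rather than passed to it.

```python
import multiprocessing


def count_batches(n):
    """Hypothetical stand-in for the real data generator."""
    for i in range(n):
        yield i


def data_generator_task(make_generator, args, out_queue):
    """Module-level worker: picklable by name, unlike a function nested in a method.

    The generator is created here, in the worker, because generator objects
    themselves cannot be pickled either.
    """
    for item in make_generator(*args):
        out_queue.put(item)
    out_queue.put(None)  # sentinel: no more data


def start_worker(make_generator, args, max_queue_size=24):
    """Start one child process that fills a bounded queue."""
    out_queue = multiprocessing.Queue(maxsize=max_queue_size)
    proc = multiprocessing.Process(
        target=data_generator_task,
        args=(make_generator, args, out_queue),
        daemon=True,
    )
    proc.start()
    return proc, out_queue


def iter_batches(out_queue):
    """Yield items from the queue until the sentinel arrives."""
    while True:
        item = out_queue.get()
        if item is None:
            return
        yield item
```

On Windows, `start_worker(count_batches, (3,))` would need to be called from under an `if __name__ == "__main__":` guard, because the spawned child re-imports the main module.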

Sorry for my bad English.

ZhangYi0810 commented 5 years ago

Can you share your new functions for this problem? QQ: 1020290041. Thank you.

litchi99 commented 5 years ago

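I struggled with this on Win10 for a whole day and found that it is a Python multiprocessing problem. Using breakpoint debugging, I traced the error to /utils/dataset/data_util.py (line 53):
thread = multiprocessing.Process(target=data_generator_task)
Printing data_generator_task gives the following result:
<function GeneratorEnqueuer.start.<locals>.data_generator_task at 0x000001C0BE358E18>

My personal guess at the cause: when ./main/train.py runs, it reads the dataset's images and labels into memory and produces a memory address such as 0x000001C0BE358E18. data = next(data_generator) then increments from that initial memory address to read the next batch of training data, but because Python cannot call next(data_generator) from a memory address, train.py cannot run normally.

Suggestion: switch to Linux and it is sorted in 10 minutes.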
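The printed function can be examined with a small sketch that is independent of the repository (`make_local_task`, `module_level_task`, and `can_pickle` are hypothetical names). It suggests that the `<locals>` part of the name, not the memory address, is what matters: a function defined inside another function cannot be pickled, and on Windows `multiprocessing` uses the spawn start method, which pickles the process target. Linux has traditionally defaulted to fork, which does not pickle the target, and that would explain why the same code runs there.

```python
import pickle


def make_local_task():
    # Mirrors the shape of GeneratorEnqueuer.start: the task is defined
    # inside another function, so its qualified name contains "<locals>".
    def data_generator_task():
        return "batch"
    return data_generator_task


def module_level_task():
    return "batch"


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


local_task = make_local_task()
print(local_task.__qualname__)        # make_local_task.<locals>.data_generator_task
print(can_pickle(local_task))         # False
print(module_level_task.__qualname__)  # module_level_task
```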

15727652201 commented 5 years ago

Bro, is there any way to do this without switching to Linux? How can it be solved?

highinsky commented 5 years ago

Switching to single-threaded execution fixes it.
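What this looks like in the repository is not shown in the thread. As a rough sketch, assuming it means feeding the queue from a thread inside the training process instead of from a child process, nothing has to be pickled, so even a nested task function works on Windows. `ThreadedEnqueuer` is a hypothetical stand-in, not the project's `GeneratorEnqueuer`.

```python
import queue
import threading

_DONE = object()  # unique sentinel marking the end of the data


class ThreadedEnqueuer:
    """Hypothetical enqueuer that fills a queue from a background thread."""

    def __init__(self, generator):
        self._generator = generator
        self._queue = None
        self._thread = None

    def start(self, max_queue_size=24):
        self._queue = queue.Queue(maxsize=max_queue_size)

        def data_generator_task():
            # A nested function is fine here: threads share memory with the
            # caller, so the target is never pickled.
            for item in self._generator:
                self._queue.put(item)
            self._queue.put(_DONE)

        self._thread = threading.Thread(target=data_generator_task, daemon=True)
        self._thread.start()

    def get(self):
        """Yield queued items until the generator is exhausted."""
        while True:
            item = self._queue.get()
            if item is _DONE:
                return
            yield item
```

The trade-off is that a thread shares the interpreter with the training loop, so CPU-heavy preprocessing may not overlap with it as well as a separate process would.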

eragonruan / text-detection-ctpn

Running "python ./main/train.py" on Windows 10 fails with "AttributeError: Can't pickle local object 'GeneratorEnqueuer.start.<locals>.data_generator_task'" #314