YangZeyu95 / unofficial-implement-of-openpose

Implement of Openpose use Tensorflow

AssertionError: can only start a process object created by current process #78

Closed: bilalkhann16 closed this issue 2 years ago

bilalkhann16 commented 2 years ago

Hello, I'm trying to train the model on the COCO'17 dataset. When I run train.py, I get this error: AssertionError: can only start a process object created by current process. I'm using an M1 Mac with Miniconda (Anaconda) for the environment. Here is the complete output:

(tf) bilalk@bilals-MacBook unofficial-implement-of-openposet_v2 % python3 train.py 
[2021-12-11 00:10:25,454] [train] [INFO] Namespace(annot_path='/Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/COCO/annotations/', backbone_net_ckpt_path='checkpoints/vgg/vgg_19.ckpt', batch_size=10, checkpoint_path='/Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/checkpoints', continue_training=False, hm_channels=19, img_path='/Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/COCO/images/', input_height=368, input_width=368, loss_func='l2', max_echos=5, paf_channels=38, save_checkpoint_frequency=1000, save_summary_frequency=100, stage_num=6, train_vgg=True, use_bn=False)
[2021-12-11 00:10:25,454] [train] [INFO] checkpoint_path: /Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/checkpoints2021-12-11-0-10-25
[2021-12-11 00:10:25,459] [train] [INFO] initializing data loader...
[2021-12-11 00:10:25,459] [pose_dataset] [INFO] dataflow img_path=/Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/COCO/images/
loading annotations into memory...
Done (t=4.42s)
creating index...
index created!
[2021-12-11 00:10:30,112] [pose_dataset] [INFO] /Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/COCO/annotations/ dataset 118287
[1211 00:10:30 @parallel.py:219] [MultiProcessRunner] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[1211 00:10:30 @parallel.py:219] [MultiProcessRunner] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.

ds <tensorpack.dataflow.parallel.MultiProcessRunner object at 0x28f4fc8b0>

enquerer <DataFlowToQueue(Thread-1, initial daemon)>

Here:  [<tf.Tensor 'fifo_queue_Dequeue:0' shape=(10, 368, 368, 3) dtype=float32>, <tf.Tensor 'fifo_queue_Dequeue:1' shape=(10, 46, 46, 19) dtype=float32>, <tf.Tensor 'fifo_queue_Dequeue:2' shape=(10, 46, 46, 38) dtype=float32>]

[2021-12-11 00:10:30,141] [pose_dataset] [INFO] dataflow img_path=/Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/COCO/images/
loading annotations into memory...
Done (t=0.70s)
creating index...
index created!
[2021-12-11 00:10:30,850] [pose_dataset] [INFO] /Users/bilalk/Desktop/unofficial-implement-of-openposet_v2/COCO/annotations/ dataset 5000
[1211 00:10:30 @parallel.py:219] [MultiProcessRunner] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[2021-12-11 00:10:35,502] [train] [INFO] initializing model...
/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer_v1.py:1694: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
[2021-12-11 00:10:36,942] [train] [INFO] use l2 loss
[2021-12-11 00:10:37,340] [train] [INFO] use l2 loss
[2021-12-11 00:10:37,434] [train] [INFO] use l2 loss
[2021-12-11 00:10:37,436] [train] [INFO] use l2 loss
[2021-12-11 00:10:37,437] [train] [INFO] use l2 loss
[2021-12-11 00:10:37,439] [train] [INFO] use l2 loss
[2021-12-11 00:10:38,516] [train] [INFO] initialize saver...
[2021-12-11 00:10:38,577] [train] [INFO] initialize tensorboard
[2021-12-11 00:10:38,598] [train] [INFO] initialize session...
Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB

2021-12-11 00:10:38.610424: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-12-11 00:10:38.610766: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-12-11 00:10:39.126790: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-12-11 00:10:39.128769: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
[2021-12-11 00:10:41,409] [train] [INFO] restoring vgg weights from checkpoints/vgg/vgg_19.ckpt
2021-12-11 00:10:41.445859: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
[2021-12-11 00:10:43,094] [train] [INFO] start training...
  0%|                                                 | 0/11828 [00:00<?, ?it/s]2021-12-11 00:10:45.997830: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
Process _Worker-8:
Traceback (most recent call last):
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/dataflow/parallel.py", line 190, in run
    self.ds.reset_state()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/dataflow/base.py", line 181, in reset_state
    self.ds.reset_state()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/dataflow/parallel.py", line 238, in reset_state
    start_proc_mask_signal(self.procs)
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/utils/concurrency.py", line 240, in start_proc_mask_signal
    p.start()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/multiprocessing/process.py", line 116, in start
    assert self._parent_pid == os.getpid(), \
AssertionError: can only start a process object created by current process
Process _Worker-9:
Traceback (most recent call last):
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/dataflow/parallel.py", line 190, in run
    self.ds.reset_state()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/dataflow/base.py", line 181, in reset_state
    self.ds.reset_state()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/dataflow/parallel.py", line 238, in reset_state
    start_proc_mask_signal(self.procs)
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/site-packages/tensorpack/utils/concurrency.py", line 240, in start_proc_mask_signal
    p.start()
  File "/Users/bilalk/miniforge3/envs/tf/lib/python3.8/multiprocessing/process.py", line 116, in start
    assert self._parent_pid == os.getpid(), \
AssertionError: can only start a process object created by current process

Seems like something related to threads needs to be edited. Any help? Thanks.
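
For reference, the assertion in the traceback is raised by multiprocessing.Process.start() (process.py, line 116) when a Process object is started from a process other than the one that created it. The sketch below reproduces the same message using only the standard library. It assumes a POSIX system where the "fork" start method is available, it is not the tensorpack code path itself, and the names _noop, try_start_in_child, and reproduce are hypothetical.

```python
import multiprocessing as mp


def _noop():
    """Target for the inner process; it is never actually run here."""


def try_start_in_child(inner, queue):
    """Runs in the outer worker: try to start a Process created by the parent."""
    try:
        inner.start()
        inner.join()
        queue.put("started")
    except AssertionError as exc:
        # Report the assertion message back to the parent.
        queue.put(str(exc))


def reproduce():
    # Assumption: "fork" is available (Linux/macOS). With fork, the child
    # inherits `inner` as-is, including the parent's PID recorded on it.
    ctx = mp.get_context("fork")
    inner = ctx.Process(target=_noop)  # created by the parent process
    queue = ctx.Queue()
    outer = ctx.Process(target=try_start_in_child, args=(inner, queue))
    outer.start()
    message = queue.get(timeout=30)
    outer.join()
    return message


if __name__ == "__main__":
    print(reproduce())
```

Running it prints the same AssertionError text as in the log above. That is consistent with the traceback, where a worker process (_Worker-8) calls p.start() from inside reset_state() on process objects it did not create.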

maketo97 commented 2 years ago

@bilalkhann16 Hi, did you find a solution to this issue?

mengqing-123 commented 2 years ago

I've run into this problem too.

maketo97 commented 2 years ago

@mengqing-123 Have you managed to solve this problem on your side?

@YangZeyu95 @kurzacz Do the maintainers have any suggestions for solving this problem?

bilalkhann16 commented 2 years ago

I encountered this problem on an ARM64 Mac and solved it by changing this line in the pose_dataset.py file: ds = PrefetchData(ds, 100, multiprocessing.cpu_count() * 4) to ds = PrefetchData(ds, 100, multiprocessing.cpu_count() * 1)

// I don't think this error occurs on Ubuntu/Linux @maketo97 @mengqing-123

maketo97 commented 2 years ago

@bilalkhann16 I encountered this problem in a Windows environment. But after I changed the line, the problem still persists 😢 To double-check, is the line that needs to be changed ds = PrefetchData(ds, 100, multiprocessing.cpu_count() //4)?

ZHU883000 commented 2 years ago

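I commented out the statement containing (ds, 100, multiprocessing.cpu_count()-1) inside the if block in pose_dataset.py. I think that statement was written incorrectly, and after commenting it out this error no longer appears. But now there is a new error...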