SACLabs / TransWorldNG

Empower Traffic Simulation via Foundation Model
The Unlicense

raise cannot pickle '_thread.lock' object error when use train_loader #4

Open YSXXXXXXX opened 9 months ago

YSXXXXXXX commented 9 months ago

Describe the bug Hi, when I run the transworld_exp.py file, the following error occurs:

Traceback (most recent call last):
  File "C:\TransWorldNG\transworld\transworld_exp.py", line 252, in <module>
    run(args.scenario,args.train_data, args.training_step, args.pred_step, args.hid_dim, args.n_head, args.n_layer, device)
  File "C:\TransWorldNG\transworld\transworld_exp.py", line 201, in run
    loss_lst = train(timestamps, graph, batch_size, num_workers, encoder, generator, veh_route, loss_fcn, optimizer, logger, device)
  File "C:\TransWorldNG\transworld\transworld_exp.py", line 40, in train
    for i, (cur_graphs, next_graphs) in enumerate(train_loader): 
  File "D:\anaconda\envs\TransWorldNG\lib\site-packages\torch\utils\data\dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "D:\anaconda\envs\TransWorldNG\lib\site-packages\torch\utils\data\dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "D:\anaconda\envs\TransWorldNG\lib\site-packages\torch\utils\data\dataloader.py", line 1077, in __init__
    w.start()
  File "D:\anaconda\envs\TransWorldNG\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "D:\anaconda\envs\TransWorldNG\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\anaconda\envs\TransWorldNG\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "D:\anaconda\envs\TransWorldNG\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\anaconda\envs\TransWorldNG\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_thread.lock' object
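
The error can be reproduced in isolation: on Windows, DataLoader workers are started with the spawn method, which pickles the dataset and collate function, and any object that (directly or transitively) holds a lock fails. A minimal stdlib-only sketch:

```python
import pickle
import queue

# queue.Queue guards its internal buffer with threading locks, so any
# object graph that reaches a Queue cannot be pickled -- which is what
# happens when DataLoader sends the dataset/collate_fn to a spawned
# worker process on Windows.
try:
    pickle.dumps(queue.Queue())
except TypeError as exc:
    print(exc)  # cannot pickle '_thread.lock' object
```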

To Reproduce Run python TransWorldNG\transworld\transworld_exp.py

Expected behavior train_loader should return the sampled current- and next-timestamp graphs at line 40.

Possible solution I noticed that w.start() appears in the Python traceback, so I checked the objects passed in the args parameter (see the following Python statement from torch/utils/data/dataloader.py) and found that self._collate_fn cannot be pickled.

w = multiprocessing_context.Process(
    target=_utils.worker._worker_loop,
    args=(self._dataset_kind, self._dataset, index_queue,
        self._worker_result_queue, self._workers_done_event,
        self._auto_collation, self._collate_fn, self._drop_last,
        self._base_seed, self._worker_init_fn, i, self._num_workers,
        self._persistent_workers, self._shared_seed))

Maybe this error is associated with the class Node, which uses queue.Queue (queue.Queue holds a _thread.lock internally, and that lock cannot be pickled). I think collections.deque could be an alternative replacement. For more information, please see "TypeError: can't pickle _thread.lock objects" and "Python Multiprocessing Pool.map Causes Error in __new__".
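
To illustrate the proposed replacement: unlike queue.Queue, collections.deque carries no locks and pickles cleanly. The Node below is a simplified illustration of the idea, not the actual transworld/game/core/node.py implementation.

```python
import pickle
from collections import deque

class Node:
    """Illustrative sketch: a node whose state buffer is a deque
    instead of a queue.Queue, so instances can be pickled and sent
    to DataLoader worker processes."""
    def __init__(self):
        self.history = deque(maxlen=100)  # no internal locks

    def push(self, item):
        self.history.append(item)

n = Node()
n.push("state_0")
restored = pickle.loads(pickle.dumps(n))  # round-trips cleanly
print(list(restored.history))  # ['state_0']
```

One caveat: deque.append and deque.popleft are atomic in CPython, so a deque can serve simple producer/consumer patterns, but it lacks Queue's blocking get() semantics.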

lovelybirds commented 9 months ago

It seems you're encountering an issue with multiprocessing. The error occurs when you use DataLoader with multiple workers (num_workers > 0), which requires pickling and unpickling data to pass it to the worker processes.

It can sometimes be influenced by the Python version in use. First, check your Python version; we are using Python 3.9. Second, a potential workaround is to set the number of workers to 0 on line 166 of TransWorldNG\transworld\transworld_exp.py, which disables multiprocessing for data loading. After that you may be able to narrow down the cause of the problem.
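
A quick stdlib-only way to check whether the objects DataLoader must ship to workers are picklable before enabling num_workers > 0. This is a sketch; is_picklable is a hypothetical helper, not part of TransWorldNG or PyTorch.

```python
import pickle
import queue

def is_picklable(obj) -> bool:
    """Return True if obj survives pickling, i.e. is safe to pass
    to spawned DataLoader worker processes."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError, AttributeError):
        return False

print(is_picklable([1, 2, 3]))       # True: plain data is fine
print(is_picklable(queue.Queue()))   # False: holds a _thread.lock
```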

YSXXXXXXX commented 9 months ago

Hi lovelybirds, thank you for your reply. I agree with your second suggestion. As for the first one, as you can see in the issue description, my Python version is also 3.9. I think the queue.Queue in transworld/game/core/node.py (line 9) may be unsuitable for multiple workers.

nudtdyk commented 8 months ago

Hi lovelybirds, when I set the number of workers to 0, another problem occurs: (screenshot of the error attached)

lovelybirds commented 8 months ago

Hi nudtdyk,

I noticed the error on line 218 comes from batch//num_workers. My apologies for suggesting 0 workers; please try setting the number of workers to 1 to circumvent the integer division by zero.
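
A defensive alternative would be to guard the division itself, so num_workers = 0 (single-process loading) behaves like one worker. This is a sketch of the idea; batches_per_worker and the max(..., 1) guard are suggestions, not the repository's actual code at line 218.

```python
def batches_per_worker(batch_size: int, num_workers: int) -> int:
    # Treat num_workers == 0 (single-process loading) as one worker,
    # so the integer division never raises ZeroDivisionError.
    return batch_size // max(num_workers, 1)

print(batches_per_worker(32, 4))  # 8
print(batches_per_worker(32, 0))  # 32
```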

We're in the process of creating a Docker environment with the same configuration, which should help prevent such issues in the future.

nudtdyk commented 8 months ago

Hi lovelybirds, thank you for your reply. But if I set the number of workers to 1, I encounter the same problem that YSXXXXXXX reported.