alex-petrenko / sample-factory

High throughput synchronous and asynchronous reinforcement learning
https://samplefactory.dev
MIT License

Fix MacOS test #212

Closed: alex-petrenko closed this 1 year ago

alex-petrenko commented 1 year ago

Occasionally macOS tests fail with this error:

2022-10-26T06:58:50.3302200Z Traceback (most recent call last):
2022-10-26T06:58:50.3411690Z   File "<string>", line 1, in <module>
2022-10-26T06:58:50.3415760Z   File "/usr/local/miniconda/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
2022-10-26T06:58:50.3416500Z     exitcode = _main(fd, parent_sentinel)
2022-10-26T06:58:50.3417390Z   File "/usr/local/miniconda/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
2022-10-26T06:58:50.3418490Z     self = reduction.pickle.load(from_parent)
2022-10-26T06:58:50.3419730Z   File "/usr/local/miniconda/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 314, in rebuild_storage_filename
2022-10-26T06:58:50.3421190Z     storage = torch._UntypedStorage._new_shared_filename_cpu(manager, handle, size)
2022-10-26T06:58:50.3421870Z RuntimeError: Connection refused
2022-10-26T06:58:50.3550240Z [2022-10-26 06:58:50,354][08866] Rollout worker 13 starting...
2022-10-26T06:58:50.3551270Z [2022-10-26 06:58:50,354][08866] ROLLOUT worker 13    pid 8866    parent 3388
2022-10-26T06:58:50.3551730Z [2022-10-26 06:58:50,354][08861] Rollout worker 12 starting...
2022-10-26T06:58:50.3552170Z [2022-10-26 06:58:50,354][08861] ROLLOUT worker 12    pid 8861    parent 3388
2022-10-26T06:58:50.3623220Z [2022-10-26 06:58:50,357][08860] Rollout worker 18 starting...
2022-10-26T06:58:50.3704610Z [2022-10-26 06:58:50,357][08860] ROLLOUT worker 18    pid 8860    parent 3388
2022-10-26T06:58:50.3750890Z [2022-10-26 06:58:50,357][08841] Rollout worker 6 starting...
2022-10-26T06:58:50.3753110Z [2022-10-26 06:58:50,358][08860] On MacOS, not setting affinity
2022-10-26T06:58:50.3753650Z [2022-10-26 06:58:50,358][08841] ROLLOUT worker 6 pid 8841    parent 3388
2022-10-26T06:58:50.3754070Z [2022-10-26 06:58:50,358][08841] On MacOS, not setting affinity
2022-10-26T06:58:50.3754460Z [2022-10-26 06:58:50,360][08866] On MacOS, not setting affinity
2022-10-26T06:58:50.3754840Z [2022-10-26 06:58:50,360][08861] On MacOS, not setting affinity
2022-10-26T06:58:50.3755290Z [2022-10-26 06:58:50,361][03388] Heartbeat connected on RolloutWorker_w18
2022-10-26T06:58:50.3755640Z Exception ignored in: <function EventLoopObject.__del__ at 0x7fb00d546820>
2022-10-26T06:58:50.3755920Z Traceback (most recent call last):
2022-10-26T06:58:50.3756510Z   File "/usr/local/miniconda/lib/python3.9/site-packages/signal_slot/signal_slot.py", line 238, in __del__
2022-10-26T06:58:50.3756820Z     self.detach()
2022-10-26T06:58:50.3757320Z   File "/usr/local/miniconda/lib/python3.9/site-packages/signal_slot/signal_slot.py", line 233, in detach
2022-10-26T06:58:50.3757630Z     if self.event_loop:
2022-10-26T06:58:50.3758020Z AttributeError: 'EventLoopProcess' object has no attribute 'event_loop'
2022-10-26T06:58:50.3758370Z Exception ignored in: <function EventLoopObject.__del__ at 0x7fb00d546820>
2022-10-26T06:58:50.3758650Z Traceback (most recent call last):
2022-10-26T06:58:50.3777770Z   File "/usr/local/miniconda/lib/python3.9/site-packages/signal_slot/signal_slot.py", line 238, in __del__
2022-10-26T06:58:50.3778520Z     self.detach()
2022-10-26T06:58:50.3779790Z   File "/usr/local/miniconda/lib/python3.9/site-packages/signal_slot/signal_slot.py", line 233, in detach
2022-10-26T06:58:50.3780110Z     if self.event_loop:
2022-10-26T06:58:50.3780490Z AttributeError: 'RolloutWorker' object has no attribute 'event_loop'
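
The two "Exception ignored in" tracebacks come from a finalizer that reads self.event_loop on an object that never had that attribute set. One possible reading (an assumption, not confirmed anywhere in this issue) is that they are a side effect of the first error: the object was created in the spawned child, but its state was never restored. The sketch below uses a hypothetical stand-in class, not the real signal_slot code, to show how such a finalizer can tolerate a half-built object:

```python
class Detachable:
    """Hypothetical stand-in for an object that registers itself with an event loop."""

    def __init__(self, event_loop=None):
        # Here the "event loop" is just a set of attached objects (an assumption
        # made to keep the example self-contained).
        self.event_loop = event_loop
        if event_loop is not None:
            event_loop.add(self)

    def detach(self):
        # getattr with a default avoids AttributeError when __init__ never ran,
        # e.g. when the instance was created via __new__ and never populated.
        loop = getattr(self, "event_loop", None)
        if loop is not None:
            loop.discard(self)
            self.event_loop = None

    def __del__(self):
        # Exceptions raised here are reported as "Exception ignored in ...",
        # so the finalizer should be safe to call on a half-built object.
        self.detach()


# An instance whose __init__ was skipped can still be detached safely.
orphan = Detachable.__new__(Detachable)
orphan.detach()
```

This would only silence the secondary noise; the "Connection refused" error in the first traceback is a separate problem.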

Need to investigate.

  1. Search for the first error (RuntimeError: Connection refused) to see whether it is a known issue.
  2. We can also just modify this test so that it starts fewer processes (it currently starts 32!). There is really no need for that many in tests. @wmFrank no rush on this, but it would be great to have it done in the next 1-2 weeks.
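
A minimal sketch of the second option, assuming the test picks its worker count through a small helper. The helper name and the ceiling of 4 are hypothetical; the issue only says that 32 processes is more than the test needs.

```python
import os

# Hypothetical ceiling for tests; not a value taken from the issue.
MAX_TEST_WORKERS = 4


def num_test_workers(requested: int, max_workers: int = MAX_TEST_WORKERS) -> int:
    """Clamp the requested worker count to [1, max_workers] and to the CPU count."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, max_workers, cpus))


# A test that used to ask for 32 workers now gets at most MAX_TEST_WORKERS.
workers = num_test_workers(32)
```

Fewer spawned processes should make the failure less likely on small CI runners, but this does not address the underlying shared-memory error.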

alex-petrenko commented 1 year ago

Fixed