alex-petrenko / sample-factory

High throughput synchronous and asynchronous reinforcement learning
https://samplefactory.dev
MIT License

Sf2 ci fail fix #225

Closed wmFrank closed 1 year ago

wmFrank commented 1 year ago

Automatically retry the tests when they fail.

codecov-commenter commented 1 year ago

Codecov Report

Base: 80.53% // Head: 80.48% // Decreases project coverage by -0.05% :warning:

Coverage data is based on head (ddb982b) compared to base (590c70d). Patch has no changes to coverable lines.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              sf2     #225      +/-   ##
==========================================
- Coverage   80.53%   80.48%   -0.06%
==========================================
  Files          92       92
  Lines        7368     7372       +4
==========================================
- Hits         5934     5933       -1
- Misses       1434     1439       +5
```

| [Impacted Files](https://codecov.io/gh/alex-petrenko/sample-factory/pull/225?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Aleksei+Petrenko) | Coverage Δ | |
|---|---|---|
| [sample\_factory/huggingface/huggingface\_utils.py](https://codecov.io/gh/alex-petrenko/sample-factory/pull/225/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Aleksei+Petrenko#diff-c2FtcGxlX2ZhY3RvcnkvaHVnZ2luZ2ZhY2UvaHVnZ2luZ2ZhY2VfdXRpbHMucHk=) | `16.94% <0.00%> (-1.24%)` | :arrow_down: |
| [sample\_factory/algo/learning/learner.py](https://codecov.io/gh/alex-petrenko/sample-factory/pull/225/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Aleksei+Petrenko#diff-c2FtcGxlX2ZhY3RvcnkvYWxnby9sZWFybmluZy9sZWFybmVyLnB5) | `87.85% <0.00%> (-0.16%)` | :arrow_down: |


alex-petrenko commented 1 year ago

https://pipelines.actions.githubusercontent.com/serviceHosts/7f2d7480-eacb-4e2d-8471-7637fec27dcc/_apis/pipelines/1/runs/1392/signedlogcontent/3?urlExpires=2022-11-13T00%3A24%3A46.0880017Z&urlSigningMethod=HMACV1&urlSignature=iSj4WeMV96aNfMlL02nTdnB3WT6yxuJYFN%2BcN9rxA%2BQ%3D

The tests failed again on the last run. Take a look: the error message is at the end of the log file.

2022-11-12T06:31:03.2140390Z [2022-11-12 06:31:03,210][09733] EvtLoop [learner_proc0_evt_loop, process=learner_proc0] unhandled exception in slot='init' connected to emitter=Emitter(object_id='Runner_EvtLoop', signal_name='start'), args=()
2022-11-12T06:31:03.2141340Z Traceback (most recent call last):
2022-11-12T06:31:03.2142650Z   File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 355, in _process_signal
2022-11-12T06:31:03.2143020Z     slot_callable(*args)
2022-11-12T06:31:03.2143580Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner_worker.py", line 139, in init
2022-11-12T06:31:03.2143930Z     init_model_data = self.learner.init()
2022-11-12T06:31:03.2144740Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner.py", line 214, in init
2022-11-12T06:31:03.2145340Z     self.actor_critic = create_actor_critic(self.cfg, self.env_info.obs_space, self.env_info.action_space)
2022-11-12T06:31:03.2146590Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 296, in create_actor_critic
2022-11-12T06:31:03.2148110Z     return make_actor_critic_func(cfg, obs_space, action_space)
2022-11-12T06:31:03.2148850Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 286, in default_make_actor_critic_func
2022-11-12T06:31:03.2149290Z     return ActorCriticSharedWeights(model_factory, obs_space, action_space, cfg)
2022-11-12T06:31:03.2149880Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 141, in __init__
2022-11-12T06:31:03.2150870Z     self.encoder = model_factory.make_model_encoder_func(cfg, obs_space)
2022-11-12T06:31:03.2151960Z   File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 130, in make_custom_encoder
2022-11-12T06:31:03.2152360Z     return CustomEncoder(cfg, obs_space)
2022-11-12T06:31:03.2152940Z   File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 114, in __init__
2022-11-12T06:31:03.2153960Z     self.conv_head_out_size = calc_num_elements(self.conv_head, obs_shape)
2022-11-12T06:31:03.2155170Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/utils/torch_utils.py", line 39, in calc_num_elements
2022-11-12T06:31:03.2155650Z     num_elements = module(some_input).numel()
2022-11-12T06:31:03.2156420Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
2022-11-12T06:31:03.2157060Z     return forward_call(*input, **kwargs)
2022-11-12T06:31:03.2157780Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
2022-11-12T06:31:03.2158340Z     input = module(input)
2022-11-12T06:31:03.2159160Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
2022-11-12T06:31:03.2160390Z     return forward_call(*input, **kwargs)
2022-11-12T06:31:03.2161080Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
2022-11-12T06:31:03.2161690Z     return self._conv_forward(input, self.weight, self.bias)
2022-11-12T06:31:03.2162270Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
2022-11-12T06:31:03.2162610Z     return F.conv2d(input, weight, bias, self.stride,
2022-11-12T06:31:03.2162950Z RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 27]
2022-11-12T06:31:03.2163630Z [2022-11-12 06:31:03,213][09733] Unhandled exception Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 27] in evt loop learner_proc0_evt_loop

Looks like something is wrong with the observation shape: instead of an image, the convolutional encoder gets a vector? This is a legitimate error, and we should fix it properly instead of retrying the test. So far I don't understand why it happens only infrequently.
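
To make this kind of mismatch fail with a clearer message, a shape check could run before the convolutional head is built. The following is only a minimal sketch: `check_conv_obs_shape` is a hypothetical helper, not part of sample-factory, and the `(1, 10, 10)` image shape is an assumption for illustration. The two rejected shapes are the ones that appear in the logs, without the batch dimension.

```python
from typing import Sequence


def check_conv_obs_shape(obs_shape: Sequence[int], expected_channels: int) -> None:
    """Raise ValueError unless obs_shape looks like a (C, H, W) image.

    Hypothetical helper for illustration; not part of sample-factory.
    """
    shape = tuple(obs_shape)
    if len(shape) != 3:
        raise ValueError(
            f"conv encoder expects an image observation (C, H, W), got shape {shape}"
        )
    if shape[0] != expected_channels:
        raise ValueError(
            f"conv encoder expects {expected_channels} channel(s), "
            f"got {shape[0]} in shape {shape}"
        )


# Assumed image shape with a single channel: passes silently.
check_conv_obs_shape((1, 10, 10), expected_channels=1)

# Shapes taken from the tracebacks: both are rejected with a readable message.
for bad_shape in [(27,), (4, 84, 84)]:
    try:
        check_conv_obs_shape(bad_shape, expected_channels=1)
    except ValueError as exc:
        print(f"rejected {bad_shape}: {exc}")
```

A check like this would not explain why the wrong observation space reaches the encoder, but it would name the offending shape directly instead of failing deep inside `conv2d`.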

wmFrank commented 1 year ago

I cannot open the link; it seems to have expired. The failed test is test_example_sampler.py.

  1. I ran that test 100 times on my own MacBook, and it passed. I also ran it 100 times on GitHub Actions, where it also passed; see https://github.com/wmFrank/sample-factory/actions/runs/3454065762
  2. I am currently trying to reproduce the error.
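
The repeated runs described above can be sketched as a small loop that executes one command many times and records which runs failed. This is an assumed reproduction script, not the one actually used in this thread; `run_repeatedly` is a hypothetical name, and the test command is passed in by the caller.

```python
import subprocess
import sys
from typing import List, Sequence


def run_repeatedly(cmd: Sequence[str], runs: int) -> List[int]:
    """Run cmd `runs` times and return the 1-based indices of the failed runs."""
    failed = []
    for i in range(1, runs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(i)
            # Keep the tail of the output so an infrequent failure is not lost.
            print(f"run {i} failed:\n{result.stdout[-2000:]}{result.stderr[-2000:]}")
    return failed


# Example call (the test path is supplied by the caller):
# run_repeatedly([sys.executable, "-m", "pytest", "-x", "<path to test file>"], runs=100)
```

Recording the index and output of each failed run makes it easier to compare an infrequent failure against the passing runs around it.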

wmFrank commented 1 year ago

These are the tests that have failed in different hanging test runs:

tests/examples/test_example_multi.py
tests/algo/test_pbt.py
tests/envs/atari/test_atari.py
tests/envs/mujoco/test_mujoco.py

These are the types of errors that have occurred in different hanging test runs:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/miniconda/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/miniconda/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 322, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: Connection refused

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 238, in __del__
    self.detach()
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 233, in detach
    if self.event_loop:
AttributeError: 'EventLoopProcess' object has no attribute 'event_loop'
[W NNPACK.cpp:53] Could not initialize NNPACK! Reason: Unsupported hardware.

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 355, in _process_signal
    slot_callable(*args)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner_worker.py", line 139, in init
    init_model_data = self.learner.init()
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner.py", line 214, in init
    self.actor_critic = create_actor_critic(self.cfg, self.env_info.obs_space, self.env_info.action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 296, in create_actor_critic
    return make_actor_critic_func(cfg, obs_space, action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 286, in default_make_actor_critic_func
    return ActorCriticSharedWeights(model_factory, obs_space, action_space, cfg)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 141, in __init__
    self.encoder = model_factory.make_model_encoder_func(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 130, in make_custom_encoder
    return CustomEncoder(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 114, in __init__
    self.conv_head_out_size = calc_num_elements(self.conv_head, obs_shape)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/utils/torch_utils.py", line 39, in calc_num_elements
    num_elements = module(some_input).numel()
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [8, 1, 3, 3], expected input[1, 4, 84, 84] to have 1 channels, but got 4 channels instead

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 355, in _process_signal
    slot_callable(*args)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner_worker.py", line 139, in init
    init_model_data = self.learner.init()
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner.py", line 214, in init
    self.actor_critic = create_actor_critic(self.cfg, self.env_info.obs_space, self.env_info.action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 296, in create_actor_critic
    return make_actor_critic_func(cfg, obs_space, action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 286, in default_make_actor_critic_func
    return ActorCriticSharedWeights(model_factory, obs_space, action_space, cfg)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 141, in __init__
    self.encoder = model_factory.make_model_encoder_func(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 130, in make_custom_encoder
    return CustomEncoder(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 114, in __init__
    self.conv_head_out_size = calc_num_elements(self.conv_head, obs_shape)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/utils/torch_utils.py", line 39, in calc_num_elements
    num_elements = module(some_input).numel()
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 27]

I ran test_example_multi more than 100 times on GitHub Actions, and this error showed up:

Components take too long to start ... Aborting the experiment.