facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License
3.26k stars, 334 forks

[WIP] Fixes for release #437

Closed iseessel closed 3 years ago

iseessel commented 3 years ago
  1. Add the appropriate PyTorch/CUDA versions for building apex in conda_apex and conda_vissl.
  2. Separate out integration_tests.sh, as this was repeating unit tests in the apex builds.
  3. Make #in_temporary_directory exception-safe. When a test using it failed, every subsequent test would fail with:
======================================================================
ERROR: test_restart_after_preemption_at_epoch (test_state_checkpointing.TestStateCheckpointing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/utils/test_utils.py", line 86, in wrapped_test
    return test_function(*args, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/test_tmp/tests/test_state_checkpointing.py", line 80, in test_restart_after_preemption_at_epoch
    with in_temporary_directory():
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/contextlib.py", line 81, in _enter_
    return next(self.gen)
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/utils/test_utils.py", line 29, in in_temporary_directory
    old_cwd = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory
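
For illustration, an exception-safe version of such a context manager might look like the sketch below. The body is an assumption, not the actual code in vissl/utils/test_utils.py. It restores the working directory in a `finally` block, so a failing test cannot leave the process sitting in a temporary directory that has already been deleted, which is the situation in which a later `os.getcwd()` raises FileNotFoundError.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def in_temporary_directory():
    """Sketch only: run the enclosed block inside a fresh temporary directory."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as temp_dir:
        os.chdir(temp_dir)
        try:
            yield temp_dir
        finally:
            # Runs even when the enclosed block raises, so the process
            # leaves the temporary directory before it is deleted.
            os.chdir(old_cwd)
```

With this shape, an exception raised inside the `with` block still propagates to the test runner, but the next test starts from the original working directory.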
  4. Destroy the process group after each test in test_tasks.py. After building, conda runs the unit tests for vissl, and the process group created by the initial test is reused by every test after it. Because the GPU tests run first, the group is initialized with the nccl backend and stays on it for the rest of the run. One of the tests requires the gloo backend, since it calls all_gather on CPU tensors. We don't hit this problem on CircleCI because the tests are split out there. The specific error is:
ERROR: test_run_0_config_test_cpu_test_test_cpu_regnet_moco_yaml (test_tasks.TaskTest)
Instantiate and run all the test tasks [with config_file_path='config=test/cpu_test/test_cpu_regnet_moco.yaml']
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/parameterized/parameterized.py", line 533, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/test_tmp/tests/test_tasks.py", line 50, in test_run
    hook_generator=default_hook_generator,
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/trainer/train_steps/standard_train_step.py", line 158, in standard_train_step
    local_loss = task.loss(model_output, target)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/losses/moco_loss.py", line 152, in forward
    self._dequeue_and_enqueue(self.key)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/losses/moco_loss.py", line 89, in _dequeue_and_enqueue
    keys = concat_all_gather(key)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/utils/misc.py", line 230, in concat_all_gather
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1863, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense

----------------------------------------------------------------------
Ran 1455 tests in 2664.959s

FAILED (errors=1)
Tests failed for vissl-0.1.5-py36.tar.bz2 - moving package to /private/home/iseessel/conda-bld/broken
WARNING:conda_build.build:Tests failed for vissl-0.1.5-py36.tar.bz2 - moving package to /private/home/iseessel/conda-bld/broken
WARNING conda_build.build:tests_failed(2955): Tests failed for vissl-0.1.5-py36.tar.bz2 - moving package to /private/home/iseessel/conda-bld/broken
TESTS FAILED: vissl-0.1.5-py36.tar.bz2
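
The teardown described above can be sketched as follows. This is a minimal, hypothetical test case (`ProcessGroupCleanupTest` is not part of VISSL). It creates a single-process gloo group in `setUp` only to keep the example self-contained, whereas in VISSL the group is created inside the training entry point. The relevant part is `tearDown`, which calls `torch.distributed.destroy_process_group()` so that the next test can initialize whichever backend it needs.

```python
import os
import tempfile
import unittest

import torch
import torch.distributed as dist


class ProcessGroupCleanupTest(unittest.TestCase):
    """Hypothetical test case, not the actual tests/test_tasks.py."""

    def setUp(self):
        # Assumption for this sketch: a single-process gloo group that
        # rendezvouses through a fresh file in a temporary directory.
        self._tmp_dir = tempfile.TemporaryDirectory()
        init_file = os.path.join(self._tmp_dir.name, "rendezvous")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{init_file}",
            rank=0,
            world_size=1,
        )

    def tearDown(self):
        # Destroy the group so a later test is free to pick another backend.
        if dist.is_initialized():
            dist.destroy_process_group()
        self._tmp_dir.cleanup()

    def test_all_gather_on_cpu_tensor(self):
        # all_gather on CPU tensors works with gloo; with nccl it would
        # fail because nccl expects CUDA tensors.
        tensor = torch.ones(2)
        gathered = [torch.zeros(2) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)
        self.assertTrue(torch.equal(gathered[0], tensor))
```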
  5. Use a specific commit of fairscale, as per the CircleCI documentation.

facebook-github-bot commented 3 years ago

@iseessel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

prigoyal commented 3 years ago

looks good to me! let's wait for the tests to pass and we can merge it. Thank you so much!

facebook-github-bot commented 3 years ago

@iseessel has updated the pull request. You must reimport the pull request before landing.

iseessel commented 3 years ago

Closing in favor of smaller separate PRs.