facebookresearch / fairscale

PyTorch extensions for high performance and large scale training.

Unit tests tests/experimental/nn/test_offload.py::test_correctness failed in the main branch #900

Open tmarkstrum opened 2 years ago

tmarkstrum commented 2 years ago

This is on the main branch.

~/fairscale$ pytest tests/experimental/nn/test_offload.py
================================================================================ test session starts =================================================================================
platform linux -- Python 3.8.8, pytest-5.4.1, py-1.10.0, pluggy-0.13.1 -- /private/home/tmarkstrum/.conda/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /private/home/tmarkstrum/fairscale, inifile: setup.cfg
plugins: mock-3.6.0, httpx-0.10.0, asyncio-0.14.0, hydra-core-1.0.6, timeout-1.4.2, cov-2.10.0
collected 17 items

tests/experimental/nn/test_offload.py::test_single_run PASSED [  5%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-True-True] FAILED [ 11%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-True-False] FAILED [ 17%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-False-True] FAILED [ 23%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-False-False] FAILED [ 29%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-True-True] FAILED [ 35%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-True-False] FAILED [ 41%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-False-True] SKIPPED [ 47%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-False-False] SKIPPED [ 52%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-True-True] FAILED [ 58%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-True-False] FAILED [ 64%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-False-True] FAILED [ 70%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-False-False] FAILED [ 76%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-True-True] FAILED [ 82%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-True-False] PASSED [ 88%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-False-True] SKIPPED [ 94%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-False-False] SKIPPED [100%]
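For context, the 16 `test_correctness` results above come from a 2x2x2x2 `@pytest.mark.parametrize` grid, and the 4 SKIPPED entries are the combinations with `num_microbatches > 1` but `checkpoint_activation=False`, which the test body skips. A quick torch-free sketch of that grid (parameter order chosen to mirror the test ids above):

```python
from itertools import product

# Enumerate the same grid that the stacked @pytest.mark.parametrize
# decorators build for test_correctness.
skipped, run = [], []
for use_fp16, num_microbatches, checkpoint_activation, use_auto_shard in product(
    [True, False], [1, 5], [True, False], [True, False]
):
    params = (use_fp16, num_microbatches, checkpoint_activation, use_auto_shard)
    # Mirrors the skip in the test body: microbatches are only supported
    # with activation offloading.
    if not checkpoint_activation and num_microbatches > 1:
        skipped.append(params)
    else:
        run.append(params)

print(len(run), len(skipped))  # 12 combinations run, 4 are skipped
```

Together with `test_single_run`, that accounts for the 17 collected items and the 4 SKIPPED results.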

====================================================================================== FAILURES ======================================================================================
_________________________________________________________________________ test_correctness[True-1-True-True] _________________________________________________________________________

use_fp16 = True, checkpoint_activation = True, num_microbatches = 1, use_auto_shard = True

@skip_if_no_cuda
@pytest.mark.parametrize("use_fp16", [True, False])
@pytest.mark.parametrize("checkpoint_activation", [True, False])
@pytest.mark.parametrize("num_microbatches", [1, 5])
@pytest.mark.parametrize("use_auto_shard", [True, False])
def test_correctness(use_fp16, checkpoint_activation, num_microbatches, use_auto_shard):
    if use_auto_shard and torch_version() < (1, 8, 0):
        pytest.skip("auto_shard requires torch version >= 1.8.0")

    if (use_fp16 or checkpoint_activation) and not hasattr(torch.cuda.amp, "custom_fwd"):
        pytest.skip(f"AMP APIs are not supported in torch version {torch.__version__}")

    if not checkpoint_activation and num_microbatches > 1:
        pytest.skip("We only support microbatches with activation offloading.")

    device, offload_device = _init()
    model = _get_model()
    if use_auto_shard:
        offload_model = shard_model(model)
    else:
        offload_model = model

    rmodel, ropt, rloss = _train_reg_model(model, device, offload_device)
    omodel, oopt, oloss = _train_offload_model(
        offload_model,
        device,
        offload_device,
        use_fp16=use_fp16,
        checkpoint_activation=checkpoint_activation,
        num_microbatches=num_microbatches,
    )
>       _check_parity(rmodel.cpu(), omodel.cpu(), ropt, oopt, rloss, oloss)

tests/experimental/nn/test_offload.py:169:


rmodel = Sequential(
  (0): Linear(in_features=2, out_features=20, bias=True)
  (1): Linear(in_features=20, out_features=20, bi...
  (10): Linear(in_features=20, out_features=20, bias=True)
  (11): Linear(in_features=20, out_features=2, bias=True)
)
omodel = OffloadModel(
  (_model): Sequential(
    (0): ModelShard(
      (model_shard): GraphModule(
        (0): Linear(in_fe...(
      (model_shard): GraphModule(
        (11): Linear(in_features=20, out_features=2, bias=True)
      )
    )
  )
)
ropt = SGD (
  Parameter Group 0
    dampening: 0
    lr: 0.001
    momentum: 0
    nesterov: False
    weight_decay: 0
)
oopt = SGD (
  Parameter Group 0
    dampening: 0
    lr: 0.001
    momentum: 0
    nesterov: False
    weight_decay: 0
)
rloss = tensor(75.6853, device='cuda:0', grad_fn=)
oloss = tensor(63.9051, device='cuda:0', grad_fn=)

def _check_parity(rmodel, omodel, ropt, oopt, rloss, oloss):

    for oparams, rparams in zip(omodel.parameters(), rmodel.parameters()):
>       assert torch.allclose(oparams, rparams, atol=1e-2), f"Model params are different {oparams} {rparams}"

E       AssertionError: Model params are different Parameter containing:
E       tensor([-2.4198e+02,  5.9091e+01, -1.2879e-01,  1.7043e-01, -4.2168e-02,
E               -1.5999e-01,  3.0678e+03, -8.5007e-01, -1.7349e-01,  1.0862e-01,
E                3.2033e-02,  1.2150e-01, -1.6802e-01,  1.9578e-01, -2.2282e-01,
E               -1.6971e-01,  1.3411e-01,  7.4107e-03, -1.2552e-01, -4.6827e-02],
E              requires_grad=True) Parameter containing:
E       tensor([ 0.0089,  0.1420, -0.1264,  0.1199, -0.0759, -0.1768, -0.2102,  0.1084,
E              -0.1780,  0.1071,  0.0278,  0.1207, -0.1773,  0.1858, -0.2232, -0.1712,
E               0.1249,  0.0064, -0.1314, -0.0405], requires_grad=True)
E       assert False
E        +  where False = <built-in method allclose of type object at 0x7f60f2d77ec0>(Parameter containing:\ntensor([-2.4198e+02, 5.9091e+01, -1.2879e-01, 1.7043e-01, -4.2168e-02,\n -1.5999e-01, 3...e-01, -2.2282e-01,\n -1.6971e-01, 1.3411e-01, 7.4107e-03, -1.2552e-01, -4.6827e-02],\n requires_grad=True), Parameter containing:\ntensor([ 0.0089, 0.1420, -0.1264, 0.1199, -0.0759, -0.1768, -0.2102, 0.1084,\n -0.1780,... 0.0278, 0.1207, -0.1773, 0.1858, -0.2232, -0.1712,\n 0.1249, 0.0064, -0.1314, -0.0405], requires_grad=True), atol=0.01)
E        +    where <built-in method allclose of type object at 0x7f60f2d77ec0> = torch.allclose

tests/experimental/nn/test_offload.py:82: AssertionError
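For reference, `torch.allclose` is documented to use the elementwise criterion |a - b| <= atol + rtol * |b| (with rtol defaulting to 1e-05). A torch-free sketch of that check, using the first elements of the two parameter tensors from the failure above, shows how far the offloaded model has diverged from the reference:

```python
def close(a, b, rtol=1e-05, atol=1e-08):
    # Elementwise criterion documented for torch.allclose / numpy.isclose:
    # |a - b| <= atol + rtol * |b|
    return abs(a - b) <= atol + rtol * abs(b)

# Offloaded model parameter -2.4198e+02 vs. reference parameter 0.0089.
print(close(-241.98, 0.0089, atol=1e-2))  # False: |diff| ~ 242, far beyond atol
print(close(0.0089, 0.0089, atol=1e-2))   # True: identical values pass
```

A divergence this large (hundreds vs. hundredths) points at a genuinely different optimization trajectory rather than numerical noise around the `atol=1e-2` tolerance.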

min-xu-ai commented 2 years ago

Does this fail from time to time or consistently for you @tmarkstrum? It seems that CI is green for the main branch though.

anj-s commented 2 years ago

@tmarkstrum can you confirm that these tests run on CircleCI? Does it fail for you locally, or do you only see the error on CircleCI after your change?

tmarkstrum commented 2 years ago

I saw these tests fail in CircleCI. They also failed locally. I can retry today to confirm.

anj-s commented 2 years ago

I think the main thing to identify is why main is green. That will tell us how to approach debugging. If the test fails only because of your change, then we can repro it. However, if it was failing before your change, then we need to do a bisect.

anj-s commented 2 years ago

@tmarkstrum Following up on whether you saw the main branch as green. I think we've disabled the test now, but from what I recall, the branch was green before your change.