Open tmarkstrum opened 2 years ago
Does this fail from time to time or consistently for you @tmarkstrum? It seems that CI is green for the main branch though.
@tmarkstrum can you confirm that these tests run on CircleCI? Does it fail for you locally or do you only see the error post your change on CircleCI?
I saw these tests fail in CircleCI. They also failed locally. I can retry today to confirm.
I think the main thing to identify is why main is green. That will tell us how to approach debugging. If the test fails only because of your change, then we can repro it. However, if it was failing well before your change, then we need to do a bisect.
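For reference, the bisect mentioned above could be sketched roughly like this. This is only an illustration, not something that exists in the repo: `fails_at` and `first_bad_commit` are hypothetical helper names, and the sketch assumes the commit list is ordered oldest to newest, with the first commit known good and the last known bad.

```python
import subprocess
from typing import Callable, Sequence

# Test file taken from the failing run in this issue.
TEST_PATH = "tests/experimental/nn/test_offload.py"


def fails_at(commit: str) -> bool:
    """Hypothetical helper: check out `commit` and report whether the tests fail there."""
    subprocess.run(["git", "checkout", "--quiet", commit], check=True)
    result = subprocess.run(["pytest", "-x", "-q", TEST_PATH])
    return result.returncode != 0


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first failing commit.

    Assumes `commits` is ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

Calling `first_bad_commit(commits, fails_at)` would run the test file at roughly log2(N) commits. `git bisect run` automates the same search directly from the command line.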
@tmarkstrum Following up on whether you saw that the main branch was green. I think we've disabled the test now, but from what I recall, the branch was green before your change.
This is on the main branch.
~/fairscale$ pytest tests/experimental/nn/test_offload.py
======================== test session starts ========================
platform linux -- Python 3.8.8, pytest-5.4.1, py-1.10.0, pluggy-0.13.1 -- /private/home/tmarkstrum/.conda/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /private/home/tmarkstrum/fairscale, inifile: setup.cfg
plugins: mock-3.6.0, httpx-0.10.0, asyncio-0.14.0, hydra-core-1.0.6, timeout-1.4.2, cov-2.10.0
collected 17 items

tests/experimental/nn/test_offload.py::test_single_run PASSED [ 5%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-True-True] FAILED [ 11%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-True-False] FAILED [ 17%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-False-True] FAILED [ 23%]
tests/experimental/nn/test_offload.py::test_correctness[True-1-False-False] FAILED [ 29%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-True-True] FAILED [ 35%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-True-False] FAILED [ 41%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-False-True] SKIPPED [ 47%]
tests/experimental/nn/test_offload.py::test_correctness[True-5-False-False] SKIPPED [ 52%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-True-True] FAILED [ 58%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-True-False] FAILED [ 64%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-False-True] FAILED [ 70%]
tests/experimental/nn/test_offload.py::test_correctness[False-1-False-False] FAILED [ 76%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-True-True] FAILED [ 82%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-True-False] PASSED [ 88%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-False-True] SKIPPED [ 94%]
tests/experimental/nn/test_offload.py::test_correctness[False-5-False-False] SKIPPED [100%]
============================= FAILURES ==============================
_____ test_correctness[True-1-True-True] _____
use_fp16 = True, checkpoint_activation = True, num_microbatches = 1, use_auto_shard = True
tests/experimental/nn/test_offload.py:169:
rmodel = Sequential( (0): Linear(in_features=2, out_features=20, bias=True) (1): Linear(in_features=20, out_features=20, bi... (10): Linear(in_features=20, out_features=20, bias=True) (11): Linear(in_features=20, out_features=2, bias=True) )
omodel = OffloadModel( (_model): Sequential( (0): ModelShard( (model_shard): GraphModule( (0): Linear(in_fe...( (model_shard): GraphModule( (11): Linear(in_features=20, out_features=2, bias=True) ) ) ) )
ropt = SGD ( Parameter Group 0 dampening: 0 lr: 0.001 momentum: 0 nesterov: False weight_decay: 0 )
oopt = SGD ( Parameter Group 0 dampening: 0 lr: 0.001 momentum: 0 nesterov: False weight_decay: 0 )
rloss = tensor(75.6853, device='cuda:0', grad_fn=), oloss = tensor(63.9051, device='cuda:0', grad_fn=)
tests/experimental/nn/test_offload.py:82: AssertionError