Lightning-Universe / lightning-bolts

Toolbox of models, callbacks, and datasets for AI/ML researchers.
https://lightning-bolts.readthedocs.io
Apache License 2.0

Fix all Windows tests #526

Closed Borda closed 2 years ago

Borda commented 3 years ago

πŸ› Bug

In #522 we uncovered a CI issue (on the GHA side rather than ours): the Windows tests were marked as passing even though they were failing almost all the time. These tests should be fixed, or skipped per test with a TODO.
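A per-test skip with a TODO could look roughly like the sketch below. It uses pytest's `skipif` marker; the marker name `skip_on_windows`, the reason text, and the test function are hypothetical and only illustrate the pattern.

```python
import sys

import pytest

# Hypothetical reusable marker: skips a test only when running on Windows.
skip_on_windows = pytest.mark.skipif(
    sys.platform == "win32",
    reason="TODO: fails on Windows CI, see #526",
)


@skip_on_windows
def test_example_model():
    # Placeholder body; a real test would build and run a model here.
    assert 1 + 1 == 2
```

Keeping the reason string next to each skipped test makes the TODOs easy to find later, for example by searching the pytest skip report for the issue number.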

To Reproduce

https://github.com/PyTorchLightning/pytorch-lightning-bolts/runs/1718502565

=========================== short test summary info ===========================
FAILED tests/models/test_autoencoders.py::test_vae[cifar10] - BrokenPipeError...
FAILED tests/models/test_autoencoders.py::test_ae[cifar10] - BrokenPipeError:...
FAILED tests/models/test_autoencoders.py::test_encoder - RuntimeError: [enfor...
FAILED tests/models/test_autoencoders.py::test_decoder - RuntimeError: [enfor...
FAILED tests/models/test_autoencoders.py::test_from_pretrained - MemoryError
FAILED tests/models/test_classic_ml.py::test_logistic_regression_model - Runt...
FAILED tests/models/test_detection.py::test_fasterrcnn - RuntimeError: [enfor...
FAILED tests/models/test_mnist_templates.py::test_mnist - RuntimeError: [enfo...
FAILED tests/models/test_scripts.py::test_cli_run_basic_gan[ --dataset %(dataset_name)s --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --batch_size 2 --limit_train_batches 2 --limit_val_batches 2-mnist]
FAILED tests/models/test_scripts.py::test_cli_run_basic_gan[ --dataset %(dataset_name)s --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --batch_size 2 --limit_train_batches 2 --limit_val_batches 2-cifar10]
FAILED tests/models/test_scripts.py::test_cli_run_mnist[--data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --max_steps 2]
FAILED tests/models/test_scripts.py::test_cli_run_basic_vae[ --dataset cifar10 --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --batch_size 2 --fast_dev_run 1 --num_workers 0]
FAILED tests/models/test_scripts.py::test_cli_run_lin_regression[--max_epochs 1 --max_steps 2]
FAILED tests/models/test_scripts.py::test_cli_run_log_regression[--max_epochs 1 --max_steps 2]
FAILED tests/models/test_vision.py::test_igpt - BrokenPipeError: [Errno 32] B...
FAILED tests/models/test_vision.py::test_semantic_segmentation - RuntimeError...
FAILED tests/models/rl/test_scripts.py::test_cli_run_rl_dqn[--env PongNoFrameskip-v4 --max_steps 10 --fast_dev_run 1 --warm_start_size 10 --n_steps 2 --batch_size 10]
FAILED tests/models/rl/integration/test_value_models.py::TestValueModels::test_double_dqn
FAILED tests/models/rl/integration/test_value_models.py::TestValueModels::test_dqn
FAILED tests/models/rl/integration/test_value_models.py::TestValueModels::test_dueling_dqn
FAILED tests/models/rl/integration/test_value_models.py::TestValueModels::test_noisy_dqn
FAILED tests/models/rl/integration/test_value_models.py::TestValueModels::test_per_dqn
FAILED tests/models/rl/unit/test_memory.py::TestBuffer::test_sample_batch - n...
FAILED tests/models/self_supervised/test_models.py::test_amdim - RuntimeError...
FAILED tests/models/self_supervised/test_models.py::test_moco - RuntimeError:...
FAILED tests/models/self_supervised/test_models.py::test_simclr - numpy.core....
FAILED tests/models/self_supervised/test_models.py::test_swav - MemoryError
FAILED tests/models/self_supervised/test_models.py::test_simsiam - RuntimeErr...
FAILED tests/models/self_supervised/test_resnets.py::test_cpc_resnet - Runtim...
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[resnet18]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[resnet34]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[resnet50]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[resnet101]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[resnet152]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[resnext50_32x4d]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[resnext101_32x8d]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[wide_resnet50_2]
FAILED tests/models/self_supervised/test_resnets.py::test_torchvision_resnets[wide_resnet101_2]
FAILED tests/models/self_supervised/test_resnets.py::test_amdim_encoder[32]
FAILED tests/models/self_supervised/test_resnets.py::test_amdim_encoder[64]
FAILED tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_amdim[--data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --max_steps 3 --fast_dev_run 1 --batch_size 2 --num_workers 0]
FAILED tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_moco[ --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --max_steps 3 --fast_dev_run 1 --batch_size 2 --num_workers 0]
FAILED tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_simclr[ --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --max_steps 3 --fast_dev_run 1 --batch_size 2 --num_workers 0 --online_ft --gpus 0 --fp32]
FAILED tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_byol[ --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --max_steps 3 --fast_dev_run 1 --batch_size 2 --num_workers 0 --online_ft]
FAILED tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_swav[ --dataset cifar10 --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --max_steps 3 --fast_dev_run 1 --batch_size 2 --arch resnet18 --hidden_mlp 512 --sinkhorn_iterations 1 --nmb_prototypes 2 --num_workers 0 --queue_length 0 --gpus 0 --fp32]
FAILED tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_simsiam[ --dataset cifar10 --data_dir D:\\a\\pytorch-lightning-bolts\\pytorch-lightning-bolts\\datasets --max_epochs 1 --max_steps 3 --fast_dev_run 1 --batch_size 2 --num_workers 0 --online_ft --gpus 0 --fp32]
= 46 failed, 190 passed, 5 skipped, 1 xfailed, 53 warnings in 208.77s (0:03:28) =

Additional context

oke-aditya commented 3 years ago

A few stack traces from the run above.

I think we should use @torch.no_grad() in places. Trying this in #531.

E       RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:73] data. DefaultCPUAllocator: not enough memory: you tried to allocate 51380224 bytes. Buy new RAM!
 >   for chunk in iter(lambda: f.read(chunk_size), b''):
E   MemoryError
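The `@torch.no_grad()` idea mentioned above could be sketched as follows. The helper `predict_without_grad` and the tiny linear model are assumptions for illustration, not code from the repository; whether this is enough to avoid the allocator errors on the Windows runners is not confirmed here.

```python
import torch
from torch import nn


@torch.no_grad()
def predict_without_grad(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # Inside no_grad, autograd does not record operations, so intermediate
    # activations are not kept alive for a backward pass.
    model.eval()
    return model(batch)


# Hypothetical stand-in for a model under test.
model = nn.Linear(4, 2)
out = predict_without_grad(model, torch.rand(3, 4))
```

This only applies to tests that run a forward pass without training; tests that call the optimizer still need gradients.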

This is either a Python multiprocessing issue or a pickle issue.

c:\hostedtoolcache\windows\python\3.6.8\x64\lib\multiprocessing\popen_spawn_win32.py:65: in __init__
    reduction.dump(process_obj, to_child)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <Process(Process-19, initial daemon)>, file = <_io.BufferedWriter name=11>
protocol = None

    def dump(obj, file, protocol=None):
        '''Replacement for pickle.dump() using ForkingPickler.'''
>       ForkingPickler(file, protocol).dump(obj)
E       BrokenPipeError: [Errno 32] Broken pipe
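The trace above goes through `popen_spawn_win32`, where the `Process` object is pickled and written to the child process. As a rough way to check the pickle side of that hypothesis, the sketch below tests whether an object survives `pickle.dumps`. The helper name `can_pickle` is hypothetical, and this only rules pickling in or out as a cause; a broken pipe can also mean the child process died for another reason.

```python
import pickle


def can_pickle(obj) -> bool:
    """Return True if obj can be serialized with pickle, else False."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# An importable function is pickled by reference and works.
builtin_ok = can_pickle(abs)

# A lambda cannot be looked up by name, so pickling it fails.
lambda_ok = can_pickle(lambda x: 2 * x)
```

Under the spawn start method, anything attached to the process object (datasets, transforms, collate functions) has to pass this kind of check.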

Another possible reason is that we are trying to save a checkpoint on Windows, which requires moving the file object from RAM to disk, hence these errors. Maybe disabling checkpoints or model saving wherever it is unnecessary would avoid them?

ananyahjha93 commented 2 years ago

We are deprecating the self-supervised learning paper implementations in Bolts and instead provide a VISSL integration in PyTorch Lightning Flash, which can be used to train models with self-supervised learning algorithms. You can refer to the documentation here.