Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Add osx-arm64 platform to conda-lock.yml file and GitHub Actions CI #164

Closed: weiji14 closed this 4 months ago

weiji14 commented 5 months ago

Support installation on macOS ARM64 devices (M1 chips) by:

- adding the osx-arm64 platform to the conda-lock.yml file
- running the test suite on macos-14 (ARM64) runners in the GitHub Actions CI

References:

Addresses https://github.com/Clay-foundation/model/issues/161 and extends #162.
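
For context, below is a hedged sketch of the two pieces this PR touches; the exact contents in the repo may differ, and the workflow path and runner labels here are assumptions. conda-lock supports a (nonstandard) top-level `platforms:` key in `environment.yml` that controls which platforms get solved into `conda-lock.yml`, and the CI matrix gains an Apple Silicon runner:

```yaml
# environment.yml (sketch): conda-lock's optional `platforms` key controls
# which platforms are solved into conda-lock.yml
platforms:
  - linux-64
  - osx-arm64

# .github/workflows/test.yml (sketch, path assumed): run the tests on an
# Apple Silicon (osx-arm64) runner alongside Linux
jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-22.04, macos-14]
    runs-on: ${{ matrix.os }}
```

The lock file would then be regenerated with something like `conda-lock lock --file environment.yml`.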

weiji14 commented 5 months ago

Test on macos-14 failing at https://github.com/Clay-foundation/model/actions/runs/8042592659/job/21963448881#step:4:79:

```python-traceback
============================= test session starts ==============================
platform darwin -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0 -- /Users/runner/micromamba/envs/claymodel/bin/python
cachedir: .pytest_cache
rootdir: /Users/runner/work/model/model
plugins: anyio-4.3.0
collecting ... collected 16 items

src/tests/test_callbacks.py::test_callbacks_wandb_log_mae_reconstruction PASSED [  6%]
src/tests/test_datamodule.py::test_datapipemodule[fit-train_dataloader-ClayDataModule] PASSED [ 12%]
src/tests/test_datamodule.py::test_datapipemodule[fit-train_dataloader-GeoTIFFDataPipeModule] PASSED [ 18%]
src/tests/test_datamodule.py::test_datapipemodule[predict-predict_dataloader-ClayDataModule] PASSED [ 25%]
src/tests/test_datamodule.py::test_datapipemodule[predict-predict_dataloader-GeoTIFFDataPipeModule] PASSED [ 31%]
src/tests/test_datamodule.py::test_geotiffdatapipemodule_list_from_s3_bucket PASSED [ 37%]
src/tests/test_model.py::test_model_vit_fit FAILED [ 43%]
src/tests/test_model.py::test_model_predict[mean-CLAYModule-32-true] FAILED [ 50%]
src/tests/test_model.py::test_model_predict[mean-ViTLitModule-16-mixed] FAILED [ 56%]
src/tests/test_model.py::test_model_predict[patch-CLAYModule-32-true] FAILED [ 62%]
src/tests/test_model.py::test_model_predict[patch-ViTLitModule-16-mixed] FAILED [ 68%]
src/tests/test_model.py::test_model_predict[group-CLAYModule-32-true] FAILED [ 75%]
src/tests/test_model.py::test_model_predict[group-ViTLitModule-16-mixed] FAILED [ 81%]
src/tests/test_trainer.py::test_cli_main[fit] PASSED [ 87%]
src/tests/test_trainer.py::test_cli_main[validate] PASSED [ 93%]
src/tests/test_trainer.py::test_cli_main[test] PASSED [100%]

=================================== FAILURES ===================================
______________________________ test_model_vit_fit ______________________________

datapipe = IterableWrapperIterDataPipe

    def test_model_vit_fit(datapipe):
        """
        Run a full train and validation loop using 1 batch.
        """
        # Get some random data
        dataloader = torchdata.dataloader2.DataLoader2(datapipe=datapipe)
        # Initialize model
        model: L.LightningModule = ViTLitModule()
        # Run tests in a temporary folder
        with tempfile.TemporaryDirectory() as tmpdirname:
            # Training
            trainer: L.Trainer = L.Trainer(
                accelerator="auto",
                devices=1,
                precision="16-mixed",
                fast_dev_run=True,
                default_root_dir=tmpdirname,
            )
>           trainer.fit(model=model, train_dataloaders=dataloader)

src/tests/test_model.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py:544: in fit
    call._call_and_handle_interrupt(
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py:44: in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py:580: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py:965: in _run
    self.strategy.setup(self)
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/strategies/single_device.py:77: in setup
    self.model_to_device()
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/strategies/single_device.py:74: in model_to_device
    self.model.to(self.root_device)
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/fabric/utilities/device_dtype_mixin.py:54: in to
    return super().to(*args, **kwargs)
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/torch/nn/modules/module.py:1160: in to
    return self._apply(convert)
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/torch/nn/modules/module.py:810: in _apply
    module._apply(fn)
    [... the module._apply(fn) frame recurses several more times ...]
../../../micromamba/envs/claymodel/lib/python3.11/site-packages/torch/nn/modules/module.py:833: in _apply
    param_applied = fn(param)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

t = Parameter containing: tensor([[[[-7.8660e-03, -3.7126e-03, 1.1882e-02, ..., 3.5900e-03, 2.2451e-02, 4....[ 1.9821e-02, -1.4641e-02, -3.9173e-02, ..., -1.5309e-02, -3.2961e-02, 2.5180e-02]]]], requires_grad=True)

    def convert(t):
        if convert_to_format is not None and t.dim() in (4, 5):
            return t.to(
                device,
                dtype if t.is_floating_point() or t.is_complex() else None,
                non_blocking,
                memory_format=convert_to_format,
            )
>       return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
E       RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 156.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

../../../micromamba/envs/claymodel/lib/python3.11/site-packages/torch/nn/modules/module.py:1158: RuntimeError
----------------------------- Captured stderr call -----------------------------
INFO: Using 16bit Automatic Mixed Precision (AMP)
INFO: GPU available: True (mps), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: Running in `fast_dev_run` mode: will run the requested loop using 1 batch(es). Logging and checkpointing is suppressed.

[... the "Captured log call" section repeats the same INFO messages; the six
parametrized test_model_predict[...] failures that follow each raise the same
RuntimeError while moving the model to the mps device in
self.model.to(self.root_device), trying to allocate 6.00 KB (CLAYModule-32-true)
or 156.00 MB (ViTLitModule-16-mixed) ...]

=============================== warnings summary ===============================
[... earlier warnings truncated in the captured log, including an
"Unable to serialize instance" warning ...]
src/tests/test_trainer.py::test_cli_main[validate]
  /Users/runner/micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['--verbose', 'src/tests/'], args=['validate', '--print_config=skip_null'].

src/tests/test_trainer.py::test_cli_main[test]
  /Users/runner/micromamba/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['--verbose', 'src/tests/'], args=['test', '--print_config=skip_null'].

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED src/tests/test_model.py::test_model_vit_fit - RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 156.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
FAILED src/tests/test_model.py::test_model_predict[mean-CLAYModule-32-true] - RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 6.00 KB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
FAILED src/tests/test_model.py::test_model_predict[mean-ViTLitModule-16-mixed] - RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 156.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
FAILED src/tests/test_model.py::test_model_predict[patch-CLAYModule-32-true] - RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 6.00 KB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
FAILED src/tests/test_model.py::test_model_predict[patch-ViTLitModule-16-mixed] - RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 156.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
FAILED src/tests/test_model.py::test_model_predict[group-CLAYModule-32-true] - RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 6.00 KB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
FAILED src/tests/test_model.py::test_model_predict[group-ViTLitModule-16-mixed] - RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 156.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
============= 7 failed, 9 passed, 28 warnings in 67.73s (0:01:07) ==============
```

The main error message is `RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB)`. We might need to try setting `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` as suggested, or `xfail` the tests in test_model.py on macos-14.
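
For illustration, a minimal sketch of those two workarounds (the marker placement and env-var setup here are assumptions, not the actual diff):

```python
import os
import platform

import pytest

# Option 1 (assumption: this must run before torch initializes the MPS
# allocator): lift the allocator's upper limit, as the error message suggests.
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")

# Option 2: mark the MPS-hungry tests as expected failures on Apple Silicon macOS.
ON_MACOS_ARM64 = platform.system() == "Darwin" and platform.machine() == "arm64"

@pytest.mark.xfail(
    condition=ON_MACOS_ARM64,
    raises=RuntimeError,
    reason="MPS backend out of memory on GitHub Actions macos-14 runners",
)
def test_model_vit_fit(datapipe):
    ...
```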

@leothomas, if you have time, could you try installing from the `environment.yml`/`conda-lock.yml` files in this branch on your macOS M1 computer, and see if the docs/partial-inputs.ipynb notebook works? The torchvision/torchdata incompatibility on conda-forge seems to have gone away today after https://github.com/conda-forge/torchvision-feedstock/pull/89.

leothomas commented 4 months ago

Did a fresh install with micromamba and was able to run the tests without errors on this branch, although I just realized that I have a Mac M2 and not an M1 (not sure if that makes an important difference).

weiji14 commented 4 months ago

> Did a fresh install with micromamba and was able to run the tests without errors on this branch, although I just realized that I have a Mac M2 and not an M1 (not sure if that makes an important difference).

Cool, an M2 should be fine too (it probably has more memory than an M1). I tried setting `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` in the CI but it still fails. Looking at https://github.com/actions/runner-images/issues/9254#issuecomment-1936326374 and https://discuss.pytorch.org/t/mps-back-end-out-of-memory-on-github-action/189773/2, it seems like the GitHub Actions runners don't have access to the underlying Metal Performance Shaders (MPS) hardware, unfortunately, so we might need to fall back to using the CPU on the macos-14 CI.
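
A possible shape for that CPU fallback, as a hedged sketch (the `pick_accelerator` helper is made up for illustration):

```python
import lightning as L
import torch

def pick_accelerator() -> str:
    """Return "cpu" when MPS is advertised but unusable, else let Lightning decide."""
    if torch.backends.mps.is_available():
        try:
            torch.zeros(1, device="mps")  # tiny probe allocation
        except RuntimeError:  # e.g. "MPS backend out of memory" on hosted runners
            return "cpu"
    return "auto"

trainer = L.Trainer(accelerator=pick_accelerator(), devices=1, fast_dev_run=True)
```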

weiji14 commented 4 months ago

> Looks good to me. I'm approving, but I don't know if you want to wait for @leothomas to test using his M1 Mac, since mine is pre-M1.

Thanks @chuckwondo for reviewing. I'll merge this in first so that a Mac M1 user can get the install working on their device (https://github.com/Clay-foundation/model/issues/161#issuecomment-2002602847), and will let Leo test things later once the changes here get incorporated into #166, as mentioned at https://github.com/Clay-foundation/model/pull/164#discussion_r1527612297.