I found the source of the issue: when I comment out `self.transform = ...`, the pickling error goes away, but a new error appears because of the missing `self.transform` assignment.
(Homekit2020) jovyan@jupyter-mnodini-40ucsd-2eedu:/cephfs/tempredict-shared-space/Homekit2020$ python src/models/train.py fit `# Main entry point` --config configs/tasks/HomekitPredictFluPos.yaml `# Configures the task` --config configs/models/CNNToTransformerClassifier.yaml `# Configures the model` --data.train_path $PWD/data/processed/split/audere_split_2020_02_10/train_7_day `# Train data location` --data.val_path $PWD/data/processed/split/audere_split_2020_02_10/eval_7_day `# Validation data location`
Global seed set to 999
09/13/2022 20:06:28 - INFO - src.data.utils - Reading lab_results_with_triggerdate...
wandb: Currently logged in as: mnodini. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.3
wandb: Run data is saved locally in /cephfs/tempredict-shared-space/Homekit2020/wandb/run-20220913_200636-2032ya92
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run test-project
wandb: ⭐️ View project at https://wandb.ai/mnodini/test-project
wandb: 🚀 View run at https://wandb.ai/mnodini/test-project/runs/2032ya92
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
09/13/2022 20:06:57 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
09/13/2022 20:07:02 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params
------------------------------------------------------------
0 | train_metrics | TorchMetricClassification | 0
1 | val_metrics | TorchMetricClassification | 0
2 | test_metrics | TorchMetricClassification | 0
3 | criterion | CrossEntropyLoss | 0
4 | encoder | CNNToTransformerEncoder | 43.2 K
5 | head | ClassificationModule | 2.6 K
------------------------------------------------------------
45.8 K Trainable params
0 Non-trainable params
45.8 K Total params
0.183 Total estimated model params size (MB)
09/13/2022 20:07:21 - INFO - wandb - multiprocessing start_methods=fork,spawn,forkserver, using: spawn
09/13/2022 20:07:21 - INFO - wandb - config_cb None None {'metric_class': 'TorchMetricClassification', 'bootstrap_val_metrics': True, 'learning_rate': 0.001, 'warmup_steps': 20, 'batch_size': 800, 'input_shape': [10080, 8], 'num_attention_heads': 4, 'num_hidden_layers': 2, 'kernel_sizes': [5, 5, 2], 'out_channels': [8, 16, 32], 'stride_sizes': [5, 3, 2], 'dropout_rate': 0.3, 'num_labels': 2, 'positional_encoding': False, 'pretrained_ckpt_path': 'None', 'fields': ['heart_rate', 'missing_heart_rate', 'missing_steps', 'sleep_classic_0', 'sleep_classic_1', 'sleep_classic_2', 'sleep_classic_3', 'steps'], 'train_path': '/cephfs/tempredict-shared-space/Homekit2020/data/processed/split/audere_split_2020_02_10/train_7_day', 'val_path': '/cephfs/tempredict-shared-space/Homekit2020/data/processed/split/audere_split_2020_02_10/eval_7_day', 'test_path': 'None', 'downsample_negative_frac': 'None', 'shape': 'None', 'normalize_numerical': True, 'append_daily_features': False, 'daily_features_path': 'None', 'backend': 'petastorm', 'activity_level': 'minute'}
Traceback (most recent call last):
File "/cephfs/tempredict-shared-space/Homekit2020/src/models/train.py", line 200, in <module>
cli = CLI(trainer_defaults=trainer_defaults,
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/utilities/cli.py", line 157, in __init__
super().__init__(*args, **kwargs)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 350, in __init__
self._run_subcommand(self.subcommand)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 626, in _run_subcommand
fn(**fn_kwargs)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 700, in fit
self._call_and_handle_interrupt(
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 652, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
mp.start_processes(
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
results = function(*args, **kwargs)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1282, in _run_train
self.fit_loop.run()
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 195, in run
self.on_run_start(*args, **kwargs)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 210, in on_run_start
self.trainer.reset_train_dataloader(self.trainer.lightning_module)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1811, in reset_train_dataloader
self.train_dataloader = self._data_connector._request_dataloader(RunningStage.TRAINING)
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 430, in _request_dataloader
dataloader = source.dataloader()
File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 507, in dataloader
return method()
File "/cephfs/tempredict-shared-space/Homekit2020/src/models/tasks.py", line 365, in train_dataloader
return PetastormDataLoader(make_reader(self.train_url,transform_spec=self.transform,
AttributeError: 'PredictFluPos' object has no attribute 'transform'
This Stack Overflow post highlights a similar issue: https://stackoverflow.com/questions/69190970/python-attribute-error-cant-pickle-local-object-using-multiprocessing. I'm not sure how to implement the fix, though.
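For context, here is a minimal, self-contained reproduction of the failure mode the linked post describes (the function names below are illustrative, not taken from the Homekit2020 code): pickle serializes functions by reference to a module-level name, so it refuses to serialize a function defined inside another function, which is what happens when DDP spawns worker processes and tries to serialize the task object.

```python
import pickle

def make_transform():
    # A function defined inside another function is a "local object";
    # pickle cannot serialize it because it has no module-level name.
    def _transform_row(row):
        return row
    return _transform_row

try:
    pickle.dumps(make_transform())
except AttributeError as err:
    # AttributeError: Can't pickle local object 'make_transform.<locals>._transform_row'
    print(err)
```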
+1
I'm having the same error, but only when using multiple GPUs (DDP is automatically employed when using multiple GPUs). When running the model on a single GPU, the error disappears.
Yup, this is a known bug when training on multiple GPUs. To be honest, I'm not sure how to fix it and won't be able to look into it deeply for a few weeks. For now, run on a single GPU (or fix it and submit a PR).
I think the problem could be that `self.transform` is a class attribute and not an instance attribute. When things get parallelized, it may not be able to access the class attributes.
If you copy the code from task.py into a notebook and run `dir(ActivityTask)`, you'll see that the method `_transform_row` isn't listed among the object's properties and methods, maybe because it's hidden behind an if statement: `if self.backend == "petastorm":`
I think the main issue is that nested functions can't be pickled in Python, so when the task is serialized for transfer between processes everything breaks: https://stackoverflow.com/questions/12019961/python-pickling-nested-functions

The solution would be to rewrite `_transform_row` as a callable object (as in the above SO post) and pass that to `TransformSpec` instead; a rough sketch of that pattern is below.
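Not a tested fix, but the callable-object pattern might look roughly like this. The class name, the `fields` attribute, and the body of `__call__` are placeholders; the real per-row logic would come from the existing `_transform_row`:

```python
from petastorm import TransformSpec

class TransformRow:
    """Picklable replacement for the nested _transform_row function.

    A top-level class can be pickled (only its attributes are serialized),
    so the task object survives the spawn that DDP uses.
    """

    def __init__(self, fields):
        self.fields = fields  # plain data attributes pickle fine

    def __call__(self, row):
        # Placeholder: apply whatever per-row preprocessing
        # _transform_row currently does and return the result.
        return {name: row[name] for name in self.fields}

# Then, inside the task's __init__, instead of defining a nested function:
# self.transform = TransformSpec(TransformRow(self.fields))
```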
It might also be possible to delete `_transform_row` from the namespace after `self.transform` is created, but that might not work depending on whether `TransformSpec` copies the function or carries a pointer to it.
But like I said, feel free to experiment! I'd love to see pull requests from the community :)
@mnodini Were you able to work out a fix for this? Multi-GPU training is a priority for us, so if you haven't already, I might get started on a patch.
Worked on it for a bit, but it didn't lead to anything. Best of luck!
Should be fixed now!
From reading online, it seems like this could be a multiprocessing issue. I've run into this before, and it seems to happen when you have access to more than one GPU.