behavioral-data / Homekit2020


Can't Pickle Local Object 'ActivityTask.__init__.<locals>._transform_row' when running first job #17

Closed mnodini closed 1 year ago

mnodini commented 2 years ago

From reading online, this seems like it could be a multiprocessing issue. I've run into it before, and it appears to happen when more than one GPU is available.


(Homekit2020) jovyan@jupyter-mnodini-40ucsd-2eedu:~/Tempredict-Shared-PersistentStorage/Homekit2020$ python src/models/train.py fit `# Main entry point` \
>         --config configs/tasks/HomekitPredictFluPos.yaml `# Configures the task`\
>         --config configs/models/CNNToTransformerClassifier.yaml `# Configures the model`\
>         --data.train_path  $PWD/data/processed/split/audere_split_2020_02_10/train_7_day  `# Train data location`\
>         --data.val_path $PWD/data/processed/split/audere_split_2020_02_10/eval_7_day  `# Validation data location`\
> 
Global seed set to 999
09/13/2022 17:34:35 - INFO - src.data.utils -   Reading lab_results_with_triggerdate...

wandb: Currently logged in as: mnodini. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.3
wandb: Run data is saved locally in /cephfs/tempredict-shared-space/Homekit2020/wandb/run-20220913_173437-2m1pp2jh
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run test-project
wandb: ā­ļø View project at https://wandb.ai/mnodini/test-project
wandb: šŸš€ View run at https://wandb.ai/mnodini/test-project/runs/2m1pp2jh
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Traceback (most recent call last):
  File "/cephfs/tempredict-shared-space/Homekit2020/src/models/train.py", line 200, in <module>
    cli = CLI(trainer_defaults=trainer_defaults,
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/utilities/cli.py", line 157, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 350, in __init__
    self._run_subcommand(self.subcommand)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 626, in _run_subcommand
    fn(**fn_kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 700, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 652, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
    process.start()
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'ActivityTask.__init__.<locals>._transform_row'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:                                                                                
wandb: 
wandb: Run summary:
wandb: model CNNToTransformerClas...
wandb:  task PredictFluPos
wandb: 
wandb: Synced test-project: https://wandb.ai/mnodini/test-project/runs/2m1pp2jh
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220913_173437-2m1pp2jh/logs
mnodini commented 2 years ago

I found the source of the issue:

(screenshot of the ActivityTask.__init__ code where self.transform is assigned from the nested _transform_row function)

When I comment out self.transform = ..., the pickling error goes away, but a different error is raised because self.transform is never assigned:


(Homekit2020) jovyan@jupyter-mnodini-40ucsd-2eedu:/cephfs/tempredict-shared-space/Homekit2020$ python src/models/train.py fit `# Main entry point`         --config configs/tasks/HomekitPredictFluPos.yaml `# Configures the task`        --config configs/models/CNNToTransformerClassifier.yaml `# Configures the model`        --data.train_path  $PWD/data/processed/split/audere_split_2020_02_10/train_7_day  `# Train data location`        --data.val_path $PWD/data/processed/split/audere_split_2020_02_10/eval_7_day  `# Validation data location`\
> 

Global seed set to 999
09/13/2022 20:06:28 - INFO - src.data.utils -   Reading lab_results_with_triggerdate...
wandb: Currently logged in as: mnodini. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.3
wandb: Run data is saved locally in /cephfs/tempredict-shared-space/Homekit2020/wandb/run-20220913_200636-2032ya92
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run test-project
wandb: ā­ļø View project at https://wandb.ai/mnodini/test-project
wandb: šŸš€ View run at https://wandb.ai/mnodini/test-project/runs/2032ya92
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
09/13/2022 20:06:57 - INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 1
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
09/13/2022 20:07:02 - INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 2
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 3
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 0
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d -   Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d -   Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d -   Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
09/13/2022 20:07:06 - INFO - torch.distributed.distributed_c10d -   Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name          | Type                      | Params
------------------------------------------------------------
0 | train_metrics | TorchMetricClassification | 0     
1 | val_metrics   | TorchMetricClassification | 0     
2 | test_metrics  | TorchMetricClassification | 0     
3 | criterion     | CrossEntropyLoss          | 0     
4 | encoder       | CNNToTransformerEncoder   | 43.2 K
5 | head          | ClassificationModule      | 2.6 K 
------------------------------------------------------------
45.8 K    Trainable params
0         Non-trainable params
45.8 K    Total params
0.183     Total estimated model params size (MB)
09/13/2022 20:07:21 - INFO - wandb -   multiprocessing start_methods=fork,spawn,forkserver, using: spawn
09/13/2022 20:07:21 - INFO - wandb -   config_cb None None {'metric_class': 'TorchMetricClassification', 'bootstrap_val_metrics': True, 'learning_rate': 0.001, 'warmup_steps': 20, 'batch_size': 800, 'input_shape': [10080, 8], 'num_attention_heads': 4, 'num_hidden_layers': 2, 'kernel_sizes': [5, 5, 2], 'out_channels': [8, 16, 32], 'stride_sizes': [5, 3, 2], 'dropout_rate': 0.3, 'num_labels': 2, 'positional_encoding': False, 'pretrained_ckpt_path': 'None', 'fields': ['heart_rate', 'missing_heart_rate', 'missing_steps', 'sleep_classic_0', 'sleep_classic_1', 'sleep_classic_2', 'sleep_classic_3', 'steps'], 'train_path': '/cephfs/tempredict-shared-space/Homekit2020/data/processed/split/audere_split_2020_02_10/train_7_day', 'val_path': '/cephfs/tempredict-shared-space/Homekit2020/data/processed/split/audere_split_2020_02_10/eval_7_day', 'test_path': 'None', 'downsample_negative_frac': 'None', 'shape': 'None', 'normalize_numerical': True, 'append_daily_features': False, 'daily_features_path': 'None', 'backend': 'petastorm', 'activity_level': 'minute'}
Traceback (most recent call last):
  File "/cephfs/tempredict-shared-space/Homekit2020/src/models/train.py", line 200, in <module>
    cli = CLI(trainer_defaults=trainer_defaults,
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/utilities/cli.py", line 157, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 350, in __init__
    self._run_subcommand(self.subcommand)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 626, in _run_subcommand
    fn(**fn_kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 700, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 652, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    results = function(*args, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1282, in _run_train
    self.fit_loop.run()
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 195, in run
    self.on_run_start(*args, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 210, in on_run_start
    self.trainer.reset_train_dataloader(self.trainer.lightning_module)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1811, in reset_train_dataloader
    self.train_dataloader = self._data_connector._request_dataloader(RunningStage.TRAINING)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 430, in _request_dataloader
    dataloader = source.dataloader()
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 507, in dataloader
    return method()
  File "/cephfs/tempredict-shared-space/Homekit2020/src/models/tasks.py", line 365, in train_dataloader
    return PetastormDataLoader(make_reader(self.train_url,transform_spec=self.transform,
AttributeError: 'PredictFluPos' object has no attribute 'transform'
mnodini commented 2 years ago

This Stack Overflow post highlights a similar issue: https://stackoverflow.com/questions/69190970/python-attribute-error-cant-pickle-local-object-using-multiprocessing. I'm not sure how to implement the fix here, though.

safranchik commented 2 years ago

+1

I'm having the same error, but only when using multiple GPUs (DDP is employed automatically in that case). When running the model on a single GPU, the error disappears.

TheMikeMerrill commented 2 years ago

Yup, this is a known bug when training on multiple GPUs. To be honest, I'm not sure how to fix it and won't be able to look into it deeply for a few weeks. For now, run on a single GPU (or fix it and submit a PR šŸ˜…)

mnodini commented 2 years ago

I think the problem could be that self.transform is a class attribute and not an instance attribute. When things get parallelized, the spawned processes may not be able to access the class attributes.

If you copy the code from tasks.py into a notebook and run

dir(ActivityTask)

you'll see that _transform_row isn't listed among the properties and methods of the object, maybe because it's hidden behind an if statement?

if self.backend == "petastorm":

TheMikeMerrill commented 2 years ago

I think the main issue is that nested functions can't be pickled in Python, so when the task is serialized for transfer between processes everything breaks: https://stackoverflow.com/questions/12019961/python-pickling-nested-functions
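
A minimal repro outside the repo shows the same failure (illustrative only; the real closure lives in ActivityTask.__init__ in src/models/tasks.py):

import pickle

class ActivityTaskDemo:
    def __init__(self):
        # Nested function: it only exists in the local scope of __init__,
        # so pickle cannot reference it by qualified name.
        def _transform_row(row):
            return row
        self.transform = _transform_row

# Raises: AttributeError: Can't pickle local object
# 'ActivityTaskDemo.__init__.<locals>._transform_row'
pickle.dumps(ActivityTaskDemo())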

The solution would be to rewrite _transform_row as a callable object (as in the above SO post) and pass that to TransformSpec instead.
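
An untested sketch of that approach (the captured state and the per-row logic below are placeholders; the real ones live in src/models/tasks.py):

from petastorm.transform import TransformSpec

class _TransformRow:
    # Picklable, module-level callable that replaces the nested closure.
    # `fields` stands in for whatever state the original closure captured.
    def __init__(self, fields):
        self.fields = fields

    def __call__(self, row):
        # Illustrative per-row transformation; the real logic differs.
        return {k: row[k] for k in self.fields}

# Then, in ActivityTask.__init__, roughly:
# self.transform = TransformSpec(_TransformRow(self.fields))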

It might also be possible to delete _transform_row from the namespace after self.transform is created, but that might not work depending on whether TransformSpec copies the function or carries a pointer to it.

But like I said, feel free to experiment! I'd love to see pull requests from the community :)

TheMikeMerrill commented 2 years ago

@mnodini Were you able to work out a fix for this? Multi-GPU training is a priority for us, so if you haven't already I might get started on a patch.

mnodini commented 2 years ago

Worked on it for a bit, but it didn't lead to anything. Best of luck!

TheMikeMerrill commented 1 year ago

Should be fixed now!