behavioral-data / Homekit2020

MIT License

Running First Job: Can't Pickle Local Object #11

Closed: mnodini closed this issue 1 year ago

mnodini commented 2 years ago

Error when running first job

mnodini commented 2 years ago

Started up the instance the same as usual:

make create_environment 
activate Homekit2020 
pip install -e . 

However, today I'm running into the "Can't pickle local object" error again, and I'm not sure why.

I can access the GPUs from within the conda environment with this setup; the only problem is the pickle error:

(Homekit2020) jovyan@jupyter-mnodini-40ucsd-2eedu:/cephfs/tempredict-shared-space/Homekit2020$ python3
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> use_cuda = torch.cuda.is_available()
>>> if use_cuda:
...     print('__CUDNN VERSION:', torch.backends.cudnn.version())
...     print('__Number CUDA Devices:', torch.cuda.device_count())
...     print('__CUDA Device Name:',torch.cuda.get_device_name(0))
...     print('__CUDA Device Total Memory [GB]:',torch.cuda.get_device_properties(0).total_memory/1e9)
... 
__CUDNN VERSION: 8401
__Number CUDA Devices: 4
__CUDA Device Name: NVIDIA GeForce RTX 2080 Ti
__CUDA Device Total Memory [GB]: 11.554848768
>>> 
jovyan@jupyter-mnodini-40ucsd-2eedu:/cephfs/tempredict-shared-space/Homekit2020$ conda activate Homekit2020
(Homekit2020) jovyan@jupyter-mnodini-40ucsd-2eedu:/cephfs/tempredict-shared-space/Homekit2020$ python3 src/models/train.py fit `# Main entry point`         --config configs/tasks/HomekitPredictFluPos.yaml `# Configures the task`        --config configs/models/CNNToTransformerClassifier.yaml `# Configures the model`        --data.train_path  $PWD/data/processed/split/audere_split_2020_02_10/train_7_day  `# Train data location`        --data.val_path $PWD/data/processed/split/audere_split_2020_02_10/eval_7_day  `# Validation data location`\
> 
Global seed set to 999
09/09/2022 19:29:38 - INFO - src.data.utils -   Reading lab_results_with_triggerdate...
wandb: Currently logged in as: mnodini. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.3
wandb: Run data is saved locally in /cephfs/tempredict-shared-space/Homekit2020/wandb/run-20220909_192940-18war1bp
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run test-project
wandb: ⭐️ View project at https://wandb.ai/mnodini/test-project
wandb: 🚀 View run at https://wandb.ai/mnodini/test-project/runs/18war1bp
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Traceback (most recent call last):
  File "/cephfs/tempredict-shared-space/Homekit2020/src/models/train.py", line 200, in <module>
    cli = CLI(trainer_defaults=trainer_defaults,
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/utilities/cli.py", line 157, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 350, in __init__
    self._run_subcommand(self.subcommand)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/cli.py", line 626, in _run_subcommand
    fn(**fn_kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 700, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 652, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/opt/conda/envs/Homekit2020/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
    process.start()
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/Homekit2020/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'ActivityTask.__init__.<locals>._transform_row'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:                                                                                
wandb: 
wandb: Run summary:
wandb: model CNNToTransformerClas...
wandb:  task PredictFluPos
wandb: 
wandb: Synced test-project: https://wandb.ai/mnodini/test-project/runs/18war1bp
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220909_192940-18war1bp/logs
(Homekit2020) jovyan@jupyter-mnodini-40ucsd-2eedu:/cephfs/tempredict-shared-space/Homekit2020$ 
TheMikeMerrill commented 1 year ago

Fixed by #23
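
For anyone who hits the same traceback: the failure comes from multiprocessing's spawn launcher pickling the objects it hands to each worker process, and pickle cannot serialize a function defined inside another function (here `_transform_row`, defined inside `ActivityTask.__init__`). The change made in #23 is not shown in this thread, so the following is only a minimal sketch of the general problem and one common workaround. The class names `LocalFnTask` and `ModuleFnTask`, the `scale` parameter, and the `is_picklable` helper are hypothetical and are not taken from the Homekit2020 code.

```python
import pickle
from functools import partial


class LocalFnTask:
    """Hypothetical stand-in for a task that defines a helper inside __init__."""

    def __init__(self, scale):
        # A function defined here gets a qualified name containing "<locals>",
        # which pickle cannot resolve by import, so pickling the instance fails.
        def _transform_row(row):
            return [x * scale for x in row]

        self.transform = _transform_row


def _transform_row(row, scale):
    """Module-level helper: pickle stores it by its importable name."""
    return [x * scale for x in row]


class ModuleFnTask:
    """Same behavior, with the helper at module level and bound via partial."""

    def __init__(self, scale):
        # functools.partial objects are picklable when the wrapped function
        # and the bound arguments are picklable.
        self.transform = partial(_transform_row, scale=scale)


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False if pickling fails."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # Python 3.9 raises AttributeError for local objects, as in the
        # traceback above; PicklingError is caught too in case other
        # versions report the failure that way.
        return False


if __name__ == "__main__":
    print(is_picklable(LocalFnTask(2)))   # local helper: cannot be pickled
    print(is_picklable(ModuleFnTask(2)))  # module-level helper: can be pickled
    print(ModuleFnTask(2).transform([1, 2, 3]))
```

Turning the helper into a regular method of the class would likely work as well, since bound methods of a picklable instance can be pickled. Which option fits the actual `ActivityTask` code depends on what `_transform_row` closes over.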