benedettaliberatori / T3AL

Official implementation of "Test-Time Zero-Shot Temporal Action Localization", CVPR 2024
https://benedettaliberatori.github.io/T3AL/

checkpoints not found #2

Closed tinybitdesu closed 5 months ago

tinybitdesu commented 5 months ago

Hello, I am currently trying to reproduce your results. However, I ran into a problem when running bash train.py following your README. I got this error: 2 %7UYK2E$I8V~ZBU_D3UT8. I used the training-free model config tf_thumos.yaml, so I believe I do not need to train it, but where can I find the checkpoints to start the process? Thanks for your help and your diligent work!

benedettaliberatori commented 5 months ago

Hi and thank you for your interest in our work!

We do not provide checkpoints because our model is adapted at test time on each video sample, so there is no final trained model. This is true both for the training-free baseline (tf_<dataset_name>.yaml) and for the main method (<dataset_name>.yaml). When you run it, it loads pre-trained coca_ViT-L-14 weights from open_clip.

Are you following all the steps in the README? Could you please include the command that causes this error?

tinybitdesu commented 5 months ago

Hello, following your kind instructions I got past the previous bug. However, I have hit a new problem: there seem to be no predicted labels in my DataFrame. As you can see below, the DataFrame is empty, with no rows or columns, which causes pandas to raise a KeyError. I would like to know how to proceed with debugging this. Thank you again for any help! I have listed the full config and log below, which may be helpful.
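The failure mode can be reproduced in isolation: selecting a column from an empty, column-less DataFrame raises exactly the KeyError shown in the traceback below. The guard at the end is a hypothetical debugging aid, not part of the T3AL code:

```python
import pandas as pd

# Minimal reproduction: evaluate() receives a DataFrame of predictions, and
# when no segment was predicted (no rows, no columns), column selection fails.
predicted = pd.DataFrame()  # what the log shows: "Empty DataFrame"

try:
    predicted["label"].unique()
except KeyError as exc:
    print("KeyError:", exc)  # same exception as in the traceback

# A defensive check before evaluation makes the real cause visible
# (hypothetical guard, not part of the repo):
if predicted.empty or "label" not in predicted.columns:
    print("No predictions produced; check that features were loaded correctly.")
```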

[2024-05-15 14:48:45,715][src.utils.utils][INFO] - Enforcing tags!
[2024-05-15 14:48:45,722][src.utils.utils][INFO] - Printing config tree with Rich!
CONFIG
├── data
│   └── _target_: src.data.custom_datamodule.T3ALDataModule
│       batch_size: 1
│       num_workers: 1
│       pin_memory: false
│       nsplit: 0
│       config: ./config/thumos.yaml
├── model
│   └── _target_: src.models.tf_method_module.T3AL0Module
│       split: 0
│       dataset: thumos
│       setting: 50
│       video_path: .../T3AL/thumos_features
│       net:
│         _target_: src.models.components.tf_method.T3AL0Net
│         dataset: thumos
│         split: 0
│         setting: 50
│         kernel_size: 20
│         stride: 20
│         visualize: false
│         normalize: true
│         remove_background: true
│         video_path: .../T3AL/thumos_features
├── callbacks
│   └── model_checkpoint:
│         _target_: lightning.pytorch.callbacks.ModelCheckpoint
│         dirpath: .../T3AL/logs/train/runs/2024-05-15_14-48-45/checkpoints
│         filename: epoch_{epoch:03d}
│         monitor: null
│         verbose: false
│         save_last: true
│         save_top_k: 1
│         mode: max
│         auto_insert_metric_name: false
│         save_weights_only: false
│         every_n_train_steps: null
│         train_time_interval: null
│         every_n_epochs: null
│         save_on_train_epoch_end: null
│       model_summary:
│         _target_: lightning.pytorch.callbacks.RichModelSummary
│         max_depth: -1
│       rich_progress_bar:
│         _target_: lightning.pytorch.callbacks.RichProgressBar
├── logger
│   └── wandb:
│         _target_: lightning.pytorch.loggers.wandb.WandbLogger
│         save_dir: .../T3AL/logs/train/runs/2024-05-15_14-48-45
│         offline: false
│         id: null
│         anonymous: null
│         project: TAD
│         log_model: false
│         prefix: ''
│         group: ''
│         tags: []
│         job_type: ''
│         name: thumos
├── trainer
│   └── _target_: lightning.pytorch.trainer.Trainer
│       default_root_dir: .../T3AL/logs/train/runs/2024-05-15_14-48-45
│       min_epochs: 1
│       max_epochs: 0
│       accelerator: cpu
│       devices: 1
│       check_val_every_n_epoch: 1
│       deterministic: false
├── paths
│   └── root_dir: .../T3AL
│       data_dir: .../T3AL/data/
│       log_dir: .../T3AL/logs/
│       output_dir: .../T3AL/logs/train/runs/2024-05-15_14-48-45
│       work_dir: .../T3AL
├── extras
│   └── ignore_warnings: false
│       enforce_tags: true
│       print_config: true
├── task_name
│   └── train
├── tags
│   └── ['dev']
├── train
│   └── False
├── test
│   └── True
├── compile
│   └── False
├── ckpt_path
│   └── ''
├── seed
│   └── 12345
└── exp_name
    └── thumos
Seed set to 12345
[2024-05-15 14:48:45,836][__main__][INFO] - Instantiating datamodule
[2024-05-15 14:48:46,155][__main__][INFO] - Instantiating model
[2024-05-15 14:48:46,938][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmpgyr4cmla
[2024-05-15 14:48:46,938][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmpgyr4cmla/_remote_module_non_scriptable.py
[2024-05-15 14:48:47,188][root][INFO] - Loaded coca_ViT-L-14 model config.
[2024-05-15 14:48:53,764][root][INFO] - Loading pretrained coca_ViT-L-14 weights (mscoco_finetuned_laion2B-s13B-b90k).
Loaded COCA model
[2024-05-15 14:48:55,982][__main__][INFO] - Instantiating callbacks...
[2024-05-15 14:48:55,982][src.utils.instantiators][INFO] - Instantiating callback
[2024-05-15 14:48:55,986][src.utils.instantiators][INFO] - Instantiating callback
[2024-05-15 14:48:55,986][src.utils.instantiators][INFO] - Instantiating callback
[2024-05-15 14:48:55,987][__main__][INFO] - Instantiating loggers...
[2024-05-15 14:48:55,987][src.utils.instantiators][INFO] - Instantiating logger
[2024-05-15 14:48:56,047][__main__][INFO] - Instantiating trainer
.../miniconda3/lib/python3.8/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python src/train.py experiment=tf_thumos data=thumos model. ...
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
.../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/setup.py:187: GPU available but not used. You can set it by doing Trainer(accelerator='gpu').
[2024-05-15 14:48:56,140][__main__][INFO] - Logging hyperparameters!
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: WARNING resume will be ignored since W&B syncing is set to offline. Starting a new run with run id dyxora2w.
wandb: Tracking run with wandb version 0.17.0
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
[2024-05-15 14:48:59,596][__main__][INFO] - Starting testing!
[2024-05-15 14:48:59,597][__main__][WARNING] - Best ckpt not found! Using current weights for testing...
No of videos in train is 214
Loading train Video Information ...
No of class 10
No of videos in validation is 203
Loading validation Video Information ...
No of class 10
.../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument to num_workers=19 in the DataLoader to improve performance.
Start testing...
                   video-id  t-start  t-end     label
0  video_validation_0000365     18.1   24.3  HighJump
1  video_validation_0000365     29.6   33.3  HighJump
2  video_validation_0000365     69.7   77.3  HighJump
3  video_validation_0000365     80.8   84.3  HighJump
4  video_validation_0000365    110.4  116.2  HighJump
Empty DataFrame
Columns: []
Index: []
Ground truth labels: ['HighJump' 'PoleVault' 'TennisSwing' 'GolfSwing' 'HammerThrow'
 'Billiards' 'BaseballPitch' 'CleanAndJerk' 'ThrowDiscus' 'SoccerPenalty']
Testing ━━━━━━━━━━━━━━━━━━━━━━━━ 203/203 0:00:10 • 0:00:00 19.94it/s
[2024-05-15 14:49:10,282][src.utils.utils][ERROR] - Traceback (most recent call last):
  File ".../T3AL/src/utils/utils.py", line 65, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "src/train.py", line 103, in train
    trainer.test(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 754, in test
    return call._call_and_handle_interrupt(
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 794, in _test_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
    return self._evaluation_loop.run()
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 142, in run
    return self.on_run_end()
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 254, in on_run_end
    self._on_evaluation_epoch_end()
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 334, in _on_evaluation_epoch_end
    call._call_lightning_module_hook(trainer, hook_name)
  File ".../miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File ".../T3AL/src/models/tf_method_module.py", line 117, in on_test_epoch_end
    aps, tious = evaluate(self.dataset, self.predictions, self.split, self.setting, self.video_path)
  File ".../T3AL/src/evaluate.py", line 226, in evaluate
    print("Predicted labels: ", predicted["label"].unique())
  File ".../miniconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
  File ".../miniconda3/lib/python3.8/site-packages/pandas/core/indexes/range.py", line 349, in get_loc
    raise KeyError(key)
KeyError: 'label'
[2024-05-15 14:49:10,290][src.utils.utils][INFO] - Output dir: .../T3AL/logs/train/runs/2024-05-15_14-48-45
[2024-05-15 14:49:10,290][src.utils.utils][INFO] - Closing wandb!
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync .../T3AL/logs/train/runs/2024-05-15_14-48-45/wandb/offline-run-20240515_144858-dyxora2w
wandb: Find logs at: ./logs/train/runs/2024-05-15_14-48-45/wandb/offline-run-20240515_144858-dyxora2w/logs
Error executing job with overrides: ['experiment=tf_thumos', 'data=thumos', 'model.video_path=.../T3AL/thumos_features']
Traceback (most recent call last):
  File "src/train.py", line 121, in main
    metric_dict, _ = train(cfg)
  File ".../T3AL/src/utils/utils.py", line 75, in wrap
    raise ex
  File ".../T3AL/src/utils/utils.py", line 65, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  [... same frames as the traceback above ...]
KeyError: 'label'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.