drivendataorg / zamba

A Python package for identifying 42 kinds of animals, training custom models, and estimating distance from camera trap videos
https://zamba.drivendata.org/docs/stable/
MIT License

Error using custom dataset and multi-class labels #309

Closed. CrazyGeG closed this issue 6 months ago.

CrazyGeG commented 6 months ago

Hi, I have tried to train zamba on another dataset that has multi-class labels; the labels in my dataset are not a subset of the zamba classes. When I train zamba on my dataset with the time_distributed model, zamba does not seem able to read the multi-class labels in the label column of my train.csv, and it gives me an error like:

Not all species have enough videos to allocate into the following splits: train, val, holdout. A minimum of 3 videos per label is required. Found the following counts: {"['amphibian', 'mammal']": 1, "['bird', 'mammal', 'sea animal']": 2, "['fish', 'insect']": 2}. Either remove these labels or add more videos. (type=value_error)

The command I use is: zamba train --data-dir --labels --save-dir --model time_distributed

pjbull commented 6 months ago

Hi @CrazyGeG, if I am understanding your question right, you are asking about the "multi-label" scenario, where there are multiple species in a single video. In this case, you should have a separate row for each label.

For example, the video chimp_elephant.MP4 below appears twice, once for each species in the video:

filepath,label
blank.MP4,blank
leopard.MP4,leopard
chimp_elephant.MP4,chimpanzee_bonobo
chimp_elephant.MP4,elephant
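
If your label column currently contains stringified lists (as the counts in the error message above suggest), a small script can expand it into one row per filepath/label pair. The following is only a minimal sketch, assuming pandas is available and the filepath/label column names from this thread; the file names are placeholders:

import ast

import pandas as pd

# Hypothetical input: a labels file whose "label" column holds stringified lists,
# e.g. "['bird', 'mammal', 'sea animal']" as in the error message above.
df = pd.read_csv("train.csv")

# Parse each stringified list back into a Python list, then explode it so that
# every (filepath, label) pair becomes its own row, matching the example above.
df["label"] = df["label"].apply(ast.literal_eval)
df = df.explode("label").reset_index(drop=True)

df.to_csv("train_one_row_per_label.csv", index=False)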
CrazyGeG commented 6 months ago

Hi @pjbull, thank you so much!

CrazyGeG commented 6 months ago

I'm not sure why the process is freezing:

[Screenshot: 2024-03-01 at 3:19:05 PM]

I'm running on 2 A40 GPUs with 2 workers.

CrazyGeG commented 6 months ago

@pjbull

pjbull commented 6 months ago

Hi @CrazyGeG, in the future please provide full logs and configurations. Otherwise, we can't tell what is happening.

Do you see the same thing with num_workers=1? It could be a deadlock.

Finally, for open source projects it is generally not recommended to @ particular maintainers.

CrazyGeG commented 6 months ago

Hi, thank you. I changed the number of workers to 1 and got a different error:

2024-03-01 15:18:07.073 | INFO | zamba.data.video:ensure_frame_number:113 - Duplicating last frame 4 times (original: 12, requested: 16).
2024-03-01 15:18:08.289 | INFO | zamba.data.video:ensure_frame_number:113 - Duplicating last frame 13 times (original: 3, requested: 16).
2024-03-01 15:18:14.417 | INFO | zamba.data.video:ensure_frame_number:113 - Duplicating last frame 6 times (original: 10, requested: 16).
2024-03-01 15:18:21.075 | INFO | zamba.data.video:ensure_frame_number:113 - Duplicating last frame 12 times (original: 4, requested: 16).
2024-03-01 15:18:26.402 | INFO | zamba.data.video:ensure_frame_number:113 - Duplicating last frame 14 times (original: 2, requested: 16).
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800381 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800381 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff275a9cd87 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff276c446e6 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff276c47c3d in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff276c48839 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7ff2c449de95 in /network/rit/lab/miniconda3/envs/zamba/bin/../lib/libstdc++.so.6)
frame #5: + 0x81da (0x7ff2d47c81da in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff2d3caae73 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800381 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff275a9cd87 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff276c446e6 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff276c47c3d in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff276c48839 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7ff2c449de95 in /network/rit/lab/miniconda3/envs/zamba/bin/../lib/libstdc++.so.6)
frame #5: + 0x81da (0x7ff2d47c81da in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff2d3caae73 in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff275a9cd87 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7ff27699eb11 in /network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7ff2c449de95 in /network/rit/lab/miniconda3/envs/zamba/bin/../lib/libstdc++.so.6)
frame #3: + 0x81da (0x7ff2d47c81da in /lib64/libpthread.so.0)
frame #4: clone + 0x43 (0x7ff2d3caae73 in /lib64/libc.so.6)

pjbull commented 6 months ago

Can you try setting gpus explicitly to 1 in the config? It looks like a multi-GPU communication problem.
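
For reference, here is a minimal sketch of what that could look like in a YAML config file for zamba train. The gpus and num_workers settings come from this thread; the other field names and the paths are assumptions based on zamba's documented training options, so check them against your installed version:

train_config:
  model_name: time_distributed
  data_dir: /path/to/videos      # placeholder
  labels: /path/to/train.csv     # placeholder
  gpus: 1
  num_workers: 1

Then run something like zamba train --config train_config.yaml (assuming your zamba version accepts a YAML file via --config).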

CrazyGeG commented 6 months ago

I changed to gpus 1 and number of workers 1, and got the following error. The model did start training, but it stopped in the middle:

Epoch 0:  31%|███       | 2235/7219 [55:56<2:04:45, 0.67it/s, v_num=9]
2024-03-01 20:36:23.651 | INFO | zamba.data.video:ensure_frame_number:113 - Duplicating last frame 4 times (original: 12, requested: 16).
Traceback (most recent call last):
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/multiprocessing/queues.py", line 108, in get
    raise Empty
Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/zamba/cli.py", line 182, in train
    manager.train()
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/zamba/models/model_manager.py", line 450, in train
    train_model(
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/zamba/models/model_manager.py", line 324, in train_model
    trainer.fit(model, data_module)
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1032, in _run_stage
    self.fit_loop.run()
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/network/rit/lab/miniconda3/envs/zamba/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
RuntimeError: DataLoader worker (pid(s) 1628939) exited unexpectedly

Epoch 0:  31%|███       | 2235/7219 [56:05<2:05:04, 0.66it/s, v_num=9]
slurmstepd: error: Detected 1 oom_kill event in StepId=7416657.batch. Some of the step tasks have been OOM Killed.

CrazyGeG commented 6 months ago

I have allocated more memory and will see if this issue goes away.