Closed grewanhassan closed 2 years ago
CUDA_VISIBLE_DEVICES=0,1 python train_refine.py
--dataset-name videomatte240k
--model-backbone resnet50
--model-name mattingrefine-resnet50-videomatte240k
--model-last-checkpoint "~/DATASET/input/BMV2/Model_Weights/PyTorch/pytorch_resnet50.pth"
--epoch-end 1
DATA_PATH = {
'videomatte240k': {
'train': {
'fgr': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/train/fgr',
'pha': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/train/pha'
},
'valid': {
'fgr': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/test/fgr',
'pha': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/test/pha'
}
},
}
Loaded state_dict: 394/394 matched
0%| | 0/59497 [00:00<?, ?it/s]Loaded state_dict: 394/394 matched
0%| | 0/59497 [00:09<?, ?it/s]
Traceback (most recent call last):
File "train_refine.py", line 309, in <module>
join=True)
File "~/.venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "~/.venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "~/.venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "~/.venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "~/BMV2/train_refine.py", line 169, in train_worker
for i, ((true_pha, true_fgr), true_bgr) in enumerate(tqdm(dataloader_train)):
File "~/.venv/lib/python3.6/site-packages/tqdm/std.py", line 1171, in __iter__
for obj in iterable:
File "~/.venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "~/.venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "~/.venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "~/.venv/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "~/.venv/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "~/.venv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "~/.venv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "~/.venv/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 272, in __getitem__
return self.dataset[self.indices[idx]]
File "~/BMV2_/dataset/zip.py", line 17, in __getitem__
x = tuple(d[(idx % len(d))+1] for d in self.datasets)
File "~/BMV2_/dataset/zip.py", line 17, in <genexpr>
x = tuple(d[(idx % len(d))+1] for d in self.datasets)
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
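The `ZeroDivisionError` comes from `dataset/zip.py`, where each sub-dataset is indexed with `idx % len(d)`: if one of the zipped sub-datasets is empty (here, the backgrounds, whose path was never defined), `len(d)` is 0 and the modulo divides by zero. A minimal sketch of that failure mode (a simplified stand-in for the repo's `ZipDataset`, not the actual implementation):

```python
# Simplified stand-in for dataset/zip.py's ZipDataset: zips several
# sub-datasets, indexing each one modulo its own length so shorter
# datasets repeat. An empty sub-dataset makes len(d) == 0 and the
# modulo raises ZeroDivisionError in the DataLoader worker.

class ZipDataset:
    def __init__(self, *datasets):
        self.datasets = datasets

    def __len__(self):
        # Length of the zipped dataset is that of the longest member.
        return max(len(d) for d in self.datasets)

    def __getitem__(self, idx):
        # Fails with ZeroDivisionError if any sub-dataset is empty.
        return tuple(d[idx % len(d)] for d in self.datasets)


fgr_pha = ["frame_0", "frame_1"]  # foreground/alpha samples (non-empty)
bgr = []                          # backgrounds path missing -> empty list

ds = ZipDataset(fgr_pha, bgr)
try:
    ds[0]
except ZeroDivisionError as e:
    print("ZeroDivisionError:", e)  # integer division or modulo by zero
```

The trailing `_pickle.UnpicklingError: pickle data was truncated` is just collateral damage: the worker process died mid-transfer, truncating the pickled batch.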
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Stepping: 2
CPU MHz: 2360.725
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 6996.13
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
The issue was that the background path wasn't defined in data_path.py.
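Based on that resolution, a hedged sketch of the fix: the `DATA_PATH` dict in `data_path.py` needs a `'backgrounds'` entry alongside the dataset entry, pointing at the background image directories. The key layout follows the `DATA_PATH` snippet above; the background paths themselves are placeholders, not the asker's actual paths.

```python
# data_path.py sketch: the 'backgrounds' entry was missing, leaving the
# background dataset empty and triggering the ZeroDivisionError in
# dataset/zip.py. The background paths below are illustrative placeholders.
DATA_PATH = {
    'videomatte240k': {
        'train': {
            'fgr': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/train/fgr',
            'pha': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/train/pha'
        },
        'valid': {
            'fgr': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/test/fgr',
            'pha': '/home/grewan/DATASET/input/BMV2/VideoMatte240K_JPEG_SD/test/pha'
        }
    },
    'backgrounds': {
        'train': '/path/to/backgrounds/train',  # placeholder path
        'valid': '/path/to/backgrounds/valid'   # placeholder path
    },
}
```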
Hello, can you run this train_refine.py normally? I found that it gets stuck after training for a while, with Volatile GPU-Util at 100%. I have to kill the process manually; otherwise the resources stay occupied.
@davislee546 did you solve this issue?
@grewanhassan
"~/DATASET/input/BMV2/Model_Weights/PyTorch/pytorch_resnet50.pth"
Are these weights from stage one, or from somewhere else?
@grewanhassan how did you use the checkpoints from stage 1 in stage 2? Can you provide the full command to run stage 2?
Hi,
first of all, thanks for your great work! I'm trying to train the refine stage, but I get a weird error. I've tried everything, but nothing helps. Maybe you have some ideas!