RuntimeError: DataLoader worker (pid 32978) is killed by signal: Aborted.

Aakash3101 commented 6 months ago

Hi, I am unable to run CCD; memory usage seems to spike immediately. I have reduced num_workers to 0, reduced the batch size to 8, and used fp16. The DataLoader seems to cause the problem only when running train.py; I was able to run test.py on the ARD model.

TongkunGuan commented 6 months ago

Can you describe your problem in detail, including hardware configuration and reported problem?

Aakash3101 commented 6 months ago

OS: Ubuntu 22.04.4 LTS x86_64
GPU: NVIDIA GeForce GTX 1050 Ti Mobile (CUDA: 12.2)
RAM: 16 GB

I followed the installation instructions, using torch==1.10.0+cu113 and the other corresponding dependencies. Running inference with the ARD model on my dataset works fine. But when I try to train a model using train.py, I get an error message saying that the DataLoader worker processes have been killed by a signal.

This is my CCD_pretrain_ViT_Base.yaml:

global:
  name: pre_base_65536
  phase: train
  stage: pretrain-vision
  workdir: workdir
  seed: ~

output_dir: './saved_models/'

dataset:
  scheme: selfsupervised_kmeans
  type: ST
  train: {
    roots: [
        '/home/aakash01/Desktop/parseq/results/train/real',
        # 'xxx/data_lmdb/training/label/Synth',
        # 'xxx/data_lmdb/training/URD/OCR-CC',
    ],
  }
  valid: {
    roots: [
        '/home/aakash01/Desktop/parseq/results/val',
        # 'xxx/data_lmdb/validation',
    ],
  }
  test: {
    roots: [
        '/home/aakash01/Desktop/parseq/results/test',
        # 'xxx/data_lmdb/evaluation/benchmark',
        # 'xxx/data_lmdb/evaluation/addition',
    ],
  }
  data_aug: True
  multiscales: False
  mask: False
  num_workers: 8
  augmentation_severity: 5
  charset_path: './Dino/data/charset_95.txt'
  mask_path: None #'xxx/data_lmdb/Mask'

training:
  epochs: 3
  start_iters: 0
  show_iters: 200
  eval_iters: 3000
  save_iters: 50000

model:
  name: 'Dino.model.dino_vision.ABIDINOModel'
  seg_channel: 512
  checkpoint: ~

mp:
  num: 1

arch: 'vit_base'
patch_size: 4
out_dim: 65536
#Not normalizing leads to better performance but can make the training unstable.
#In our experiments, we typically set this parameter to False with vit_small and True with vit_base.
norm_last_layer: True
#We recommend setting a higher value with small batches: for example use 0.9995 with batch size of 256.
momentum_teacher: 0.9995
#Initial value for the teacher temperature: 0.04 works well in most cases.
#Try decreasing it if the training loss does not decrease.
warmup_teacher_temp: 0.04
#We recommend starting with the default value of 0.04 and increase this slightly if needed.
teacher_temp: 0.04
#Number of warmup epochs for the teacher temperature (Default: 30).
warmup_teacher_temp_epochs: 0
batch_size_per_gpu: 8
#The learning rate is linearly scaled with the batch size, and specified here for a reference batch size of 256.
lr: 0.0005
#Clipping with norm .3 ~ 1.0 can help optimization for larger ViT architectures.
clip_grad: 3.0
use_bn_in_head: False
use_fp16: False
weight_decay: 0.04
weight_decay_end: 0.4
epochs: 100
freeze_last_layer: 1
warmup_epochs: 10
min_lr: 0.000001
optimizer: adamw
drop_path_rate: 0.1
global_crops_scale: (0.4, 1.)
local_crops_number: 8
crops_number: 2
local_crops_scale: (0.05, 0.4)
seed: 0
num_workers: 8
dist_url: "env://"
local_rank: 0
saveckp_freq: 10

warmup_epoch: 10
imgnet_based: 1000000
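
One detail in the config above may be worth checking: YAML has no `None` keyword, so `mask_path: None` is loaded as the string 'None', whereas `~` (as used for `seed` and `checkpoint`) is loaded as null. Whether CCD's loader special-cases that string is not shown in this thread, so the following is only a sketch of a defensive check. `normalize_optional_path` is a hypothetical helper, not part of CCD.

```python
from typing import Optional


def normalize_optional_path(value: object) -> Optional[str]:
    """Return None for values that mean 'not set', else the path as a string.

    Hypothetical helper: treats Python None, empty strings and the literal
    strings 'None', 'null' and '~' (any case, surrounding spaces ignored)
    as unset.
    """
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text.lower() in {"none", "null", "~"}:
        return None
    return text


print(normalize_optional_path("None"))                # None
print(normalize_optional_path("xxx/data_lmdb/Mask"))  # xxx/data_lmdb/Mask
```

With a check like this, code that opens the mask database could skip it explicitly when no mask path is configured, instead of failing later inside the per-sample read.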

I have disabled masks and set num_workers to 8. The error message that I receive is:

Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007fc3770c5740 (most recent call first):
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/_util.py", line 6 in is_path
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/ImageFile.py", line 103 in __init__
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/JpegImagePlugin.py", line 822 in jpeg_factory
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/Image.py", line 3263 in _open_core
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/Image.py", line 3277 in open
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 143 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
  File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
  ...
Traceback (most recent call last):
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21829) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 457, in <module>
    train(config)
  File "train.py", line 187, in train
    for (image_tensors, masks, metrics) in metric_logger.log_every(train_dataloader, 10, header):
  File "/home/aakash01/Desktop/CCD/Dino/modules/utils.py", line 388, in log_every
    for obj in iterable:
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1142, in _get_data
    success, data = self._try_get_data()
  File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 21829) exited unexpectedly

I have read about this error, and it seems to be related to extremely high memory usage. I am not sure why this is happening, though, because I have worked with LMDB datasets in DataLoaders with the PARSeq model. I have tried reducing the batch size to 1 and setting num_workers to 8, but the error still persists.

Also, can you explain what the 'mp' flag in the config file means? I want to try training the model on my local machine first and then submit a batch job on the HPC cluster.

TongkunGuan commented 5 months ago

File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 143 in get

I read the issue.

I think you should put a breakpoint on line 166 (/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py) to check whether there are errors in reading the data.
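
A breakpoint is one option. Another is to call the read path once per index outside the DataLoader and record the first real exception, since a catch-all `except` that retries would otherwise hide it. The sketch below is a minimal, hypothetical helper: `probe_reads` and `read_fn` are illustrative names, not part of CCD, and `read_fn` stands for whatever performs the raw read and image decode for one index.

```python
import traceback
from typing import Callable, Dict, Iterable


def probe_reads(read_fn: Callable[[int], object],
                indices: Iterable[int]) -> Dict[int, str]:
    """Call read_fn on each index and collect the traceback of every failure.

    read_fn is a hypothetical stand-in for the raw per-sample read
    (no catch-all except, no retry), so the original error stays visible.
    """
    failures: Dict[int, str] = {}
    for idx in indices:
        try:
            read_fn(idx)
        except Exception:
            # Keep the full traceback text instead of silently retrying.
            failures[idx] = traceback.format_exc()
    return failures


if __name__ == "__main__":
    def fake_read(idx: int) -> bytes:
        # Toy reader: odd indices simulate a missing mask record.
        if idx % 2:
            raise KeyError(f"mask-{idx + 1:09d}")
        return b"ok"

    for idx, tb in probe_reads(fake_read, range(4)).items():
        # Print only the final line of each traceback (exception type and message).
        print(idx, tb.splitlines()[-1])
```

Running something like this on the first few indices with num_workers set to 0 should show whether every sample fails for the same reason, which is what the endlessly alternating `get` / `_next_image` frames in the traceback would suggest.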

mp is a multi-process marker, which was later deprecated. You should ignore it.

yjjexcellent commented 5 months ago

I ran into the same problem, and we use almost the same yaml file. Have you solved this?

TongkunGuan commented 5 months ago

The issue comes from https://github.com/TongkunGuan/CCD/blob/0b6bdf9415a0d33d7e7a9adac21d9036d915709d/Dino/dataset/dataset.py#L133

    def get(self, idx):
        with self.env.begin(write=False) as txn:
            image_key, label_key = f'image-{idx + 1:09d}', f'label-{idx + 1:09d}'
            try:
                imgbuf = txn.get(image_key.encode())  # image
                buf = six.BytesIO()
                buf.write(imgbuf)
                buf.seek(0)
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore", UserWarning)  # EXIF warning from TiffPlugin
                    image = PIL.Image.open(buf).convert(self.convert_mode)
                with self.mask_env.begin(write=False) as mask_txn:
                    mask_key = f'mask-{idx + 1:09d}'
                    try:
                        maskbuf = mask_txn.get(mask_key.encode())  # image
                        mask_buf = six.BytesIO()
                        mask_buf.write(maskbuf)
                        mask_buf.seek(0)
                        mask = PIL.Image.open(mask_buf).convert('L')
                    except:
                        print(f"Corrupted image for {idx}")
                        mask = np.zeros((self.img_w, self.img_h))
                if self.is_training and not self._check_image(image):
                    return self._next_image()
            except:
                return self._next_image()
            return image, mask, idx

You should check that the data is read correctly before self._next_image() is called. Did you add the mask_env file?
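
If every read fails, a fallback that calls `get` again on each failure recurses until the interpreter aborts, which matches the alternating frames in the traceback above. A bounded retry loop that keeps the last exception is one possible alternative. The sketch below assumes `_next_image` simply picks another index and calls `get` again; `get_with_retries` and `load` are hypothetical names and this is not the repository's code.

```python
import random
from typing import Callable, Optional, Tuple


def get_with_retries(
    load: Callable[[int], object],
    idx: int,
    dataset_len: int,
    max_retries: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[int, object]:
    """Load sample idx; on failure, try other random indices a bounded number of times.

    load is a hypothetical per-sample reader that raises on bad data.
    """
    rng = rng or random.Random()
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):
        try:
            return idx, load(idx)
        except Exception as exc:
            # Remember why this sample failed, then pick another index.
            last_error = exc
            idx = rng.randrange(dataset_len)
    # Give up with a normal exception that carries the underlying cause,
    # instead of recursing until the stack overflows.
    raise RuntimeError(
        f"no readable sample after {max_retries + 1} attempts"
    ) from last_error
```

With this shape, a systematic problem (for example a mask database that cannot be opened) surfaces as an ordinary RuntimeError whose cause names the real error, while an occasional corrupted sample is still skipped.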

TongkunGuan / CCD

RuntimeError: DataLoader worker (pid 32978) is killed by signal: Aborted. #11