LTH14 / rcg

PyTorch implementation of RCG https://arxiv.org/abs/2312.03701
MIT License

Decompressed Data Too Large #22

Open ai-agi opened 8 months ago

ai-agi commented 8 months ago

Hi Tianhong. Training with main_rdm.py always fails, regardless of the batch size (I tried values from 32 to 512). What is the problem, and how can I solve it? The crash output follows:

  File "main_rdm.py", line 186, in main
    train_stats = train_one_epoch(
  File "/home/fengjiw/project/rcg/engine_rdm.py", line 29, in train_one_epoch
    for data_iter_step, (samples, class_label) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/home/fengjiw/project/rcg/util/misc.py", line 134, in log_every
    for obj in iterable:
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 230, in __getitem__
    sample = self.loader(path)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 269, in default_loader
    return pil_loader(path)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 248, in pil_loader
    img = Image.open(f)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/PIL/Image.py", line 3172, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/PIL/Image.py", line 3158, in _open_core
    im = factory(fp, filename)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/PIL/ImageFile.py", line 116, in __init__
    self._open()
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 734, in _open
    s = self.png.call(cid, pos, length)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 202, in call
    return getattr(self, "chunk_" + cid.decode("ascii"))(pos, length)
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 412, in chunk_iCCP
    icc_profile = _safe_zlib_decompress(s[i + 2 :])
  File "/home/fengjiw/miniconda3/envs/rcg/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 148, in _safe_zlib_decompress
    raise ValueError("Decompressed Data Too Large")

LTH14 commented 8 months ago

Are you using a custom dataset? The code should work fine with the ImageNet dataset. Maybe try making each of your own images smaller?
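The traceback in the original report ends in chunk_iCCP and _safe_zlib_decompress inside Pillow's PNG plugin, which suggests that individual PNG files (specifically their embedded ICC profiles) trigger the error, independent of batch size. A minimal sketch for locating such files before training, assuming the dataset is a directory tree of image files. The names find_unreadable and pil_open are hypothetical helpers, not part of RCG or torchvision; pil_open only imitates what torchvision's pil_loader does (open, then convert to RGB).

```python
import os
from typing import Callable, List, Tuple

# Assumption: these are the extensions used in the custom dataset.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp", ".webp")


def pil_open(path: str) -> None:
    """Open and decode one image with Pillow, raising if it cannot be read."""
    # Imported lazily so the directory scan itself has no hard Pillow dependency.
    from PIL import Image

    with Image.open(path) as img:
        img.convert("RGB")


def find_unreadable(
    root: str, opener: Callable[[str], None] = pil_open
) -> List[Tuple[str, str]]:
    """Walk `root` and return (path, error) pairs for images the opener rejects."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(IMAGE_EXTENSIONS):
                continue  # skip non-image files
            path = os.path.join(dirpath, name)
            try:
                opener(path)
            except Exception as exc:  # report every failure instead of stopping
                bad.append((path, f"{type(exc).__name__}: {exc}"))
    return bad
```

Calling find_unreadable on the training directory and printing the result should list the offending files, which can then be re-saved (for example without the embedded ICC profile) or excluded from the dataset.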

gzhuinjune commented 4 months ago

I ran into a similar problem. I am using a custom dataset, but I still placed it under the ImageNet train directory. Could you take a look?

Traceback (most recent call last):
  File "main_mage.py", line 296, in <module>
    main(args)
  File "main_mage.py", line 269, in main
    gen_img(model, args, epoch, batch_size=16, log_writer=log_writer, cfg=0)
  File "/home/user/sdb2/rcg-main/engine_mage.py", line 124, in gen_img
    metrics_dict = torch_fidelity.calculate_metrics(
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/metrics.py", line 341, in calculate_metrics
    return calculate_metrics_one_feature_extractor(**kwargs)
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/metrics.py", line 80, in calculate_metrics_one_feature_extractor
    featuresdict_2 = extract_featuresdict_from_input_id_cached(2, feat_extractor, **kwargs)
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/utils.py", line 424, in extract_featuresdict_from_input_id_cached
    featuresdict = fn_recompute()
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/utils.py", line 412, in fn_recompute
    return extract_featuresdict_from_input_id(input_id, feat_extractor, **kwargs)
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/utils.py", line 394, in extract_featuresdict_from_input_id
    input = prepare_input_from_id(input_id, **kwargs)
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/utils.py", line 317, in prepare_input_from_id
    return prepare_input_from_descriptor(input_desc, **kwargs)
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/utils.py", line 293, in prepare_input_from_descriptor
    vassert(
  File "/home/user/sdb2/rcg-main/src/torch-fidelity/torch_fidelity/helpers.py", line 13, in vassert
    raise ValueError(message)
ValueError: Input descriptor "input" field can be either an instance of Dataset, GenerativeModelBase class, or a string, such as a path to a name of a registered dataset (cifar10-train, cifar10-val, cifar100-train, cifar100-val, stl10-train, stl10-test, stl10-unlabeled), a directory with file samples, or a path to an ONNX or PTH (JIT) module
[2024-04-05 01:20:50,214] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2369647) of binary: /home/user/anaconda3/envs/rcg/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/rcg/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/rcg/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/user/anaconda3/envs/rcg/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/user/anaconda3/envs/rcg/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/rcg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/rcg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_mage.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-04-05_01:20:50
  host      : SYS-740GP-TNRT
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2369647)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

LTH14 commented 4 months ago

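Hi, this problem is probably because torch-fidelity may not be able to handle custom data. You may need to replace torch_fidelity.calculate_metrics here with a metric computed on your own data.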

gzhuinjune commented 4 months ago

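Thanks! Is it enough to just change this path? This path is my val directory. It contains one class folder called building (my task has only one class), and my images are inside that folder. [image]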

LTH14 commented 4 months ago

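In that case, just set input2 to /home/user/sdb2/rcg-main/data/imagenet/val/building. The input2 argument here needs to be a folder that contains nothing but images. One more thing to note: to evaluate FID on your own dataset, please do not use the modified torch-fidelity; install the original version directly with pip install torch-fidelity.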
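A minimal sketch of that evaluation with the stock torch-fidelity package, assuming both the generated samples and the reference images sit directly inside their folders. The helper names list_images and evaluate_fid and the three listed extensions are illustrative assumptions; only torch_fidelity.calculate_metrics and its input1, input2, cuda, and fid keyword arguments come from the library itself.

```python
import os

# Assumption: the dataset uses these image extensions.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def list_images(folder):
    """Return the image files that sit directly inside `folder` (no recursion)."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and name.lower().endswith(IMAGE_EXTENSIONS)
    )


def evaluate_fid(generated_dir, reference_dir, use_cuda=True):
    """Compute FID between two image folders with unmodified torch-fidelity."""
    # Fail early with a clear message if a folder has no top-level images,
    # e.g. when it holds class subfolders instead of the images themselves.
    for folder in (generated_dir, reference_dir):
        if not list_images(folder):
            raise ValueError(f"{folder} contains no images at its top level")

    # Lazy import: requires the original package (`pip install torch-fidelity`).
    import torch_fidelity

    return torch_fidelity.calculate_metrics(
        input1=generated_dir,   # folder of generated samples
        input2=reference_dir,   # folder of real images
        cuda=use_cuda,
        fid=True,
    )
```

With the path from this thread, reference_dir would be /home/user/sdb2/rcg-main/data/imagenet/val/building; the folder of generated images depends on where gen_img writes its samples.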

gzhuinjune commented 4 months ago

Thanks a lot, and good luck with your research.

gzhuinjune commented 3 months ago

I am now using my own dataset, so I should also change class_num=1000 in engine_mage accordingly, right? Is there anything else I need to change? Also, does "a folder that contains nothing but images" mean a folder with the images of all my classes mixed together, or a folder with the images of just one class? Thanks!

LTH14 commented 3 months ago

The one with all images mixed together.
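For a dataset stored as one subfolder per class, a single mixed folder can be built by copying every image into one flat directory. This is only a sketch under that assumption: flatten_images is a hypothetical helper, not part of RCG or torch-fidelity, and the class-name prefix is just one way to avoid filename collisions between classes.

```python
import os
import shutil


def flatten_images(src_root, dst_dir, extensions=(".png", ".jpg", ".jpeg")):
    """Copy every image under `src_root` into the single flat folder `dst_dir`.

    Returns the number of files copied.
    """
    os.makedirs(dst_dir, exist_ok=True)
    dst_abs = os.path.abspath(dst_dir)
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        if os.path.abspath(dirpath) == dst_abs:
            continue  # do not re-copy files if dst_dir lives inside src_root
        # Prefix each file with its relative folder so that equal filenames
        # from different classes do not overwrite each other.
        rel = os.path.relpath(dirpath, src_root)
        prefix = "" if rel == "." else rel.replace(os.sep, "_") + "_"
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip non-image files
            shutil.copy2(
                os.path.join(dirpath, name),
                os.path.join(dst_dir, prefix + name),
            )
            copied += 1
    return copied
```

The resulting dst_dir can then be passed as input2 when computing FID on the custom dataset.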