bytedance / coconut_cvpr2024

Apache License 2.0

Missing annotations in Coconut L #21

Closed spacycoder closed 3 months ago

spacycoder commented 3 months ago

There seem to be some files missing from the COCONut Large dataset. Hugging Face shows ~31 000 files, but according to your table there should be roughly 116 000, I think? Some images in the .tar file also appear to be symlinks pointing into your file system, e.g. ...object365_train_panseg_copy/objects365_v2_01860597.png ->../bytenas-lq-dxq/zipfile_panseg_copy/4/objects365_v2_01860597.png

xdeng7 commented 3 months ago

Thank you, I updated the files. Please check.

spacycoder commented 2 months ago

Thanks! But I still think there are some issues. Now the object365_train_panseg.tar contains ~166 478 files but I can only open ~63 000 of them, the rest are just symlinks. Should I just ignore the rest?
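
For anyone who wants to reproduce these counts without extracting the archive, here is a minimal sketch using Python's standard `tarfile` module. The helper name `count_member_types` is hypothetical, and the way entries are bucketed is my own choice, not something the dataset authors provide.

```python
import tarfile
from collections import Counter


def count_member_types(tar_path):
    """Count regular files, symlinks, directories, and other entries in a tar."""
    counts = Counter()
    with tarfile.open(tar_path, "r") as tar:
        # Iterating the TarFile yields one TarInfo per archive member.
        for member in tar:
            if member.isfile():
                counts["regular"] += 1
            elif member.issym():
                counts["symlink"] += 1
            elif member.isdir():
                counts["directory"] += 1
            else:
                counts["other"] += 1
    return counts
```

Calling `count_member_types("object365_train_panseg.tar")` should show how many entries are real images and how many are only links, assuming the archive sits in the current directory.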

xdeng7 commented 2 months ago

Could you please provide a few image IDs for me to check? I should copy all the panseg data into one folder.

spacycoder commented 2 months ago

Here are a couple of files:

[ WARN:0@158.066] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v2_01892927.png'): can't open/read file: check file path/integrity
[ WARN:0@158.066] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v2_01693633.png'): can't open/read file: check file path/integrity
[ WARN:0@158.066] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v2_01872366.png'): can't open/read file: check file path/integrity
[ WARN:0@158.066] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v1_00331480.png'): can't open/read file: check file path/integrity
[ WARN:0@158.066] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v2_01685998.png'): can't open/read file: check file path/integrity
[ WARN:0@158.066] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v1_00334198.png'): can't open/read file: check file path/integrity
[ WARN:0@158.066] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v1_00360979.png'): can't open/read file: check file path/integrity
[ WARN:0@158.072] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v2_01865393.png'): can't open/read file: check file path/integrity
[ WARN:0@158.072] global loadsave.cpp:241 findDecoder imread_('object365_train_panseg/objects365_v2_01863335.png'): can't open/read file: check file path/integrity

This shows one of the missing files has type "symbolic link":

stat object365_train_panseg/objects365_v2_01892927.png
  File: object365_train_panseg/objects365_v2_01892927.png -> /mnt/bn/bytenas-lq-dxq/zipfile_panseg_copy/6/objects365_v2_01892927.png
  Size: 71              Blocks: 8          IO Block: 4096   symbolic link
Device: fc01h/64513d    Inode: 113785742   Links: 1

This shows one of the files that works and has type "regular file":

stat object365_train_panseg/objects365_v1_00581444.png
  File: object365_train_panseg/objects365_v1_00581444.png
  Size: 3371            Blocks: 8          IO Block: 4096   regular file
Device: fc01h/64513d    Inode: 113553820   Links: 1
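
The same check as the `stat` calls above can be run over the whole extracted folder. This is a rough sketch, assuming the archive has already been extracted; `find_broken_symlinks` is a hypothetical helper name, not part of any dataset tooling.

```python
import os


def find_broken_symlinks(folder):
    """Return sorted (path, target) pairs for symlinks whose target is missing."""
    broken = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # os.path.exists follows symlinks, so it is False for a dangling link.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append((path, os.readlink(path)))
    return sorted(broken)
```

Running `find_broken_symlinks("object365_train_panseg")` lists each unreadable link together with the path it points to, which is the kind of image ID list requested earlier in the thread.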

spacycoder commented 2 months ago

Also, I think relabeled-coco and coconut-val on Hugging Face are the same. They both contain ~5000 files.

xdeng7 commented 2 months ago

Sorry, I uploaded the wrong folder, because in my dataset they all share the same name. It is now fixed for coconut_val. For Large, it is strange: we annotated an extra 6k images and merged them together. You can ignore these 6k, but I will fix it soon. Thanks for raising the issue.

spacycoder commented 2 months ago

The coconut_val dataset contains symlinks. Maybe you should add the "--dereference" option when you tarball the folder?
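
If the archive is built from Python rather than the `tar` command line, the `tarfile` module has an equivalent `dereference` argument. A minimal sketch, assuming the folder's symlinks all resolve on the machine doing the packaging; `make_dereferenced_tar` is a hypothetical helper name.

```python
import os
import tarfile


def make_dereferenced_tar(folder, tar_path):
    """Archive `folder`, storing the content of symlink targets instead of links."""
    # dereference=True makes tarfile add the target file's content,
    # which is the behaviour the --dereference option gives on the command line.
    with tarfile.open(tar_path, "w", dereference=True) as tar:
        tar.add(folder, arcname=os.path.basename(os.path.normpath(folder)))
```

A link whose target is missing cannot be read and will raise an error at this point, so any dangling links surface at packaging time instead of after upload.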

xdeng7 commented 2 months ago

Thanks for the issue, updated.

spacycoder commented 2 months ago

Great, thanks!

tommiekerssies commented 2 months ago

In the Hugging Face coconut_l there are a lot of symlinks; in the Kaggle coconut_l the majority of the files are 0 bytes.
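
To put a number on the 0-byte files, something like the following sketch could be used on an extracted copy. `find_empty_files` is a hypothetical helper name, and the folder layout is assumed rather than taken from the Kaggle release.

```python
import os
import stat


def find_empty_files(folder):
    """Return sorted paths of regular (non-symlink) files that are 0 bytes long."""
    empty = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # lstat does not follow symlinks, so links are not counted here.
            info = os.lstat(path)
            if stat.S_ISREG(info.st_mode) and info.st_size == 0:
                empty.append(path)
    return sorted(empty)
```

Comparing `len(find_empty_files(folder))` against the total file count gives the share of empty annotations.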