Imageomics / bioclip

This is the repository for the BioCLIP model and the TreeOfLife-10M dataset [CVPR'24 Oral, Best Student Paper].
https://imageomics.github.io/bioclip/

Got SyntaxError: not a TIFF file when loading the rare species dataset #21

Closed Yuyan-C closed 4 months ago

Yuyan-C commented 4 months ago

Hi,

Thanks for curating the rare species dataset! I'm trying to load this dataset with a PyTorch DataLoader and got the following error:

SyntaxError: not a TIFF file (header b'IIU\x00\x18\x00\x00\x00' not valid)
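
For context, this message indicates that the first bytes of the file do not match a TIFF magic number. The following is a minimal standard-library sketch of that kind of check; the helper name `looks_like_tiff` is hypothetical and is not part of Pillow, whose own check may accept additional variants such as BigTIFF.

```python
# Classic TIFF files start with a byte-order mark followed by the number 42:
# b"II*\x00" (little-endian) or b"MM\x00*" (big-endian).
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(header: bytes) -> bool:
    """Return True if the header starts with a classic TIFF magic number.

    Hypothetical helper for illustration only.
    """
    return header[:4] in TIFF_MAGICS


# Header bytes copied from the error message above.
reported = b"IIU\x00\x18\x00\x00\x00"

print(looks_like_tiff(reported))                    # False
print(looks_like_tiff(b"II*\x00\x08\x00\x00\x00"))  # True
```

The reported header has the little-endian byte-order mark `II`, but its third byte is `U` (0x55) rather than `*` (0x2A), so it fails the check.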

I downloaded the dataset from Hugging Face with the following code:

from datasets import load_dataset
ds = load_dataset("imageomics/rare-species")

and used PyTorch Dataset and Dataloader to load it:

import torch
from torch.utils.data import DataLoader

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Return only the ID to keep debugging simple.
        return self.dataset[idx]["rarespecies_id"]

# `ds` is the object returned by load_dataset above;
# `transform` is defined elsewhere (not shown).
custom_dataset = CustomDataset(ds["train"], transform=transform)

dataloader = DataLoader(custom_dataset, batch_size=32, shuffle=False)

for i, data in enumerate(dataloader):
    print(data)

CustomDataset returns only rarespecies_id to keep debugging simple. The IDs were printed for the first 66 batches, but I then got the SyntaxError

SyntaxError: not a TIFF file (header b'IIU\x00\x18\x00\x00\x00' not valid)

when loading the 67th batch.
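
One way to narrow this down, sketched under stated assumptions: with batch_size=32 and shuffle=False, the 67th batch covers 0-based indices 2112 to 2143, so indexing those rows one at a time should show which one raises. The helpers `batch_index_range` and `find_failing_indices` are hypothetical names introduced for illustration; they work on any object that supports integer indexing.

```python
def batch_index_range(batch_number: int, batch_size: int) -> range:
    """0-based sample indices covered by a 1-based batch number (no shuffling)."""
    start = (batch_number - 1) * batch_size
    return range(start, start + batch_size)


def find_failing_indices(dataset, indices):
    """Index each row individually and collect (index, error) for rows that raise."""
    failures = []
    for idx in indices:
        try:
            dataset[idx]
        except Exception as exc:  # SyntaxError from image decoding lands here too
            failures.append((idx, repr(exc)))
    return failures


rows = batch_index_range(67, 32)
print(rows.start, rows.stop - 1)  # 2112 2143
```

Assuming the split is named "train" as in the code above, a call such as `find_failing_indices(ds["train"], batch_index_range(67, 32))` would list the rows in that batch that cannot be read. Indexing a row appears to decode its image as well, which would explain why the error occurs even though only the ID is returned.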

I also checked the Dataset Viewer on Hugging Face and got the following error on page 22, where the corrupted file should be located:

[Screenshot: Dataset Viewer error on page 22]

Meanwhile, I can view pages 21 and 23:

[Screenshots: Dataset Viewer pages 21 and 23]

egrace479 commented 4 months ago

Thanks for letting us know! I've opened an issue (discussion 8) on the Hugging Face repo to address this. Please feel free to comment and follow the discussion there.

Yuyan-C commented 4 months ago

Thank you so much!