TIGER-AI-Lab / UniIR

Official code for paper "UniIR: Training and Benchmarking Universal Multimodal Information Retrievers" (ECCV 2024)
https://tiger-ai-lab.github.io/UniIR/
MIT License

How to quickly extract the dataset #20

Open Raion-Shin opened 1 month ago

Raion-Shin commented 1 month ago

I downloaded the .tar.gz file from https://huggingface.co/datasets/TIGER-Lab/M-BEIR, but it's really large, and the pv command estimates that extracting it will take 2.5 days! Could you provide smaller archives, with each dataset packaged into its own zip file? Thanks very much!

nrdyava commented 1 week ago

After downloading the .tar.gz part files, combine them into a single file with the following command: sh -c 'cat mbeir_images.tar.gz.part-00 mbeir_images.tar.gz.part-01 mbeir_images.tar.gz.part-02 mbeir_images.tar.gz.part-03 > mbeir_images.tar.gz'

Next, extract the images from the combined file: tar -xzf mbeir_images.tar.gz

It will not take 2.5 days. I was able to complete the whole process in just 10 hours.
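
A possible variation on the two commands above, as a rough sketch rather than anything from the UniIR repository: the parts can be piped straight into tar, so the combined mbeir_images.tar.gz never has to be written to disk. The helper name extract_parts and its two arguments are hypothetical, introduced only for illustration. This saves the disk space and the extra write pass for the combined file; whether it shortens the total time will depend on your disk and CPU.

```shell
#!/bin/sh
# Sketch: stream split archive parts directly into tar, skipping the
# intermediate combined file. extract_parts is a hypothetical helper name.
extract_parts() {
    prefix="$1"   # common prefix of the parts, e.g. mbeir_images.tar.gz.part-
    dest="$2"     # directory to extract into

    mkdir -p "$dest" || return 1

    # The shell expands the glob in sorted order, so part-00 is read
    # before part-01, and so on. "tar -xzf -" reads the archive from stdin.
    cat "$prefix"* | tar -xzf - -C "$dest"
}

# Usage (assumes the part files are in the current directory):
#   extract_parts mbeir_images.tar.gz.part- .
```

Because the parts are plain byte slices of one gzip stream, the concatenation order matters; the zero-padded suffixes (part-00 to part-03) keep the glob order correct.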