IDEA-FinAI / RagVL

Official PyTorch Implementation of MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training.
MIT License

Regarding the Download of the Dataset #16

Closed Aeryn666 closed 1 month ago

Aeryn666 commented 1 month ago

Could you please clarify where train_img and val_img come from? Are they from web_qa? However, web_qa unzips to a TSV file.

SakuraTroyChen commented 1 month ago

Yes, train_img and val_image are from WebQA. You can download the images by running the scripts.

Aeryn666 commented 1 month ago

After I downloaded and unzipped it, I got a TSV file. Could you share how you process the TSV file and split it into train images and val images? Thanks 🥺🥺🥺

SakuraTroyChen commented 1 month ago

Follow the instructions from WebQA: to unzip and merge all chunks, run 7z x imgs.7z.001. You need to download all 51 files before running this.

If the script fails, you can download WebQA_imgs_7z_chunks from Google Drive.
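Since extraction needs every chunk to be present, a quick check before running 7z can save a failed attempt. The following is only a minimal sketch: it assumes the chunks are named imgs.7z.001 through imgs.7z.051 in sequence (only the first name and the count of 51 are stated above), and `missing_chunks` is a hypothetical helper, not part of WebQA or RagVL.

```python
from pathlib import Path


def missing_chunks(directory, total=51, prefix="imgs.7z"):
    """Return the names of expected chunk files that are not in `directory`.

    Assumption: chunks are numbered 001..total with a three-digit suffix.
    """
    expected = [f"{prefix}.{i:03d}" for i in range(1, total + 1)]
    return [name for name in expected if not (Path(directory) / name).is_file()]


if __name__ == "__main__":
    missing = missing_chunks(".")
    if missing:
        print(f"{len(missing)} chunk(s) missing, e.g. {missing[0]}")
    else:
        print("All chunks present; run: 7z x imgs.7z.001")
```

Run it from the directory that holds the downloaded chunks; if the naming on your side differs, adjust `prefix` and `total` accordingly.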

Tree-Shu-Zhao commented 1 month ago

@SakuraTroyChen In the "Data Preparation" section, you mentioned that "Place the MMQA_imgs/ and train_img/ into RagVL/finetune/tasks/" and "Place the val_image/ into RagVL/datasets/" in steps 5 and 6. But after unzipping the .7z data files, the output is a .tsv file. The script you mentioned (https://github.com/WebQnA/WebQA/blob/main/download_imgs.sh) to download the data files seems to just download files without any further processing steps. Could you please tell me how to get "train_img" and "val_image" folders?

@Aeryn666 Have you figured out how to get these folders?

SakuraTroyChen commented 1 month ago

The script appears to be invalid. We recommend downloading the images directly from Google Drive.

Tree-Shu-Zhao commented 1 month ago

Thanks for your prompt reply. The files at this Google Drive link still do not contain the "train_img" and "val_image" folders. As I mentioned above, unzipping "WebQA_imgs_7z_chunks" yields "imgs.tsv", and unzipping "WebQA_data_first_release.7z" yields "WebQA_test.json" and "WebQA_val.json". The last file, "imgs.lineidx", does not contain the two folders either.

Aeryn666 commented 1 month ago

Oh, I had the same question too, and I've resolved it now. After downloading and unzipping from Google Drive, getting "imgs.tsv" is correct. The images in this file are base64-encoded, so you need to decode them. After decoding, you will get a sequence of image ids paired with the images themselves. You can then distinguish between the training dataset and the test dataset based on the image ids provided in "WebQA_test.json" and "WebQA_val.json". This is my solution, for reference only.
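To make the decoding step concrete, here is a minimal sketch of handling a single row. It assumes each row of "imgs.tsv" is an image id, a tab, and the base64 payload; `decode_row` is a hypothetical helper, and the row below is synthetic, not real WebQA data.

```python
import base64


def decode_row(line):
    """Split one TSV row into (image_id, raw image bytes)."""
    # Split on the first tab only, so the payload is kept whole.
    image_id, encoded = line.rstrip("\n").split("\t", 1)
    return image_id, base64.b64decode(encoded)


# Synthetic row for illustration only; real rows hold a full image file.
row = "30000001\t" + base64.b64encode(b"fake image bytes").decode("ascii") + "\n"
image_id, data = decode_row(row)
print(image_id, len(data))
```

The decoded bytes are the image file itself, so they can be passed to an image library or written to disk as they are.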

Tree-Shu-Zhao commented 2 weeks ago

@Aeryn666 Thanks for your help, I successfully extracted the images. I'm leaving the extraction code here for anyone who may need it.

import argparse
import base64
import io
import json
import os

from PIL import Image
from tqdm import tqdm

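def decode_imgs_tsv(file_path):
    # Each row of imgs.tsv is "<image_id>\t<base64-encoded image bytes>".
    images = {}
    with open(file_path, 'r') as f:
        # Iterate over the file directly instead of loading all lines first.
        for line in tqdm(f, desc="Decoding images"):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            image_id, encoded_image = line.split('\t', 1)
            images[image_id] = base64.b64decode(encoded_image)

    return images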

def load_json_file(file_path):
    with open(file_path, 'r') as f:
        return json.load(f)

def save_image(image_data, output_path):
    image = Image.open(io.BytesIO(image_data))
    image.convert('RGB').save(output_path)

def main():
    args = parse_arguments()

    # Create output directories
    os.makedirs(os.path.join(args.output_dir, 'train'), exist_ok=True)
    os.makedirs(os.path.join(args.output_dir, 'val'), exist_ok=True)

    # Decode images
    images = decode_imgs_tsv(args.tsv_path)

    # Load JSON files
    data = load_json_file(args.data_json_path)

    # Get image IDs and save images
    for guid in tqdm(data, desc="Saving images"):
        split = data[guid]["split"]
        assert split in ["train", "val"]
        img_infos = data[guid]["img_posFacts"] + data[guid]["img_negFacts"]
        for item in img_infos:
            image_id = str(item["image_id"])
            output_path = os.path.join(args.output_dir, f"{split}", f"{image_id}.png")
            if not os.path.exists(output_path):
                try:
                    save_image(images[image_id], output_path)
                except Exception as e:
                    # e.g. the id is missing from the TSV or the bytes are not a valid image
                    print(f"Failed to save image {image_id}: {e}")

    print("Data preparation completed!")

def parse_arguments():
    parser = argparse.ArgumentParser(description="Extract images from TSV file and save them")
    parser.add_argument("--tsv_path", type=str, required=True, help="Path to the TSV file")
    parser.add_argument("--data_json_path", type=str, required=True, help="Path to WebQA_train_val.json")
    parser.add_argument("--output_dir", type=str, required=True, help="Output directory to save images")
    return parser.parse_args()

if __name__ == "__main__":
    main()
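
A note on the script above: it holds every decoded image in memory before saving, which may be a problem for a large "imgs.tsv". The following is a sketch of a streaming variant, not a replacement for the script. It reads the same JSON fields ("split", "img_posFacts", "img_negFacts", "image_id"), but writes the decoded bytes unchanged rather than converting them to PNG, so the ".img" suffix is only a placeholder. `collect_splits` and `extract_streaming` are hypothetical names.

```python
import base64
import os


def collect_splits(data):
    """Map image_id (as str) -> set of split names that reference it."""
    wanted = {}
    for entry in data.values():
        for item in entry["img_posFacts"] + entry["img_negFacts"]:
            wanted.setdefault(str(item["image_id"]), set()).add(entry["split"])
    return wanted


def extract_streaming(tsv_path, data, output_dir):
    """Write each referenced image as its row is read; return files written."""
    wanted = collect_splits(data)
    written = 0
    with open(tsv_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            image_id, encoded = line.split("\t", 1)
            splits = wanted.get(image_id)
            if not splits:
                continue  # image is not referenced by the JSON file
            image_bytes = base64.b64decode(encoded)
            for split in sorted(splits):
                split_dir = os.path.join(output_dir, split)
                os.makedirs(split_dir, exist_ok=True)
                # Raw bytes are written as-is; no PNG conversion is done here.
                with open(os.path.join(split_dir, f"{image_id}.img"), "wb") as out:
                    out.write(image_bytes)
                written += 1
    return written
```

Here `data` is the dictionary obtained by loading the JSON file, as `load_json_file` does in the script above. If PNG output is required, the `save_image` function from that script can take the place of the raw write.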