Closed Aeryn666 closed 1 month ago
Yes, train_img and val_image are from WebQA. You can download the images by running the scripts.
After I downloaded and unzipped it, I got a TSV file. Could you share how you process the TSV file and split it into train images and val images? Thanks 🥺🥺🥺
Follow the instructions from WebQA: to unzip and merge all chunks, run 7z x imgs.7z.001. You need to download all 51 files before running this.
If the script fails, you can download WebQA_imgs_7z_chunks from the Google Drive.
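Since extraction needs every chunk to be present, a quick check before running 7z can save a failed attempt. The snippet below is only a sketch: it assumes the chunks are named imgs.7z.001 through imgs.7z.051 (extrapolated from the first file name and the count of 51 above), and missing_chunks is a hypothetical helper, not part of the WebQA scripts.

```python
from pathlib import Path


def missing_chunks(directory, total=51, prefix="imgs.7z."):
    """Return the expected chunk file names that are absent from `directory`.

    Assumption: chunks are numbered with three digits, e.g. imgs.7z.001.
    """
    expected = [f"{prefix}{i:03d}" for i in range(1, total + 1)]
    return [name for name in expected if not (Path(directory) / name).is_file()]


if __name__ == "__main__":
    missing = missing_chunks(".")
    if missing:
        print(f"{len(missing)} chunk(s) missing, e.g. {missing[0]}")
    else:
        print("All chunks present; run: 7z x imgs.7z.001")
```

If the list comes back empty, the 7z command above should be able to merge and unpack the archive.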
@SakuraTroyChen In the "Data Preparation" section, you mentioned that "Place the MMQA_imgs/ and train_img/ into RagVL/finetune/tasks/" and "Place the val_image/ into RagVL/datasets/" in steps 5 and 6. But after unzipping the .7z data files, the output is a .tsv file. The script you mentioned (https://github.com/WebQnA/WebQA/blob/main/download_imgs.sh) to download the data files seems to just download files without any further processing steps. Could you please tell me how to get "train_img" and "val_image" folders?
@Aeryn666 Have you figured out how to get these folders?
The script appears to be invalid. We recommend downloading the images directly from Google Drive.
Thanks for your prompt reply. The files in this Google Drive link still do not contain the "train_img" and "val_image" folders. As I mentioned above, unzipping "WebQA_imgs_7z_chunks" yields "imgs.tsv", and unzipping "WebQA_data_first_release.7z" yields "WebQA_test.json" and "WebQA_val.json". The last file, "imgs.lineidx", does not contain the two folders either.
Oh, I had the same question too, and I've resolved it now. After downloading and unzipping from Google Drive, getting "imgs.tsv" is correct. The images in this file are base64-encoded, so you need to decode them. After decoding, you will have a sequence of image ids paired with the images themselves. You can then separate the images into datasets based on the image ids listed in "WebQA_test.json" and "WebQA_val.json". This is my solution, for reference only.
@Aeryn666 Thanks for your help, I successfully extracted the images. I'm leaving the extraction code here for anyone who may need it.
import argparse
import base64
import io
import json
import os

from PIL import Image
from tqdm import tqdm


def decode_imgs_tsv(file_path):
    """Read imgs.tsv and return a dict mapping image id -> decoded image bytes."""
    with open(file_path, 'r') as f:
        lines = f.readlines()
    images = {}
    for line in tqdm(lines, desc="Decoding images"):
        # Each row is "<image_id>\t<base64-encoded image>"
        image_id, encoded_image = line.strip().split('\t')
        image_data = base64.b64decode(encoded_image)
        images[image_id] = image_data
    return images


def load_json_file(file_path):
    with open(file_path, 'r') as f:
        return json.load(f)


def save_image(image_data, output_path):
    """Decode raw image bytes with PIL and save them as an RGB image."""
    image = Image.open(io.BytesIO(image_data))
    image.convert('RGB').save(output_path)


def parse_arguments():
    parser = argparse.ArgumentParser(description="Extract images from TSV file and save them")
    parser.add_argument("--tsv_path", type=str, required=True, help="Path to the TSV file")
    parser.add_argument("--data_json_path", type=str, required=True, help="Path to WebQA_train_val.json")
    parser.add_argument("--output_dir", type=str, required=True, help="Output directory to save images")
    return parser.parse_args()


def main():
    args = parse_arguments()

    # Create output directories
    os.makedirs(os.path.join(args.output_dir, 'train'), exist_ok=True)
    os.makedirs(os.path.join(args.output_dir, 'val'), exist_ok=True)

    # Decode images
    images = decode_imgs_tsv(args.tsv_path)

    # Load JSON files
    data = load_json_file(args.data_json_path)

    # Get image IDs and save images
    for guid in tqdm(data, desc="Saving images"):
        split = data[guid]["split"]
        assert split in ["train", "val"]
        img_infos = data[guid]["img_posFacts"] + data[guid]["img_negFacts"]
        for item in img_infos:
            image_id = str(item["image_id"])
            output_path = os.path.join(args.output_dir, f"{split}", f"{image_id}.png")
            if not os.path.exists(output_path):
                try:
                    save_image(images[image_id], output_path)
                except Exception as e:
                    # Covers ids missing from the TSV as well as undecodable images
                    print(f"Failed to save image {image_id}: {e}")

    print("Data preparation completed!")


if __name__ == "__main__":
    main()
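The script above holds every decoded image in memory at once, which may be a problem on machines with limited RAM. A possible lower-memory variant is sketched below. It assumes the same row layout (image id, a tab, then the base64 string) and the same JSON fields as the script; wanted_image_ids and iter_wanted_images are hypothetical helper names, and this has not been run against the full dataset.

```python
import base64
import json


def wanted_image_ids(data):
    """Map each referenced image id (as str) to the set of splits that use it."""
    wanted = {}
    for sample in data.values():
        for fact in sample["img_posFacts"] + sample["img_negFacts"]:
            wanted.setdefault(str(fact["image_id"]), set()).add(sample["split"])
    return wanted


def iter_wanted_images(tsv_path, wanted):
    """Yield (image_id, split, raw_bytes), reading the TSV one row at a time."""
    with open(tsv_path, "r") as f:
        for line in f:
            if "\t" not in line:
                continue  # skip blank or malformed rows
            image_id, encoded = line.rstrip("\n").split("\t", 1)
            # Only decode rows whose id is actually referenced in the JSON
            for split in sorted(wanted.get(image_id, ())):
                yield image_id, split, base64.b64decode(encoded)


def load_wanted(data_json_path):
    """Convenience wrapper: build the id -> splits map from a JSON file on disk."""
    with open(data_json_path, "r") as f:
        return wanted_image_ids(json.load(f))
```

Each yielded raw_bytes value can be passed to save_image from the script above, with the output path built from split and image_id in the same way.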
Could you please clarify where train_img and val_img come from? Are they from web_qa? However, web_qa unzips to a TSV file.