关于数据准备 - Githubissues

syf-fgnb commented 6 months ago

你好，关于AS-v2的stage2数据集准备，有几个点有一些不确定，还望解惑：

在说明里ScienceQA是需要下载的，但是在数据的目录结构里没有看见对应的folder。所以其实是不需要吗，还是需要放在哪个folder下呢；
ShareGPT4V-100K提供了两个链接，但在结构里好像只用上了第一个链接中的其中三个（web-xxx和wiki），所以剩下的需要下载吗；
`sam' folder的目录结构，是否意味着模型训练只用到了sa000000-sa000063，其他的都归到images folder下呢
在huggingface上还有很多的json文件，比如as_mix_4m.json , rec_detailed_description_42k.json，这些需要放在哪个目录呢

谢谢

Weiyun1025 commented 6 months ago

你好，感谢对我们项目的关注！

需要的，数据放在playground/data/ScienceQA中
ShareGPT4V-100K的所有数据都是用了的，所有的图像都需要下载
ASMv2训练的时候用的是AS-10M预训练和AS-Core微调，这两个数据集只用到了sa000000-sa000063
Stage1的微调用的是llava_v1_5_mix665k_asmv2_format.json，Stage2的预训练的一部分数据是as_pretrain_10m.json（此外还用到了CC12M中的5M样本和GRiT中的15M样本），Stage2的微调用的是as_mix_4m.json。其余的rec_xxx.json都是AS-V2的数据，这些数据已经包含在了as_mix_4m.json，单独放出来是为了方便大家单独使用这部分数据。

syf-fgnb commented 6 months ago

感谢回复，

那ShareGPT4V-100K里的playground, share_textvqa，这些放在哪个目录下呢；
所以可以理解为sam/images这个folder其实上optional的是吗；
了解了。那as_mix_4m.json这个文件是放在playground/data下的吗

Weiyun1025 commented 6 months ago

share_textvqa是放在playground/data下的，即playground/data/share_textvqa
不是，这里sam/images是sharegpt4v用的sam图像（参考他们的github进行配置），其他的则是AS-Core用到的图像（下载SA-1B的图像放过去即可）
放在哪里都可以其实，训练脚本里改一下对应的路径即可，只有图像路径是需要按照README来的，因为as_mix_4m.json里都写成相对路径了

关于图像位置的配置，一种简单的方案是写一个脚本判断一下as_mix_4m.json中的图像是否都存在，看一下那些不存在的图像是哪个数据集的，然后对应的补上即可

import os
import json
from collections import defaultdict

base_dir = playground/data/'
ann_path = 'as_mix_4m.json'
with open(ann_path) as file:
    ann = json.load(file)

start_idx = 0
not_exist_path = defaultdict(int)
info = defaultdict(int)
for idx, item in enumerate(ann[start_idx:], start=start_idx):
    if 'image' not in item:
        continue

    image = item['image']
    exist = os.path.exists(os.path.join(base_dir))

    if not exist:
        info['not_exist'] += 1
        not_exist_path['/'.join(image.split('/')[:-2])] += 1

    if idx % 10000 == 0 and check:
        print()
        print(idx)
        for k, v in info.items():
            print(k, v)
        for k, v in not_exist_path.items():
            print(k, v)

for k, v in info.items():
    print(k, v)
for k, v in not_exist_path.items():
    print(k, v)

print('finish')

OpenGVLab / all-seeing

关于数据准备 #13