Fine-tuning issue: error reading images

AriesChen-UPC commented 1 year ago

Fine-tuning on the data with the example script (bash finetune_XrayGLM.sh) produces the following error:

Traceback (most recent call last):
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 194, in <module>
    training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 67, in training_main
    train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
  File "/usr/local/lib/python3.10/dist-packages/sat/data_utils/configure_data.py", line 197, in make_loaders
    train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
  File "/usr/local/lib/python3.10/dist-packages/sat/data_utils/configure_data.py", line 124, in make_dataset_full
    d = create_dataset_function(p, args)
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 160, in create_dataset_function
    dataset = FewShotDataset(path, image_processor, tokenizer, args)
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 117, in __init__
    image = processor(Image.open(item['img']).convert('RGB'))
TypeError: string indices must be integers

Test environment: Google Colab A100. Data storage: Google Drive.

PS: This looks similar to Issue #5: reading images and other data stored in Google Drive fails.

OpenHuShen commented 1 year ago

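After testing, I found that finetune_XrayGLM.py is written for the dataset.json format under data/demo/. To read the openi-zh.json data instead, I modified part of the code as follows.

Note: I used the chardet package to detect the JSON file's encoding, because I found that openi-zh.json is encoded as ASCII.

1. Add a get_encoding function before FewShotDataset to get the file's encoding: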

import chardet
def get_encoding(file_path):
    # Open the file in binary mode, read part of its content, then detect its encoding
    with open(file_path, 'rb') as f:
        data = f.read(100)  # read only the first 100 bytes, for efficiency
    encod = chardet.detect(data)['encoding']
    return encod

2. Modify part of the FewShotDataset code:

class FewShotDataset(Dataset):
    def __init__(self, path, processor, tokenizer, args):
        max_seq_length = args.max_source_length + args.max_target_length
        self.images = []
        self.input_ids = []
        self.labels = []
        encod = get_encoding(path)
        with open(path, 'r', encoding=encod) as f:
            data = json.load(f)
        data = data['annotations']
        for item in data:
            image = processor(Image.open('data/Xray/' + item['image_id']+'.png').convert('RGB'))
            input0 = tokenizer.encode("<img>", add_special_tokens=False)
            input1 = [tokenizer.pad_token_id] * args.image_length
            # Chinese prompt: "Q: What can be diagnosed from this chest X-ray image?\nA: "
            input2 = tokenizer.encode("</img>问：通过这张胸部x光影像可以诊断出什么？\n答：", add_special_tokens=False)
            a_ids = sum([input0, input1, input2], [])
            b_ids = tokenizer.encode(text=item['caption'], add_special_tokens=False)

3. The rest of the code is unchanged.
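
For readers wondering why the original loop raises the TypeError in the traceback: the error is consistent with iterating over a JSON object (a dict) instead of a list of records, so each item is a string key rather than a dict. The snippet below is a minimal sketch of that situation. The miniature JSON and the helper first_image_key are made up for illustration; only the field names 'annotations', 'image_id', 'caption' and 'img' come from the code and traceback in this thread.

```python
import json

# Hypothetical miniature of openi-zh.json: a top-level object holding an
# 'annotations' list (field names taken from the modified class above).
raw = '{"annotations": [{"image_id": "1", "caption": "example caption"}]}'
data = json.loads(raw)


def first_image_key(records):
    """Mimic the original loop, which expects a list of dicts with an 'img' key."""
    for item in records:
        return item['img']


try:
    # Iterating over a dict yields its keys, so item is the string 'annotations'
    # and item['img'] tries to index a string with a string.
    first_image_key(data)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError

# Selecting the 'annotations' list first yields dict items, as in the modified class.
image_ids = [item['image_id'] for item in data['annotations']]
print(image_ids)  # ['1']
```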

AriesChen-UPC commented 1 year ago

OK, thank you very much. I will debug based on the information you provided.

lushanfu commented 1 year ago

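Run ./data/build_ch_prompt.py, and pay attention to the path where the images are stored. Then change the JSON path in finetune_XrayGLM.sh to the path of the file you just generated. The openi-zh.json provided by the author is not yet the final, trainable JSON version; comparing it with visual_GLM's dataset.json makes the difference clear.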
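
To make the format difference concrete, here is a rough sketch of such a conversion. It is not the actual build_ch_prompt.py, whose contents are not shown in this thread. The function name convert_annotations and the output keys 'prompt' and 'label' are assumptions for illustration; 'img' is the key the original loader reads, and 'annotations', 'image_id', 'caption' and the .png suffix come from the modified class earlier in the thread.

```python
import json
import os


def convert_annotations(src_path, dst_path, image_dir, prompt):
    """Flatten {'annotations': [...]} into a flat list of per-image records."""
    # ASCII is a subset of UTF-8, so an ASCII-encoded file opens fine this way.
    with open(src_path, 'r', encoding='utf-8') as f:
        annotations = json.load(f)['annotations']

    records = []
    for item in annotations:
        records.append({
            # 'img' is the key that the original FewShotDataset reads.
            'img': os.path.join(image_dir, item['image_id'] + '.png'),
            'prompt': prompt,          # assumed key name
            'label': item['caption'],  # assumed key name
        })

    with open(dst_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```

Whatever script produces the file, the image paths written into it need to match where the images actually live, and the JSON path in finetune_XrayGLM.sh needs to point at the generated file.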

WangRongsheng / XrayGLM

Fine-tuning issue: error reading images #36