Closed AriesChen-UPC closed 1 year ago
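After testing, I found that finetune_XrayGLM.py works with the dataset.json format under data/demo/. To read openi-zh.json instead, I modified part of the code as follows.
Note: I used the chardet package to detect the JSON file's encoding, because I found that openi-zh.json is encoded as ASCII.
1. Add a get_encoding function before FewShotDataset to get the file's encoding: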
import chardet

def get_encoding(file_path):
    # Open the file in binary mode, read part of it, and detect its encoding
    with open(file_path, 'rb') as f:
        data = f.read(100)  # read only the first 100 bytes, for speed
    encod = chardet.detect(data)['encoding']
    return encod
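A possible alternative, offered only as a sketch: if the file is ASCII or any UTF-8/UTF-16/UTF-32 variant, the standard library can decode it without chardet, because `json.load` accepts a file opened in binary mode and detects those encodings itself. The function name `load_json_any_encoding` is hypothetical and not part of the repository. This does not cover legacy encodings such as GBK, so the chardet approach above remains the more general one.

```python
import json

def load_json_any_encoding(file_path):
    """Load a JSON file without naming its encoding (hypothetical helper)."""
    # In binary mode, json.load receives bytes and detects UTF-8, UTF-16
    # or UTF-32 on its own; plain ASCII is a subset of UTF-8.
    with open(file_path, 'rb') as f:
        return json.load(f)
```

A JSON file that stores Chinese text as `\uXXXX` escapes contains only ASCII bytes, which may be why chardet reports ASCII for openi-zh.json.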
2. Change part of the FewShotDataset code:
class FewShotDataset(Dataset):
    def __init__(self, path, processor, tokenizer, args):
        max_seq_length = args.max_source_length + args.max_target_length
        self.images = []
        self.input_ids = []
        self.labels = []
        encod = get_encoding(path)
        with open(path, 'r', encoding=encod) as f:
            data = json.load(f)
        data = data['annotations']
        for item in data:
            image = processor(Image.open('data/Xray/' + item['image_id']+'.png').convert('RGB'))
            input0 = tokenizer.encode("<img>", add_special_tokens=False)
            input1 = [tokenizer.pad_token_id] * args.image_length
            input2 = tokenizer.encode("</img>问:通过这张胸部x光影像可以诊断出什么?\n答:", add_special_tokens=False)
            a_ids = sum([input0, input1, input2], [])
            b_ids = tokenizer.encode(text=item['caption'], add_special_tokens=False)
3. The rest of the code is unchanged.
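Since the loader above builds each image path as 'data/Xray/' + image_id + '.png', a quick pre-flight check may catch missing files before training starts. The following is a minimal sketch; `find_missing_images` is a hypothetical helper and not part of the repository, and the default directory simply mirrors the path used in the code above.

```python
import os

def find_missing_images(annotations, image_dir='data/Xray/'):
    """Return the image_ids whose PNG file does not exist under image_dir."""
    missing = []
    for item in annotations:
        # Same naming scheme as the modified loader: <image_dir>/<image_id>.png
        image_path = os.path.join(image_dir, item['image_id'] + '.png')
        if not os.path.isfile(image_path):
            missing.append(item['image_id'])
    return missing
```

Calling it on the list stored under the 'annotations' key and getting an empty list back would suggest that every image can be opened from the expected location.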
OK, thank you very much. I will debug based on the information you provided.
Fine-tuning on the data with the example script (bash finetune_XrayGLM.sh) fails with the following error:
Traceback (most recent call last):
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 194, in <module>
    training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 67, in training_main
    train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
  File "/usr/local/lib/python3.10/dist-packages/sat/data_utils/configure_data.py", line 197, in make_loaders
    train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
  File "/usr/local/lib/python3.10/dist-packages/sat/data_utils/configure_data.py", line 124, in make_dataset_full
    d = create_dataset_function(p, args)
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 160, in create_dataset_function
    dataset = FewShotDataset(path, image_processor, tokenizer, args)
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 117, in __init__
    image = processor(Image.open(item['img']).convert('RGB'))
TypeError: string indices must be integers
Test environment: Google Colab, A100. Data storage: Google Drive.
PS: This is similar to Issue 5: the problem occurs when reading images and other data stored on Google Drive.
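One plausible reading of this traceback, sketched under an assumption: the modified loader elsewhere in this thread reads data['annotations'], which suggests that openi-zh.json is a dict at the top level, not a list of records. Iterating over a dict yields its string keys, and indexing a string with 'img' raises this TypeError regardless of where the images are stored. The record below is a made-up stand-in, not real data from the file.

```python
# Hypothetical stand-in for openi-zh.json; the real file has many more records.
data = {"annotations": [{"image_id": "example_0001", "caption": "example caption"}]}

# Iterating over a dict yields its keys, so each "item" is a string.
items = [item for item in data]
print(items)  # ['annotations']

# Indexing a string with a string key raises the TypeError from the traceback.
try:
    items[0]['img']
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Selecting the list first yields dict records that can be indexed by key.
records = data['annotations']
print(records[0]['image_id'])  # example_0001
```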
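Run ./data/build_ch_prompt.py, and pay attention to the path where the images are stored. Then change the JSON path in finetune_XrayGLM.sh to the path of the file you just generated. The openi-zh.json provided by the author is not yet the final trainable JSON; comparing it with VisualGLM's dataset.json makes the difference clear.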
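To illustrate the kind of conversion involved, here is a minimal sketch. It is not the contents of build_ch_prompt.py, which is not shown in this thread. The 'img' key comes from the traceback and the 'image_id' and 'caption' keys from the modified loader posted in this thread; the 'prompt' and 'label' keys are assumptions modeled on VisualGLM's dataset.json and should be checked against that file. The function name, the default image directory, and the prompt text are illustrative choices.

```python
import json

# Prompt text taken from the modified loader posted in this thread.
PROMPT = "通过这张胸部x光影像可以诊断出什么?"

def annotations_to_records(openi, image_dir='data/Xray/', prompt=PROMPT):
    """Flatten an openi-zh.json-style dict into a list of training records."""
    records = []
    for item in openi['annotations']:
        records.append({
            'img': image_dir + item['image_id'] + '.png',  # key used by the original loader
            'prompt': prompt,                              # assumed key name
            'label': item['caption'],                      # assumed key name
        })
    return records
```

The resulting list could then be written with `json.dump(records, f, ensure_ascii=False, indent=2)` and its path set in finetune_XrayGLM.sh.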