Morizeyao / GPT2-Chinese

Chinese version of GPT2 training code, using BERT tokenizer.
MIT License
7.48k stars · 1.7k forks

Can anyone advise on how to create the train.json file? #259

Open cxhermagic opened 2 years ago

cxhermagic commented 2 years ago

Using the format ["train.txt", "train.txt", "train.txt"] doesn't seem to work. train.txt contains the text of the novel 斗破苍穹.

homuraLan commented 1 year ago

Reading the novel file as JSON keeps throwing errors. I assumed the novel could be read directly as JSON, but apparently not?

homuraLan commented 1 year ago

The docstring ''' 如果训练材料是全部堆在一起不分篇章的话用这个文件 ''' ("use this file if the training material is all lumped together without chapter divisions") really misled me.

homuraLan commented 1 year ago

I figured out how to do it; the code has pitfalls.

LemonFan-maker commented 1 year ago

Could it be that the JSON is too long and fails to load? If so, see https://github.com/Morizeyao/GPT2-Chinese/issues/174#issue-723932145

LemonFan-maker commented 1 year ago

For training on a single book, train_single.py is recommended; this is also explained in the README.

cywjava commented 1 year ago

The format is like this: ["article text 1", "article text 2", "article text 3"]
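In other words, train.json holds the article texts themselves, not file names. A minimal sketch of producing such a file (the placeholder article strings are assumptions):

```python
import json

# train.json is a JSON array of article strings, not a list of file names.
# These placeholder articles are just for illustration.
articles = ["article text 1", "article text 2", "article text 3"]

with open('train.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
    json.dump(articles, f, ensure_ascii=False)
```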

cywjava commented 1 year ago

What I want to know is: can I make multiple train.json files, train on each of them, and have the model weights end up in a single .bin file?
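One simple workaround: since the trainer reads a single train.json, merge several of them into one array before training. A sketch under that assumption (merge_train_jsons and the output file name are hypothetical):

```python
import json

def merge_train_jsons(paths, out_path='train_merged.json'):
    """Concatenate several train.json files (each a JSON array of article
    strings) into one array, so a single training run covers them all.

    Returns the number of articles written.
    """
    merged = []
    for p in paths:
        with open(p, encoding='utf-8') as f:
            merged.extend(json.load(f))
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)
```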

Yang-qwq commented 1 year ago

You can modify the code yourself, but I wrote a small script to create train.json:

```python
# -*- coding: utf-8 -*-
import json
import os
import sys

# Load the existing train.json if present; otherwise start from an empty list.
try:
    with open('train.json', 'r', encoding='utf-8') as t:
        content = json.load(t)
except (FileNotFoundError, json.JSONDecodeError):
    content = []

if len(sys.argv) > 1:
    # A file was named on the command line: append just that one.
    paths = [sys.argv[1]]
else:
    # No argument: append every .txt file in the current directory.
    paths = [p for p in os.listdir() if p.endswith('.txt')]

for path in paths:
    with open(path, 'r', encoding='utf-8') as f:
        print(f'loaded: {path}')
        content.append(f.read())

# Rewrite the whole file instead of juggling seek()/truncate() on an 'a+' handle.
with open('train.json', 'w', encoding='utf-8') as t:
    json.dump(content, t, ensure_ascii=False)

print(f'wrote {len(content)} objects.')
```
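After running the script, it is worth sanity-checking that the output parses as the array-of-strings format the trainer expects. A small hypothetical helper (check_train_json is not part of the repo):

```python
import json

def check_train_json(path='train.json'):
    """Load a train.json and return (article_count, total_characters).

    Raises if the file is not a JSON array of strings, which is the
    shape the training code expects.
    """
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    assert isinstance(data, list) and all(isinstance(a, str) for a in data), \
        'train.json must be a JSON array of article strings'
    return len(data), sum(len(a) for a in data)
```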