About the import error - GitHub Issues

EHEROCORE commented 5 months ago

ImportError: cannot import name 'image_aug' from 'tool.image_aug'

This error appears when I run train.py. Is there a fix?

wang-zhix commented 5 months ago

Change the import to

from tool.image_aug import aug_sequential

and change

train_transformer = image_aug  # training-set data augmentation

to

train_transformer = aug_sequential  # training-set data augmentation

Gmgge commented 5 months ago

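@wang-zhix @EHEROCORE Thank you both very much for reporting the bug and proposing a fix. I have fixed it accordingly, so please pull the latest code and try again. PRs are also welcome.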

EHEROCORE commented 5 months ago

Thank you very much for your answer. After fixing this problem, however, I still get an error later on when training the model:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 20: illegal multibyte sequence

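After debugging, my guess is that the dataset layout differs from what the code expects. The code expects a layout like /tmp/0/0.jpg #image, but the datasets extracted from the links in DataSet.md have no numbered folder around the files, i.e. they look like /tmp/0.jpg, which is one folder level short. How can I solve this? I'm not sure whether something went wrong when I ran this command during dataset preparation:

python tool/gen_vocab.py \
    --dataset_path "dataset/cust-data/0/" \
    --cust_vocab ./cust-data/vocab.txt

Before running this command I had not yet understood how the project works, so I put the extracted seal_0 folder inside the dataset folder and changed cust-data/0/ in the command to seal_0. Once I understood the logic I changed the folder name back, but cust_data still lacks the extra level of numbered folders. Is that what caused the problem? If so, how can I fix it?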
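If the missing folder level does turn out to matter, one possible remedy is to move each image/label pair into its own numbered subfolder. The sketch below assumes a flat folder in which files sharing a stem (0.jpg, 0.txt) belong together. The name nest_by_stem and the resulting layout are illustrative guesses based on the /tmp/0/0.jpg example, not something taken from the repository.

```python
import shutil
from pathlib import Path


def nest_by_stem(flat_dir: str) -> int:
    """Move files such as 0.jpg and 0.txt into a subfolder named 0/.

    Returns the number of files moved. Existing subfolders are left alone.
    """
    root = Path(flat_dir)
    moved = 0
    # sorted() materialises the listing first, so the folders created
    # below do not show up in the iteration.
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        target_dir = root / path.stem  # e.g. <root>/0 for 0.jpg
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved += 1
    return moved


# Hypothetical usage on a copy of the data:
# nest_by_stem("./dataset/cust-data/")
```

Running it on a copy of the dataset first is the safer option, since the files are moved rather than duplicated.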

EHEROCORE commented 5 months ago

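I also ran into an encoding-format problem before this, as follows:

raise ValueError(
ValueError: Mixed precision training with AMP or APEX (--fp16 or --bf16) and half precision evaluation (--fp16_full_eval or --bf16_full_eval) can only be used on CUDA devices.

At that point I had just submitted the first problem in this issue, and had changed image_aug and added an "s" to image in train, which made that error go away for the moment, though I don't know whether it was really fixed. While dealing with the encoding error I followed solutions found online: I confirmed that torch can use the GPU and downgraded transformers and pytorch, but that did not solve it. I then received your two replies and made the changes, after which the error changed into the one in my reply above. I'm not sure whether describing the sequence of changes in more detail helps with solving this. Many thanks to you both, and I hope to hear back.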

wang-zhix commented 5 months ago

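UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 20: illegal multibyte sequence

This error is most likely an encoding problem with vocab.txt. For now you can skip running gen_vocab.py and use the author's vocab.txt directly, then switch to your own vocabulary once everything else runs.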

wang-zhix commented 5 months ago

ValueError: Mixed precision training with AMP or APEX (--fp16 or --bf16) and half precision evaluation (--fp16_full_eval or --bf16_full_eval) can only be used on CUDA devices.

I have not run into this problem myself. It is probably caused by the torch version.

You can try turning fp16 off in train.py (fp16=False), or try a different environment.

These are the versions in my environment (for reference only; they are not guaranteed to solve your problem):

python 3.10.13
torch 1.13.1+cu116
torchaudio 0.13.1+cu116
torchvision 0.14.1+cu116
transformers 4.37.0
scikit-learn 1.4.0
jiwer 3.0.3
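For the fp16 suggestion, one way to avoid hard-coding the flag is to enable it only when PyTorch actually reports a usable CUDA device. A minimal sketch follows. How train.py builds its training arguments is not shown in this thread, so the commented-out Seq2SeqTrainingArguments line is an assumption about where the value would go.

```python
def cuda_available() -> bool:
    """Return True only if torch is importable and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


# Mixed precision is only valid on CUDA, so tie the flag to the device check.
use_fp16 = cuda_available()
print(f"fp16 enabled: {use_fp16}")

# Assumed usage inside train.py (not confirmed by this thread):
# training_args = Seq2SeqTrainingArguments(..., fp16=use_fp16)
```

If this prints False on a machine that has a GPU, the installed torch build may be CPU-only, which would also explain the ValueError.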

EHEROCORE commented 5 months ago

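Thank you very much for your guidance. I replaced the vocab file in dataset with the version the author uploaded, and changed the encoding used for reading files in the __getitem__ method of dataset.py to utf-8. After that the model trains normally. Thank you!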

Gmgge commented 5 months ago

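Are you training on Windows? This problem usually arises because Python, unless told otherwise, reads files with the operating system's default encoding. On Windows that is usually gbk for historical reasons, whereas on Linux it is usually utf-8.

I'll fix this by explicitly declaring the encoding when reading files.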
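The difference is easy to reproduce. The sketch below writes a label in UTF-8 and reads it back twice: once decoded as gbk, mimicking what a Chinese-locale Windows would pick by default, and once with utf-8 declared explicitly. The file name and sample text are made up for illustration.

```python
import os
import tempfile

label = "测试印章"  # made-up sample text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "0.txt")  # hypothetical label file
    with open(path, "w", encoding="utf-8") as f:
        f.write(label)

    # Without encoding=..., open() uses the locale's preferred encoding.
    # Forcing gbk here mimics that default on a Chinese-locale Windows.
    try:
        with open(path, encoding="gbk") as f:
            as_gbk = f.read()
    except UnicodeDecodeError as exc:
        as_gbk = None
        print("gbk decode failed:", exc)

    # Declaring the encoding makes the result independent of the OS.
    with open(path, encoding="utf-8") as f:
        as_utf8 = f.read()

print("utf-8 read matches:", as_utf8 == label)
print("gbk read matches:", as_gbk == label)
```

Python 3.7 and later also offer a UTF-8 mode (python -X utf8, or the PYTHONUTF8=1 environment variable) that changes the default, though passing encoding= explicitly keeps the code portable.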

EHEROCORE commented 5 months ago

Hello, and thank you for your guidance. Yes, I am training on Windows.

EHEROCORE commented 5 months ago

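Hello. The results I got from this model earlier were not good, which I guess is because the dataset was too small. I labeled 8000 sets of seal data with ppocrlabel and wanted to retrain the model to see how it performs, but this time the run failed with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 69: invalid continuation byte

I gave the full error output to GPT, which suggested that the error occurs while loading TrOCRProcessor or RobertaTokenizer, specifically while reading a JSON vocabulary file or a model configuration file, and that this error can appear if those files are not saved as UTF-8. However, I searched for these two files and found nothing. Is there a solution? One more note: I confirmed that vocab.txt and the txt files of the seal data are all UTF-8, but I can't rule out that some of their content is mis-encoded and cannot be decoded.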

Gmgge commented 5 months ago

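1. If it only affects a few files, you can add a try block to catch the error.
2. You can write a script that checks the label data on its own, and then fix the labels whose encoding may be wrong.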
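Both suggestions can be combined in one small checker: walk the dataset, try to decode every label file as UTF-8 inside a try block, and collect the files that fail. This is only a sketch. The .txt suffix and the folder layout are assumptions based on the dataset described earlier in the thread, and find_bad_labels is an illustrative name.

```python
import os


def find_bad_labels(dataset_dir: str, suffix: str = ".txt") -> list:
    """Return (path, error) pairs for label files that are not valid UTF-8."""
    bad = []
    for root, _dirs, files in os.walk(dataset_dir):
        for name in sorted(files):
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(root, name)
            try:
                # Read raw bytes and decode strictly so any bad byte raises.
                with open(path, "rb") as f:
                    f.read().decode("utf-8")
            except UnicodeDecodeError as exc:
                bad.append((path, str(exc)))
    return bad


# Hypothetical usage:
# for path, err in find_bad_labels("./dataset/cust-data/"):
#     print(path, "->", err)
```

Each reported error includes the offending byte and its position, which makes it easier to decide whether to re-save or discard the file.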

Gmgge commented 5 months ago

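Alternatively, you can post a screenshot of the error. I need the specific line numbers in the traceback and some surrounding context.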

EHEROCORE commented 5 months ago

Thank you for your reply. The full error output is below. I ran the script from the command line and that window is already closed, so all I have is the error text I copied to GPT this afternoon. I hope that is acceptable.

(pytorch) E:\Design_seal_recognition_system\TrOCR-Seal-Recognition-main\TrOCR-Seal-Recognition-main>python train.py --cust_data_init_weights_path ./cust-data/weights --checkpoint_path ./checkpoint/trocr-custdata --dataset_path "./dataset/cust-data/" --per_device_train_batch_size 8 --CUDA_VISIBLE_DEVICES 0
train param
Namespace(cust_data_init_weights_path='./cust-data/weights', checkpoint_path='./checkpoint/trocr-custdata', dataset_path='./dataset/cust-data/', per_device_train_batch_size=8, per_device_eval_batch_size=8, max_target_length=128, num_train_epochs=10, eval_steps=1000, save_steps=1000, CUDA_VISIBLE_DEVICES='0')
loading data .................
data count: 8000
train num: 7600
test num: 400
Traceback (most recent call last):
  File "E:\Design_seal_recognition_system\TrOCR-Seal-Recognition-main\TrOCR-Seal-Recognition-main\train.py", line 63, in <module>
    processor = TrOCRProcessor.from_pretrained(args.cust_data_init_weights_path)
  File "D:\anaconda3\envs\pytorch\lib\site-packages\transformers\models\trocr\processing_trocr.py", line 109, in from_pretrained
    tokenizer = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "D:\anaconda3\envs\pytorch\lib\site-packages\transformers\tokenization_utils_base.py", line 1747, in from_pretrained
    return cls._from_pretrained(
  File "D:\anaconda3\envs\pytorch\lib\site-packages\transformers\tokenization_utils_base.py", line 1882, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "D:\anaconda3\envs\pytorch\lib\site-packages\transformers\models\roberta\tokenization_roberta.py", line 166, in __init__
    super().__init__(
  File "D:\anaconda3\envs\pytorch\lib\site-packages\transformers\models\gpt2\tokenization_gpt2.py", line 181, in __init__
    self.encoder = json.load(vocab_handle)
  File "D:\anaconda3\envs\pytorch\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "D:\anaconda3\envs\pytorch\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 69: invalid continuation byte

EHEROCORE commented 5 months ago

In addition, I wrote a script that converts the dataset into the layout used in the readme, where each folder holds one pair of seal files, e.g. 0.png and 0.txt placed in a folder named after the corresponding number. This matches the following option from the readme: --dataset_path "dataset/cust-data/0/". I also modified gen_vocab.py to traverse all subdirectories recursively:

import argparse
import codecs
import fnmatch
import os

from tqdm import tqdm


def find_files(directory, pattern):
    """Recursively yield every file under directory whose name matches pattern."""
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate a custom vocab file')
    parser.add_argument('--cust_vocab', default="./cust-data/vocab.txt", type=str, help="output path for the custom vocab file")
    parser.add_argument('--dataset_path', default="./dataset/train/", type=str, help="root path of the training-data character set")
    args = parser.parse_args()

    vocab = set()
    # Changed to use find_files so that all .txt files are found recursively
    for txt_file in tqdm(find_files(args.dataset_path, "*.txt")):
        with codecs.open(txt_file, encoding='utf-8') as f:
            txt = f.read().strip()
        vocab.update(txt)

    root_path = os.path.split(args.cust_vocab)[0]
    os.makedirs(root_path, exist_ok=True)
    with open(args.cust_vocab, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(list(vocab))))

I also modified the get_image_file_list function in file_tool.py so that it traverses subdirectories as well:

import imghdr
import os


def get_image_file_list(img_file):
    imgs_lists = []
    if img_file is None or not os.path.exists(img_file):
        raise Exception("not found any img file in {}".format(img_file))

    img_end = {'jpg', 'bmp', 'png', 'jpeg', 'rgb', 'ppm', 'tiff', 'gif', 'webp'}

    # If it is a directory, recursively look for image files in all subdirectories
    if os.path.isdir(img_file):
        for root, dirs, files in os.walk(img_file):
            for file in files:
                file_path = os.path.join(root, file)
                if os.path.isfile(file_path) and imghdr.what(file_path) in img_end:
                    imgs_lists.append(file_path)
    else:
        if os.path.isfile(img_file) and imghdr.what(img_file) in img_end:
            imgs_lists.append(img_file)

    if len(imgs_lists) == 0:
        raise Exception("not found any img file in {}".format(img_file))

    imgs_lists = sorted(imgs_lists)
    return imgs_lists

With this change, get_image_file_list finds and returns the paths of the image files whether they sit directly in the given directory or in any of its subdirectories.

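I don't know whether this modification caused the problem, but I suspect not. Before the modification I already got a similar utf-8 error, only with a different reported position:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 20: invalid continuation byte

The position used to be 20. I'm not sure whether this information helps locate the cause. Thank you very much.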

Gmgge commented 5 months ago

Judging from the traceback,

File "E:\Design_seal_recognition_system\TrOCR-Seal-Recognition-main\TrOCR-Seal-Recognition-main\train.py", line 63, in <module>
processor = TrOCRProcessor.from_pretrained(args.cust_data_init_weights_path)

execution has not reached the data-loading stage yet. You need to debug down to this step:

File "D:\anaconda3\envs\pytorch\lib\site-packages\transformers\models\gpt2\tokenization_gpt2.py", line 181, in __init__
self.encoder = json.load(vocab_handle)

Check how the file stream loaded in that step is opened. There should be an open call that creates the vocab_handle stream; make sure it specifies utf-8 encoding.

You can refer to this issue: https://github.com/huggingface/transformers/issues/1125
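One further observation, offered tentatively: the message names the 'utf-8' codec, so the stream is apparently already being decoded as UTF-8, which suggests the bytes of the vocabulary file itself may be in another encoding (GBK would be a plausible guess on Windows). Under that assumption, a sketch like the one below rewrites such a file as UTF-8 and keeps a backup. reencode_to_utf8 is an illustrative helper, and the path in the usage comment combines the weights directory from the command line in the traceback with vocab.json, the file name RobertaTokenizer normally loads.

```python
import shutil
from pathlib import Path


def reencode_to_utf8(path: str, source_encoding: str = "gbk") -> bool:
    """Rewrite the file as UTF-8 if it is not valid UTF-8 already.

    Returns True if the file was rewritten; a .bak copy is kept.
    Assumes the file was originally written in source_encoding.
    """
    p = Path(path)
    raw = p.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    text = raw.decode(source_encoding)  # raises if the guess is wrong
    shutil.copy2(p, p.with_name(p.name + ".bak"))  # keep the original bytes
    p.write_bytes(text.encode("utf-8"))
    return True


# Hypothetical usage:
# reencode_to_utf8("./cust-data/weights/vocab.json")
```

If the file was produced by a script that opened it for writing without an encoding argument, fixing that script and regenerating the file is the cleaner fix; the helper only repairs what is already on disk.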

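I also sincerely recommend training on Linux during the early and middle stages of a project. It avoids a lot of unnecessary trouble and saves time for enjoying life!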

Gmgge / TrOCR-Seal-Recognition

About the import error #20