mirrorQAQ opened this issue 3 years ago
Thanks a lot, man! I used your code, and sentences in the MSRA dataset are now truncated by length, but training still hits an OOM. Have you run into this?
train:7026
train max_seq_len:387
train max_lex_num:233
train max_seq_lex:601
test max_seq_len:351
test max_lex_num:231
test max_seq_lex:572
loading vocabulary file /home/llq/.fastNLP/embedding/bert-chinese-wwm/vocab.txt
Load pre-trained BERT parameters from file /home/llq/.fastNLP/embedding/bert-chinese-wwm/chinese_wwm_pytorch.bin.
Start to generate word pieces for word.
Found(Or segment into word pieces) 116198 words out of 116559.
training epochs started 2020-12-15-13-55-24
Traceback (most recent call last):
File "flat_main.py", line 588, in <module>
trainer.train()
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 613, in train
self.callback_manager.on_exception(e)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/callback.py", line 309, in wrapper
returns.append(getattr(callback, func.__name__)(*arg))
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/callback.py", line 505, in on_exception
raise exception # 抛出陌生Error
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 609, in train
self._train()
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 697, in _train
eval_res = self._do_validation(epoch=epoch, step=self.step)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 714, in _do_validation
res = self.tester.test()
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/tester.py", line 165, in test
pred_dict = self._data_forward(self._predict_func, batch_x)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/tester.py", line 213, in _data_forward
y = self._predict_func_wrapper(**x)
File "/home/llq/homework/Flat-Lattice-Transformer/models.py", line 511, in forward
embedding, seq_len, lex_num=lex_num, pos_s=pos_s, pos_e=pos_e)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/llq/homework/Flat-Lattice-Transformer/modules.py", line 1277, in forward
rel_pos_embedding = self.four_pos_fusion_embedding(pos_s,pos_e)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/llq/homework/Flat-Lattice-Transformer/modules.py", line 110, in forward
pe_2 = torch.cat([pe_ss,pe_ee],dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 3.43 GiB (GPU 0; 10.91 GiB total capacity; 7.58 GiB already allocated; 2.49 GiB free; 7.87 GiB reserved in total by PyTorch)
Try turning the batch size down; give 16 a shot.
I set the batch size to 2 with gradient accumulation over 5 steps (sketched below), but it still blew up midway...
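For reference, a generic PyTorch sketch of "batch size 2, accumulate gradients over 5 steps" (effective batch 10). Illustrative only: FLAT trains through fastNLP's Trainer, not a hand-written loop, and the dummy model/data below just stand in.

```python
import torch
from torch import nn

# Dummy model and data standing in for FLAT; batches of size 2.
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(2, 8), torch.randint(0, 2, (2,))) for _ in range(10)]

accum_steps = 5
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Average the loss over the accumulation window so the update
    # matches one large batch of size 2 * accum_steps.
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                    # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()               # one update per 5 micro-batches
        optimizer.zero_grad()
```

Note this only shrinks the training-step footprint; it does nothing for evaluation memory.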
My guess is that the test batch is too large (test_batch was set to 10); debugging shows the OOM happens during evaluation.
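A back-of-the-envelope check supports that. The tensor that OOMs at modules.py:110 (`pe_2 = torch.cat([pe_ss, pe_ee], dim=-1)`) has shape [batch, L, L, 2 * hidden], so memory grows quadratically in the padded lattice length L. Assuming hidden size 160 (FLAT's usual 8 heads × 20 dims) and fp32 — plug in your own config:

```python
# Rough memory estimate for the fused relative-position tensor
# pe_2 of shape [batch, L, L, 2 * hidden].
def fusion_mem_gib(batch, max_seq_lex, hidden=160, bytes_per_el=4):
    return batch * max_seq_lex ** 2 * 2 * hidden * bytes_per_el / 2 ** 30

print(fusion_mem_gib(10, 572))  # eval: test_batch=10, L=572 -> ~3.9 GiB
print(fusion_mem_gib(2, 601))   # train: batch=2, L=601    -> ~0.9 GiB
```

The ~3.9 GiB estimate is the same order as the 3.43 GiB allocation in the traceback, and the train-time tensor is only ~0.9 GiB, which matches the OOM surfacing inside _do_validation rather than during training.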
Ah, did you forget to clear the cache? Judging from your error message, if you were using sentences processed by my code, no sample longer than 300 should show up; I skip anything over 300.
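The point being: the preprocessed datasets are cached to disk (in this repo via fastNLP's cache_results), so a stale cache silently overrides any new clipping logic. A minimal sketch of clearing it, assuming the pickles land under a cache/ directory in the repo root — adjust the glob if your _cache_fp paths differ:

```python
import glob
import os

# Drop stale preprocessed-dataset caches so the clipped data is rebuilt
# on the next run instead of being loaded from the old pickles.
for fp in glob.glob('cache/*'):
    if os.path.isfile(fp):
        os.remove(fp)
        print('removed stale cache file:', fp)
```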
I adjusted the length and cleared the cache; the error did turn out to be the test-time batch being too large. Anyway, thanks for the code 😄
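For anyone else: evaluation runs through fastNLP's Tester, and its batch size is set independently of the training batch. A minimal sketch with illustrative variable names (whether your flat_main.py wires this through a test_batch flag, as in this thread, depends on your copy):

```python
from fastNLP import Tester

# Evaluate with a much smaller batch: peak memory of the position-fusion
# tensor scales linearly with batch size (and quadratically with lattice
# length), so the eval batch is the knob that matters here.
# datasets['test'], model, and metrics are the objects the training
# script already builds.
tester = Tester(data=datasets['test'],
                model=model,
                metrics=metrics,
                batch_size=1,   # was 10 in this thread
                device=0)
tester.test()
```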
Your code fixed the OOM problem for me, thanks!
Hi, does "cache" here mean the cached dataset? (As I understand it, that shouldn't need clearing; it lives on disk and is only loaded at run time.) Or is there something to do during training to keep GPU memory from blowing up? Thanks.
Running on a 32 GB card with clip left at the default 200, it overflows even with batch-size=1. Unbelievable. Then I found that the clip in preprocess doesn't actually clip anything: the longest sentence comes out to 1146 chars, so overflowing is hardly surprising... I modified the code and am posting it for anyone who hits the same problem.
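A minimal sketch of that kind of pre-lattice clip (illustrative names, not the exact patch from this thread). It hard-truncates each (chars, labels) pair to max_len characters before lattice construction; note that a truncation this blunt can cut an entity span at the boundary, so splitting at sentence punctuation is gentler when labels allow it.

```python
def clip_instance(chars, labels, max_len=200):
    """Truncate one (chars, labels) pair to at most max_len characters,
    keeping the character and label sequences aligned."""
    if len(chars) > max_len:
        chars, labels = chars[:max_len], labels[:max_len]
    return chars, labels

# Usage on an illustrative dataset of (chars, labels) pairs,
# including a 1146-char outlier like the one described above:
samples = [(list('一' * 1146), ['O'] * 1146)]
samples = [clip_instance(c, l, max_len=200) for c, l in samples]
assert all(len(c) <= 200 for c, _ in samples)
```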