pyltp 0.2.1: memory suddenly grows during continuous SRL until the process is killed; suspected bug

wenfeixiang1991 commented 5 years ago

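Before asking, please confirm the following:

[x] If your question is about the algorithm or the C++ implementation, please ask at https://github.com/HIT-SCIR/ltp/issues
[x] Your question may duplicate an earlier one, so please confirm that you have searched previous issues before submitting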

Issue type

1. Memory error.
2. Why is the seed in [dynet] random seed: 254078971 random on every run? Shouldn't it be fixed?

labeller = SementicRoleLabeller()
labeller.load(srl_model_path)

>>>
[dynet] random seed: 254078971
[dynet] allocating memory: 2000MB
[dynet] memory allocation done.

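Failure scenario

Case 1: When segmentor, postagger, parser and SementicRoleLabeller are applied in sequence to run SRL on sentences (each under 500 characters), memory climbs from a little over 4 GB at the start to over 6 GB, settles at around 10 GB, stays there for a while, then suddenly jumps to 13 GB and the process is killed.

Case 2: Memory rises very quickly to around 13 to 16 GB right after startup, so the process is killed before SRL has succeeded on even one sentence.

Case 3: Same as case 1, except that the process is not killed at the end. Instead it reports CPU memory allocation failed n =11173625856 align=32 Exception CPU memory allocation failed and then hangs, still holding roughly 13 GB of memory.

Cases 2 and 3 happen occasionally; case 1 happens every time. The run starts fine and can keep doing SRL on more than 30,000 sentences (each under 500 characters), but being killed is only a matter of time.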

I have already looked at issue #141 and suspect a memory leak. Please look into fixing this.
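To back up the leak suspicion with numbers, resident memory can be sampled at fixed intervals while sentences are processed. The following is a minimal sketch, assuming Linux (it reads /proc/self/status); the helper names current_rss_mb and track_rss are hypothetical and not part of pyltp.

```python
def current_rss_mb():
    """Return this process's resident set size in MiB (Linux only)."""
    with open("/proc/self/status", encoding="ascii") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                # The line looks like "VmRSS:   123456 kB".
                return int(line.split()[1]) / 1024.0
    raise RuntimeError("VmRSS not found in /proc/self/status")


def track_rss(process_one, items, every=1000):
    """Call process_one on each item and record (index, RSS in MiB)
    after every `every` items."""
    samples = []
    for index, item in enumerate(items, start=1):
        process_one(item)
        if index % every == 0:
            samples.append((index, current_rss_mb()))
    return samples
```

Passing the SRL function as process_one and a list of sentences as items would give a series of readings; steady growth with inputs of similar size would point to memory being retained between calls rather than to one oversized sentence.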

Code snippet

import io
import json
import os

from pyltp import Segmentor, Postagger, Parser, SementicRoleLabeller

# load ltp =============================================
LTP_DATA_DIR = './ltp_data_v3.4.0'  # path to the LTP model directory
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')  # word segmentation model, file name `cws.model`
pos_model_path = os.path.join(LTP_DATA_DIR, 'pos.model')  # part-of-speech tagging model, file name `pos.model`
par_model_path = os.path.join(LTP_DATA_DIR, 'parser.model')  # dependency parsing model, file name `parser.model`
srl_model_path = os.path.join(LTP_DATA_DIR, 'pisrl.model')  # semantic role labelling model; in ltp_data_v3.4.0 this is the file `pisrl.model`

segmentor = Segmentor()
#segmentor.load(cws_model_path)
# load the model; the second argument is the path to an external lexicon file
segmentor.load_with_lexicon(cws_model_path, './dict_for_ltp/ltp_customer.txt')
postagger = Postagger()
#postagger.load(pos_model_path)
postagger.load_with_lexicon(pos_model_path, './dict_for_ltp/ltp_customer.txt')
parser = Parser()
parser.load(par_model_path)
labeller = SementicRoleLabeller()
labeller.load(srl_model_path)
# ======================================================

read_file = './xxxx.txt'
write_file = './xxxx.txt'

#articles = load_json_line_data(read_file)

killed_count = 32832

count = 0

print('read file ... ')
with io.open(read_file, "r", encoding='utf-8') as f:
    while True:
        line = f.readline()
        #print('read line ... ')
        if len(line) > 0:
            count += 1
            if count <= killed_count:
                continue
            try:
                article = json.loads(line.strip())
                temp_dic = {}
                title = article['title']
                print('srl title ...')
                title_srl_result = get_event_triples_srl(title)
                print('srl title success! ')
                temp_dic['title'] = title
                temp_dic['title_srl_result'] = title_srl_result
                temp_dic['url'] = article['url']
                temp_dic['publishAt'] = article['publishAt']

                sentences_srl_result = []
                p = article['event_discription']
                for s in p.split('。'):
                    for s1 in s.split('；'):

                        if len(s1) > 500:
                            continue

                        if len(s1.strip()) > 0:
                            print('srl sentence ...')
                            s1_srl_result = get_event_triples_srl(s1)
                            print('srl sentence success! ')
                            sentences_srl_result.append({'sentence': s1,
                                                         'sentence_srl_result': s1_srl_result})

                temp_dic['sentences_srl_result'] = sentences_srl_result

                with io.open(write_file, 'a', encoding='utf-8') as f1:
                    f1.write(json.dumps(temp_dic, ensure_ascii=False) + "\n")

                if count % 1000 == 1:
                    print('count: ', count)
                print('count: ', count)
            except Exception as e:
                print('ltp error')
                print("Exception: {}".format(e))
        else:
            break
# -------------------------------------------------------
segmentor.release()
postagger.release()
parser.release()
labeller.release()
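The nested loops over '。' and '；' in the script above can be written as one helper, which makes the 500-character cutoff easier to see and to test on its own. This is only a sketch of the same filtering logic; the name split_sentences is hypothetical.

```python
import re


def split_sentences(text, max_len=500):
    """Split text on the full stop '。' and the semicolon '；', dropping
    pieces that are blank or longer than max_len characters."""
    pieces = re.split("[。；]", text)
    return [piece for piece in pieces if piece.strip() and len(piece) <= max_len]
```

Splitting on either delimiter in one pass yields the same pieces, in the same order, as splitting on '。' first and then on '；'.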

where get_event_triples_srl is defined as:

def get_event_triples_srl(sentence):
    sentence_srl_result = {}

    words = segmentor.segment(sentence)  # word segmentation
    # the join/split round-trip presumably turns pyltp's returned vector into a plain list
    words = '\t'.join(words)
    words = words.split('\t')
    postags = postagger.postag(words)  # part-of-speech tagging
    postags = '\t'.join(postags)
    postags = postags.split('\t')
    #print('words: ', words)
    #print('postags: ', postags)
    sentence_srl_result['words'] = words
    sentence_srl_result['postags'] = postags

    arcs = parser.parse(words, postags)  # dependency parsing
    roles = labeller.label(words, postags, arcs)  # semantic role labelling

followed by further processing of roles.
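Until the underlying cause is found, one generic mitigation for memory that accumulates inside a native extension is to do the work in a child process that is replaced after a fixed number of tasks, so the operating system reclaims whatever that process held. The sketch below is not a fix for the leak. It assumes Linux (the "fork" start method), and the name run_in_recycled_workers is hypothetical.

```python
import multiprocessing


def run_in_recycled_workers(work_fn, items, max_tasks=1000, init_fn=None):
    """Yield work_fn(item) for each item, in order. The work runs in a single
    worker process that is replaced by a fresh one after max_tasks items."""
    # Assumption: the "fork" start method is available (Linux, as in the report).
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=1, initializer=init_fn,
                  maxtasksperchild=max_tasks) as pool:
        # chunksize=1 so that each item counts as one task toward max_tasks.
        for result in pool.imap(work_fn, items, chunksize=1):
            yield result
```

Here work_fn would be a module-level function such as get_event_triples_srl, and init_fn a function that loads the four models inside the worker rather than in the parent. Results have to be picklable, so pyltp return values would need converting to plain lists and dicts first. Each replacement worker reloads the models, so max_tasks trades reload time against peak memory.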

How to reproduce this error

Runtime environment

Linux, Python 3.6, pyltp==0.2.1, models from ltp_data_v3.4.0

Expected result

Other

wenfeixiang1991 commented 5 years ago

@liu946 Is this project no longer being updated? Are there plans to update it in the future?

liu946 commented 5 years ago

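@wenfeixiang1991 The open-source version is no longer being updated. Our latest work will go live on the iFLYTEK Open Platform, and everyone is welcome to use it: https://www.xfyun.cn/services/lexicalAnalysis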

wenfeixiang1991 commented 5 years ago

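Oh, OK. In that case, could I trouble you to fix this memory problem in pyltp? Right now I can't run experiments on even slightly larger datasets, which is a real headache.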

wenfeixiang1991 commented 5 years ago

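@liu946 I was thinking that if this problem can be fixed, the library would still be usable even without further updates. That would be much appreciated! :)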

HIT-SCIR / pyltp