PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

GPU memory overflow after text data augmentation #2171

Closed kingkingofall closed 1 year ago

kingkingofall commented 2 years ago

After using nlpcda for data augmentation, I get a GPU memory overflow; even reducing batch_size does not fix it.

Data augmentation code:

    from nlpcda import CharPositionExchange, Homophone

    smw = CharPositionExchange(create_num=2, change_rate=0.3, char_gram=3, seed=1024)

    class MYDataset(paddle.io.Dataset):
        def __init__(self, sents, labels):
            self.sents = sents
            self.labels = labels

        def __getitem__(self, index):
            data = self.sents[index]
            label = self.labels[index]

            # data augmentation
            # with this line commented out, GPU memory does not overflow;
            # with it enabled, memory overflows after training for a while
            data = smw.replace(data)[1]

            result = {'text_a': data, 'label': label}
            return convert_example(result, tokenizer, max_seq_length=128, is_test=False)

        def __len__(self):
            return len(self.sents)
ZeyuChen commented 2 years ago

During augmentation, did you verify that the batch size stays the same? A memory overflow is most likely caused by the batch size growing significantly.
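
A quick way to check this is to inspect what smw.replace actually returns for one sample. This is a minimal sketch (not from the thread), assuming smw and sents are defined as in the code above; nlpcda's replace returns a list of strings, so both the number of variants and their lengths are worth logging:

    sample = sents[0]
    augmented = smw.replace(sample)
    # log how many variants come back and how long each one is,
    # to rule out the augmented text growing unexpectedly
    print(len(sample), len(augmented), [len(s) for s in augmented])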

kingkingofall commented 2 years ago

Yes, the only difference between the two runs is whether that augmentation line is commented out.

ZeyuChen commented 2 years ago

@kingkingofall What I mean is that this function of yours may be changing the batch size, which increases GPU memory usage. You may need to look into how to keep the augmentation from changing the batch size.

ZeyuChen commented 2 years ago

Also, try moving this augmentation out of the __getitem__ step. Do it inside the training for-loop instead, using a Data Collator.
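
A minimal sketch of that suggestion (an assumed setup, not code from the thread): have the dataset return raw (text, label) pairs, and do augmentation plus tokenization in a collate_fn passed to paddle.io.DataLoader, so exactly one variant per sample enters the batch. smw, tokenizer, and convert_example are assumed to be the ones defined above; batch_size=32 is arbitrary:

    def augment_collate_fn(batch):
        # batch is a list of (text, label) pairs produced by the dataset;
        # augment each text here, keeping one variant per sample so the
        # batch size never changes
        examples = []
        for text, label in batch:
            text = smw.replace(text)[1]
            result = {'text_a': text, 'label': label}
            examples.append(convert_example(result, tokenizer,
                                            max_seq_length=128, is_test=False))
        return examples

    loader = paddle.io.DataLoader(dataset, batch_size=32, collate_fn=augment_collate_fn)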

ZeyuChen commented 2 years ago

Because this operation may keep modifying the original data, the sequences can become too long or the batch size too large, which causes the memory overflow. It should be handled after the __getitem__ step, making sure the original data is never tampered with.
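
Another option, if per-batch augmentation is not required (my own suggestion, not something proposed in the thread): expand the corpus once before training, with the augmentation line removed from __getitem__, so every epoch sees a fixed set of samples and nothing is re-augmented on access:

    # build the augmented corpus once up front; nlpcda typically returns the
    # original sentence as the first variant, followed by the augmented ones
    aug_sents, aug_labels = [], []
    for text, label in zip(sents, labels):
        for variant in smw.replace(text):
            aug_sents.append(variant)
            aug_labels.append(label)

    dataset = MYDataset(aug_sents, aug_labels)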

kingkingofall commented 2 years ago

Oh, I see. OK, thank you.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.