PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

GPU memory overflow after text data augmentation #2171

Closed kingkingofall closed 1 year ago

kingkingofall commented 2 years ago

After using nlpcda for data augmentation, I get a GPU memory overflow; even reducing batch_size does not fix it.

Data augmentation code:

    from nlpcda import CharPositionExchange, Homophone

    smw = CharPositionExchange(create_num=2, change_rate=0.3, char_gram=3, seed=1024)

    class MYDataset(paddle.io.Dataset):
        def __init__(self, sents, labels):
            self.sents = sents
            self.labels = labels

        def __getitem__(self, index):
            data = self.sents[index]
            label = self.labels[index]

            # data augmentation
            # with this line commented out, GPU memory does not overflow;
            # with it enabled, memory overflows after training for a while
            data = smw.replace(data)[1]

            result = {'text_a': data, 'label': label}
            return convert_example(result, tokenizer, max_seq_length=128, is_test=False)

        def __len__(self):
            return len(self.sents)
ZeyuChen commented 2 years ago

During augmentation, did you verify that the batch size stays the same? A memory overflow is most likely caused by the batch size growing significantly.
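
A quick way to check this is to inspect what smw.replace actually returns for one sample. This is a minimal sketch (not from the thread), assuming smw and sents are defined as in the code above; nlpcda's replace returns a list of strings, so both the number of variants and their lengths are worth logging:

    sample = sents[0]
    augmented = smw.replace(sample)
    # log how many variants come back and how long each one is,
    # to rule out the augmented text growing unexpectedly
    print(len(sample), len(augmented), [len(s) for s in augmented])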

kingkingofall commented 2 years ago

Yes, the only difference between the two runs is whether that augmentation line is commented out.

ZeyuChen commented 2 years ago

@kingkingofall What I mean is that this function of yours may be changing the batch size, which increases GPU memory usage. You may need to look into how to keep the augmentation from changing the batch size.

ZeyuChen commented 2 years ago

Also, try moving this augmentation out of the __getitem__ step. Do it inside the training for-loop instead, using a Data Collator.
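
A minimal sketch of that suggestion (an assumed setup, not code from the thread): have the dataset return raw (text, label) pairs, and do augmentation plus tokenization in a collate_fn passed to paddle.io.DataLoader, so exactly one variant per sample enters the batch. smw, tokenizer, and convert_example are assumed to be the ones defined above; batch_size=32 is arbitrary:

    def augment_collate_fn(batch):
        # batch is a list of (text, label) pairs produced by the dataset;
        # augment each text here, keeping one variant per sample so the
        # batch size never changes
        examples = []
        for text, label in batch:
            text = smw.replace(text)[1]
            result = {'text_a': text, 'label': label}
            examples.append(convert_example(result, tokenizer,
                                            max_seq_length=128, is_test=False))
        return examples

    loader = paddle.io.DataLoader(dataset, batch_size=32, collate_fn=augment_collate_fn)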

ZeyuChen commented 2 years ago

Because this operation may keep modifying the original data, the sequences can become too long or the batch size too large, which causes the memory overflow. It should be handled after the __getitem__ step, making sure the original data is never tampered with.
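
Another option, if per-batch augmentation is not required (my own suggestion, not something proposed in the thread): expand the corpus once before training, with the augmentation line removed from __getitem__, so every epoch sees a fixed set of samples and nothing is re-augmented on access:

    # build the augmented corpus once up front; nlpcda typically returns the
    # original sentence as the first variant, followed by the augmented ones
    aug_sents, aug_labels = [], []
    for text, label in zip(sents, labels):
        for variant in smw.replace(text):
            aug_sents.append(variant)
            aug_labels.append(label)

    dataset = MYDataset(aug_sents, aug_labels)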

kingkingofall commented 2 years ago

Oh, I see. OK, thank you.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.