fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
https://gitee.com/fastnlp/fastNLP
Apache License 2.0

[bugfix] Fix the out-of-memory bug caused by the check_code function in Trainer ignoring the pin_memory parameter #400

Open ouyhlan opened 2 years ago

ouyhlan commented 2 years ago

Description: Fix the out-of-memory bug caused by the check_code function in Trainer ignoring the pin_memory parameter.

Main reason: An out-of-memory error occurred while using the fastNLP library, in a scenario where a model was being trained on CPU. After debugging, I found that in core/trainer.py, the _check_code function does not pass the pin_memory parameter when it constructs a Tester, while the Tester class initializes pin_memory to True by default.
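To make the failure mode concrete, here is a minimal sketch of the interaction; the signatures are simplified and hypothetical, not the actual fastNLP source:

```python
# Minimal sketch of the bug; simplified, hypothetical signatures,
# not the actual fastNLP code.

class Tester:
    def __init__(self, data, model, metrics, pin_memory=True):
        # pin_memory defaults to True, so a caller that omits it gets
        # pinned host memory in the underlying DataLoader.
        self.pin_memory = pin_memory

def _check_code(dataset, model, metrics, pin_memory=False, **kwargs):
    # Bug: the caller's pin_memory is never forwarded, so Tester falls
    # back to its own default of True, even for CPU-only training.
    tester = Tester(data=dataset, model=model, metrics=metrics)
    return tester
```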

Full error traceback:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=2 : out of memory
Traceback (most recent call last):
  File "/data/ouyhlan/TextClassification/main.py", line 52, in <module>
    trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 558, in __init__
    _check_code(dataset=train_data, model=self.model, losser=losser, forward_func=self._forward_func, metrics=metrics,
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 1013, in _check_code
    evaluate_results = tester.test()
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/tester.py", line 184, in test
    for batch_x, batch_y in data_iterator:
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/batch.py", line 266, in __iter__
    for indices, batch_x, batch_y in self.dataiter:
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 477, in _next_data
    data = _utils.pin_memory.pin_memory(data)
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
    return [pin_memory(sample) for sample in data]
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
    return [pin_memory(sample) for sample in data]
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in pin_memory
    return {k: pin_memory(sample) for k, sample in data.items()}
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in <dictcomp>
    return {k: pin_memory(sample) for k, sample in data.items()}
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
    return data.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

After setting pin_memory to False, the problem disappears. In addition, following https://github.com/pytorch/pytorch/issues/57273, I suggest that the Trainer and Tester classes leave pin_memory disabled by default for all torch versions.
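For reference, the same trade-off can be reproduced with a plain PyTorch DataLoader (standard torch API; the toy dataset is just a stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# pin_memory=True allocates page-locked host memory for every batch:
# faster host-to-GPU copies, but extra RAM use, and no benefit on CPU.
gpu_loader = DataLoader(dataset, batch_size=32, pin_memory=True)

# For CPU-only training, pin_memory=False avoids the pinned-memory
# allocations that triggered the out-of-memory error above.
cpu_loader = DataLoader(dataset, batch_size=32, pin_memory=False)
```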

Checklist: confirm that each item below is done

Please feel free to remove inapplicable items for your PR.

Changes: describe the changes item by item

Mention: ask someone to review your PR @yhcc

yhcc commented 2 years ago

Thank you very much for submitting again. Last time I did not look carefully at the details you mentioned; I have just reviewed this part of the change once more. I lean toward keeping pin_memory enabled by default: when people actually train neural networks, most of the time it is on a server with GPUs, and this memory cost should be fairly easy for a server to absorb (pin_memory really does speed up data preparation, so enabling it by default can save everyone a little time). Anyone who runs into memory problems can still turn it off manually via pin_memory. I checked the code, and it looks like check_code in Trainer already disables pin_memory by default? https://github.com/fastnlp/fastNLP/blob/9ac7d09431f762ceea98904ecb9ac9200a178c29/fastNLP/core/trainer.py#L957
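For context on the speedup being weighed here, a minimal sketch of why pinned memory helps on GPU machines (standard PyTorch calls on a toy tensor):

```python
import torch

batch = torch.randn(32, 16)
if torch.cuda.is_available():
    # Page-locked (pinned) host memory lets CUDA copy the batch to the
    # device asynchronously and overlap the copy with compute; this is
    # the data-preparation speedup that pin_memory=True buys.
    batch = batch.pin_memory()
    batch = batch.to("cuda", non_blocking=True)
```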

ouyhlan commented 2 years ago

@yhcc

> I checked the code, and it looks like check_code in Trainer already disables pin_memory by default? https://github.com/fastnlp/fastNLP/blob/9ac7d09431f762ceea98904ecb9ac9200a178c29/fastNLP/core/trainer.py#L957

The problem with check_code in Trainer appears in the lines below:

https://github.com/fastnlp/fastNLP/blob/9ac7d09431f762ceea98904ecb9ac9200a178c29/fastNLP/core/trainer.py#L1012-L1013

These lines construct a Tester without passing the pin_memory parameter. Now look at Tester's initializer:

https://github.com/fastnlp/fastNLP/blob/9ac7d09431f762ceea98904ecb9ac9200a178c29/fastNLP/core/tester.py#L116

In other words, no matter whether pin_memory is passed to the Trainer, pin_memory is enabled by default here.

I have also reproduced this memory problem on a server with GPUs. The rough scenario: several people were running jobs on different cards of the same machine, so host memory was not particularly plentiful, and the extra consumption from pin_memory then became a real issue. If you would rather not change the default, you could also take a look at the other fix I submitted earlier: https://github.com/ouyhlan/fastNLP/commit/cac13311e28c1e8e3c866d50656173650eb5c7a1
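The forwarding fix under discussion would look roughly like this; a sketch with simplified, hypothetical signatures, not the actual patch (the linked commit takes a different approach):

```python
class Tester:  # simplified stand-in for fastNLP.core.tester.Tester
    def __init__(self, data, model, metrics, pin_memory=True):
        self.pin_memory = pin_memory

    def test(self):
        return {}

def _check_code(dataset, model, metrics, pin_memory=False, **kwargs):
    # Fix: forward the caller's pin_memory instead of silently taking
    # Tester's default of True.
    tester = Tester(data=dataset, model=model, metrics=metrics,
                    pin_memory=pin_memory)
    return tester.test()
```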

yhcc commented 2 years ago

Right, I think the other approach you mentioned would be better.

yhcc commented 2 years ago

Thanks a lot! Please open a PR based on the code you mentioned.

ouyhlan commented 2 years ago

@yhcc

> Thanks a lot! Please open a PR based on the code you mentioned.

I have already force-pushed (push -f) to this PR; please review the changes directly in this PR.

ouyhlan commented 2 years ago

@yhcc