PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle ('飞桨') core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

tensor.numpy() is very slow when copying a large amount of data from GPU to CPU #46350

Closed · liukaiyueyuo closed this issue 1 year ago

liukaiyueyuo commented 2 years ago

Feature Description

tensor.numpy() is very slow when copying a large amount of data from GPU to CPU: calling tensor.numpy() on about 5M of data took 1.4 s, which is completely unacceptable. What is the cause?

Alternatives

No response

paddle-bot[bot] commented 2 years ago

Hi! We've received your issue; please be patient while we arrange for technical staff to answer it as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You can also look for answers in the official API documentation, the FAQ, historical issues, and the AI community. Have a nice day!

w5688414 commented 2 years ago

The tensor has to be copied from GPU to CPU, and that takes time.

import paddle
import paddle.nn.functional as F
import numpy as np
import time
x = paddle.zeros((256,384,5000))
# x = paddle.to_tensor(x, place=paddle.CPUPlace())
# x = paddle.to_tensor(x)
print(x.shape)

# paddle.device.set_device("cpu")
start_time = time.time()
y = x.numpy()
end_time = time.time()
print('time cost: {}'.format(end_time-start_time))
print(y.shape)

Output:

[256, 384, 5000]
time cost: 1.4152092933654785
(256, 384, 5000)

I tested it, and it does indeed take some time. Do you have any particular requirement?
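
For scale, a rough back-of-the-envelope check (not from the thread itself): a float32 tensor of shape (256, 384, 5000) holds 256 × 384 × 5000 × 4 bytes ≈ 1.97 GB, so copying it in 1.4 s corresponds to roughly 1.4 GB/s. That is in the normal range for a device-to-host transfer into ordinary pageable host memory over PCIe, so for a tensor this large the measured time is not by itself surprising; the original report concerns a much smaller tensor, which is examined below.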

liukaiyueyuo commented 2 years ago

The relevant part of the code is as follows:

with paddle.no_grad():
    wav = self.voc_inference(self.am_inference(phone_ids, spk_emb=spk_emb))
print("wav shape: %s" % wav.shape)
time_1 = time.time()
gpu2cpu_1 = wav.numpy()
time_2 = time.time()
print('1-2 time: {}'.format(time_2 - time_1))
wav2 = paddle.full_like(wav, 0.0)
time_3 = time.time()
gpu2cpu_2 = wav2.numpy()
time_4 = time.time()
print('3-4 time: {}'.format(time_4 - time_3))

The printed output:

wav shape: [2582700, 1]
1-2 time: 1.4252309799194336
3-4 time: 0.0017809867858886719

Why does the first copy of a GPU tensor to the CPU via numpy() take 1.425 s, while the second copy of a tensor of the same size takes only 0.00178 s? The first tensor is the output of a Paddle model, and it is only a [2582700, 1] two-dimensional tensor, far smaller than the tensor in your test code.
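
A likely explanation, offered here as a hedged note rather than something stated at this point in the thread: Paddle, like most deep-learning frameworks, launches GPU kernels asynchronously, so the line that produces wav can return before the vocoder has actually finished computing. The first wav.numpy() then has to wait for all still-running kernels before it can copy, so the 1.425 s is mostly leftover inference time rather than copy time; by the time the second tensor is copied, the device is already idle. A minimal sketch of how to time only the copy, assuming a GPU build of Paddle and that paddle.device.cuda.synchronize() is available as the synchronization call:

import time
import paddle

wav = paddle.ones([2582700, 1])   # stand-in for the model output on the GPU
paddle.device.cuda.synchronize()  # wait for any queued GPU work to finish first
t0 = time.time()
host = wav.numpy()                # now the timer sees only the device-to-host copy
t1 = time.time()
print('copy only: {}'.format(t1 - t0))
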

w5688414 commented 2 years ago

Could it be that the I/O was not released? The time I measured on a V100 is very low:

import paddle
import paddle.nn.functional as F
import numpy as np
import time
x = paddle.zeros((2582700,1))
# x = paddle.to_tensor(x, place=paddle.CPUPlace())
# x = paddle.to_tensor(x)
print(x.shape)
start_time = time.time()
y = x.numpy()
end_time = time.time()
print('time cost: {}'.format(end_time-start_time))
print(y.shape)

At the moment I cannot reproduce your problem.

[2582700, 1]
time cost: 0.00889444351196289
(2582700, 1)

liukaiyueyuo commented 2 years ago

This is with the examples/aishell3/vc1 example in your PaddleSpeech repo: https://github.com/PaddlePaddle/PaddleSpeech.git

Open the script examples/aishell3/vc1/local/voice_cloning.sh and comment out the environment variables, as follows:

text="这几天心里颇不宁静,今晚在院子里坐着乘凉,忽然想起日日走过的荷塘,在这满月的光里,总该另有一番样子吧,月亮渐渐地升高了,墙外马路上孩子们的欢笑 ,已经听不见了,妻在屋里拍着闰儿,迷迷糊糊地哼着眠歌,我悄悄地披了大衫,带上门出去,沿着荷塘,是一条曲折的小煤屑路,这是一条幽僻的路,白天也少人走, 夜晚更加寂寞,荷塘四面,长着许多树,蓊蓊郁郁的,路的一旁,是些杨柳,和一些不知道名字的树,没有月光的晚上,这路上阴森森的,有些怕人,今晚却很好,虽然 月光也还是淡淡的,路上只我一个人,背着手踱着,这一片天地好像是我的,我也像超出了平常的自己,到了另一个世界里,我爱热闹,也爱冷静,爱群居,也爱独处, 像今晚上,一个人在这苍茫的月下,什么都可以想,什么都可以不想,便觉是个自由的人,白天里一定要做的事,一定要说的话,现在都可不理,这是独处的妙处,我且 受用这无边的荷香月色好了,曲曲折折的荷塘上面,弥望的是田田的叶子,叶子出水很高,像亭亭的舞女的裙,层层的叶子中间,零星地点缀着些白花,有袅娜地开着的 ,有羞涩地打着朵儿的,正如一粒粒的明珠,又如碧天里的星星,又如刚出浴的美人,微风过处,送来缕缕清香,仿佛远处高楼上渺茫的歌声似的,这时候叶子与花也有 一丝的颤动,像闪电般,霎时传过荷塘的那边去了,叶子本是肩并肩密密地挨着,这便宛然有了一道凝碧的波痕,叶子底下是脉脉的流水,遮住了,不能见一些颜色,而 叶子却更见风致了。"

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
/data/user/liukai/env/miniconda3/bin/python3 ${BIN_DIR}/../voice_cloning.py \
    --am=fastspeech2_aishell3 \
    --am_config=${config_path} \
    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_aishell3 \
    --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --ge2e_params_path=${ge2e_params_path} \
    --text=${text} \
    --input-dir=${ref_audio_dir} \
    --output-dir=${train_output_path}/vc_syn \
    --phones-dict=dump/phone_id_map.txt

Open the source file paddlespeech/t2s/exps/voice_cloning.py and modify it as follows:

import time

print("1: %s" % time.time())
wav = wav.numpy()
print("2: %s" % time.time())
sf.write(
    str(output_dir / (utt_id + ".wav")),
    wav,
    samplerate=am_config.fs)

Finally, run the run.sh script in the examples/aishell3/vc1 directory:

./run.sh --stage 3 --stop-stage 3

You will find that wav.numpy() takes 1.4 s!

With the GPU memory-management environment variables commented out, the voice-cloning inference in voice_cloning.py, wav = voc_inference(am_inference(phone_ids, spk_emb=random_spk_emb)), takes 0.8 s and wav.numpy() takes 1.4 s. But if the environment variables are not commented out, the same inference takes an incredible 10 s, while wav.numpy() takes only 0.002 s.

So what is the problem?

yt605155624 commented 2 years ago

I reproduced this. With these two lines not commented out:

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \

inference time: 10.39250373840332
wav.numpy() time: 0.010816574096679688

With them commented out, the times are:

inference time: 0.9181728363037109
wav.numpy() time: 1.3469219207763672

When I run vc2, without them commented out:

inference time: 10.800944805145264
wav.numpy() time: 0.008228302001953125

With them commented out:

inference time: 1.2298469543457031
wav.numpy() time: 1.4255139827728271

Further logging shows that most of the time is spent in pwgan (9 s; that is actually an expected duration for pwgan. What surprises me now is that commenting out these two lines makes pwgan faster, presumably thanks to framework improvements. I'll check whether commenting them out also speeds things up elsewhere, because these two lines were originally added since the framework used to allocate a large amount of GPU memory during dynamic-to-static conversion and would easily OOM without them; voice cloning does not use dynamic-to-static conversion, so they can be removed.)

With FLAGS_fraction_of_gpu_memory_to_use=0.01 set, it is expected that the vocoder is slow: this setting allows only 1% of GPU memory to be requested at a time, so if more memory is needed it has to be requested incrementally, which is time-consuming.

So in practice it is indeed better to comment out these two lines (I will change this in the code; thanks for catching it). But the wav.numpy() timing issue may still need the framework team to check whether it is related to these two lines:

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
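
For context, a hedged note on these two FLAGS: FLAGS_fraction_of_gpu_memory_to_use=0.01 tells the allocator to request only 1% of GPU memory at a time (roughly 160 MB on a 16 GB V100), so a model that needs more has to grow its pool in many small steps, which matches the slow inference observed above. They are ordinary environment variables read when Paddle initializes, so an A/B comparison can also be driven from Python, assuming (as hedged here) that they must be set before the first import of paddle and that paddle.get_flags can echo them back:

import os

# GPU memory-management FLAGS from the scripts above; set them before importing
# paddle so the allocator is configured with them (assumption about init order).
os.environ["FLAGS_allocator_strategy"] = "naive_best_fit"
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.01"

import paddle  # imported after setting the FLAGS on purpose

# Sanity check: read the flag back (paddle.get_flags is assumed to exist here).
print(paddle.get_flags(["FLAGS_fraction_of_gpu_memory_to_use"]))
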
liukaiyueyuo commented 2 years ago

OK. I hope your framework colleagues can take a look soon; finding the problem will make PaddlePaddle even better.

yt605155624 commented 2 years ago

I ran an experiment.

test.py

import time
import paddle
wav = paddle.ones([2582700, 1])
print("wav shape: %s" % wav.shape)
time_1 = time.time()
gpu2cpu_1 = wav.numpy()
time_2 = time.time()
print('1-2 time: {}'.format(time_2 - time_1))
wav2 = paddle.full_like(wav, 0.0)
time_3 = time.time()
gpu2cpu_2 = wav2.numpy()
time_4 = time.time()
print('3-4 time: {}'.format(time_4 - time_3))

test.sh

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 test.py

Without the FLAGS commented out:

wav shape: [2582700, 1]
1-2 time: 0.007472991943359375
3-4 time: 0.007332324981689453

With the FLAGS commented out:

wav shape: [2582700, 1]
1-2 time: 0.0074002742767333984
3-4 time: 0.008489370346069336

So we can conclude that these two FLAGS do not directly affect the speed of tensor.numpy(). When running the inference code, however, loading and running the model may change the state of GPU memory (the code above barely uses the GPU at all), and that in turn affects the measured speed of tensor.numpy().
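
One way to probe that conclusion further (a sketch only, not something run in this thread; it assumes a GPU build of Paddle and that paddle.device.cuda.synchronize() exists) is to queue some GPU work right before the copy, the way real inference does, and then time wav.numpy() with and without an explicit synchronize. If asynchronous execution is the culprit, the first measurement should be much larger than the second:

import time
import paddle

def busy_gpu():
    # Queue a batch of asynchronous GPU kernels, standing in for model inference.
    a = paddle.randn([2048, 2048])
    for _ in range(200):
        a = paddle.matmul(a, a) * 1e-3
    return a

wav = paddle.ones([2582700, 1])

# Case 1: copy right after queuing work; numpy() must wait for the kernels to drain.
busy_gpu()
t0 = time.time()
_ = wav.numpy()
print('copy after queued work: {}'.format(time.time() - t0))

# Case 2: synchronize first, so the timer captures only the device-to-host copy.
busy_gpu()
paddle.device.cuda.synchronize()
t0 = time.time()
_ = wav.numpy()
print('copy after synchronize: {}'.format(time.time() - t0))
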

paddle-bot[bot] commented 1 year ago

Since you haven't replied for more than a year, we have closed this issue/PR. If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.