mayowaosibodu opened this issue 1 year ago
Hi @mayowaosibodu , BF16 is good to have but not necessary. The E5-2690 v4 only supports AVX2 (no AVX-512), which is what caused the error.
However, you can still use the parallel inference speedup. Try to 1) disable bf16, and 2) enable the parallel speedup with the following steps:
1) Comment out https://github.com/Spycsh/xtalker/blob/main/src/facerender/modules/make_animation.py#L140C12-L140C97 and https://github.com/Spycsh/xtalker/blob/main/src/facerender/animate.py#L78, and fix the indentation (see the sketch after this list for the animate.py line).
2) Follow https://github.com/Spycsh/xtalker#acceleration-by-iomp: run python generate_distributed_infer.py --slot=7 --core=14 --driven_audio <path_to_your_audio_file> --source_image <path_to_your_source_image> (since your hardware has 14 physical cores), then run bash run_distributed_infer_7.sh.
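
For context on step 1: animate.py#L78 is the same ipex.optimize call that appears in the traceback at the bottom of this issue. Instead of deleting it outright, one option (just a sketch; the helper name is mine and not in the repo) is to fall back to fp32 when the bf16 prepack assertion fires:

```python
import torch
import intel_extension_for_pytorch as ipex

def maybe_bf16_optimize(model: torch.nn.Module) -> torch.nn.Module:
    """Return an ipex bf16-optimized model when the CPU supports the required
    AVX-512 features, and the unchanged fp32 model otherwise."""
    try:
        # Same call that raises in the traceback at the bottom of this issue.
        return ipex.optimize(model, dtype=torch.bfloat16)
    except AssertionError:
        # Raised on AVX2-only CPUs such as the E5-2690 v4 (no avx512bw/vl/dq).
        return model
```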
I realize it is a bit complicated for you and other users to get the simple parallel speedup without bf16, so I will push some fixes in the coming days and let you know when they are ready.
Hi @mayowaosibodu , I've already made the fix. You can now simply run python generate_distributed_infer.py --slot=7 --driven_audio <path_to_your_audio_file> --source_image <path_to_your_source_image> and then run bash run_distributed_infer_7.sh to get the speedup without bf16.
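
To give a rough mental model of what these two commands do (this is an illustrative sketch only, not the repo's actual generate_distributed_infer.py; the per-worker flags --rank/--p_num and the exact core binding are assumptions), the generator essentially writes a shell script that launches one numactl-pinned worker per slot and waits for all of them:

```python
# Illustrative sketch: emit a run_distributed_infer_<N>.sh with one pinned worker per slot.
N_SLOTS = 7
CORES_PER_SLOT = 2  # e.g. 14 physical cores split across 7 slots

lines = ["#!/bin/bash"]
for slot in range(N_SLOTS):
    first = slot * CORES_PER_SLOT
    last = first + CORES_PER_SLOT - 1
    lines.append(
        f"numactl --localalloc --physcpubind={first}-{last} "
        f"python inference.py --driven_audio <audio> --source_image <image> --cpu "
        f"--rank {slot} --p_num {N_SLOTS} &"
    )
lines.append("wait")

with open(f"run_distributed_infer_{N_SLOTS}.sh", "w") as f:
    f.write("\n".join(lines) + "\n")
```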
Any feedback is welcome :)
Hello @Spycsh, thanks for the response.
So I just tried out your suggestions. Commenting out the lines you mentioned led to some errors. For example, commenting out kp_norm = kp_driving in https://github.com/Spycsh/xtalker/blob/main/src/facerender/modules/make_animation.py#L140C12-L140C97 led to a "kp_norm not defined" error.
I undid the comments and then followed the instructions in your second response. However, I was still getting the bf16-related errors after running the command you mentioned. To run the command successfully I had to remove the --bf16 argument. (Including the argument means you want to use bf16, right? And removing it means you don't?)
So now the inference runs. The issue I'm facing now is that it's slow. The Face Render stage for the same image and audio took about 1 minute with SadTalker; running it with xtalker takes about 25 minutes (on just one worker).
Any idea what could be responsible for that?
Hi @mayowaosibodu , thanks for the feedback. Yes, after pulling the new commit you no longer need to comment out any lines, so ignore my first answer. Your understanding in the second question is correct: passing --bf16 means you want to use bf16, and in your case you should just leave it out.
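
To make the flag semantics explicit, --bf16 behaves like an ordinary boolean switch, i.e. something along these lines (a sketch of the idea, not the exact argument definition in inference.py):

```python
import argparse

parser = argparse.ArgumentParser()
# A store_true flag defaults to False, so bf16 is only enabled when --bf16 is passed.
parser.add_argument("--bf16", action="store_true",
                    help="use ipex bf16 optimization (needs AVX-512 bf16 prepack support)")

print(parser.parse_args([]).bf16)          # False -> plain fp32 path
print(parser.parse_args(["--bf16"]).bf16)  # True  -> bf16 path
```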
Now let's try to find out why it is slow. Are you doing the experiments on the same machine? In other words, the 1 minute with SadTalker and the 25 minutes with xtalker were measured on the same machine, right? Could you also tell me the length of your driven audio?
If yes, then try passing --slot=1 when running generate_distributed_infer.py and then run run_distributed_infer_1.sh, which falls back to a single process with no parallel execution. That case should be approximately the same speed as the naive SadTalker implementation.
If everything is correct so far, I suspect the AVX2 limitation is the cause: unlike AVX-512, it does not let a single core do high-dimensional matrix multiplication efficiently. To validate this idea, you can decrease --slot=7 to, say, --slot=2 and run run_distributed_infer_2.sh. The speed should, I think, fall between the naive SadTalker run (the same as --slot=1) and the --slot=7 run.
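
If it helps with the comparison, a tiny wrapper like this (just a convenience sketch, assuming the run_distributed_infer_<N>.sh scripts have already been generated for each slot count) records the wall-clock time of each run:

```python
import subprocess
import time

# Convenience sketch: run each generated script and print its wall-clock duration.
for slots in (1, 2, 7):
    start = time.time()
    subprocess.run(["bash", f"run_distributed_infer_{slots}.sh"], check=True)
    print(f"--slot={slots}: {time.time() - start:.1f} s wall clock")
```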
Also, remember to run bash gc.sh to clear all leftover intermediate results whenever a run exits with an exception.
Thanks for the feedback :)
After running run_distributed_infer_7.sh, it shows:
numactl: This system does not support NUMA policy
numactl: This system does not support NUMA policy
numactl: This system does not support NUMA policy
numactl: This system does not support NUMA policy
numactl: This system does not support NUMA policy
numactl: This system does not support NUMA policy
numactl: This system does not support NUMA policy
Hi @aiquanpeng , I think your platform does not have NUMA enabled. Please run numactl --hardware and see whether it errors. I'm not sure whether you can enable NUMA through the BIOS or some other way on your platform, because my testing environment has NUMA enabled by default. Maybe https://unix.stackexchange.com/questions/575470/numactl-this-system-does-not-support-numa-policy can help you a bit.
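
If numactl --hardware is inconclusive, you can also check what the kernel itself exposes; on Linux each NUMA node appears under /sys/devices/system/node (a generic Linux check, nothing xtalker-specific):

```python
from pathlib import Path

# List the NUMA nodes the kernel exposes. Only one node (or none) means numactl
# has no NUMA topology to bind to, which matches the error above.
nodes = sorted(p.name for p in Path("/sys/devices/system/node").glob("node[0-9]*"))
print("NUMA nodes visible to the kernel:", nodes or "none")
```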
In case you do not have numactl, you can still play with the other optimization by passing --bf16 to the command, e.g. python inference.py --driven_audio xxx.wav --source_image xxx.jpg --result_dir ./results --cpu --bf16, to observe the performance speedup.
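
For context, what --bf16 buys you is roughly the pattern below: prepack the model weights for bf16 and run the forward pass under CPU autocast (a sketch under the assumption that the renderer is a regular torch.nn.Module; it is not the repo's exact code):

```python
import torch
import intel_extension_for_pytorch as ipex

def bf16_forward(model: torch.nn.Module, example_input: torch.Tensor) -> torch.Tensor:
    """Run one forward pass with ipex bf16 weight prepack plus CPU autocast."""
    model = model.eval()
    # Same prepack call as in the traceback below; it asserts on AVX2-only CPUs.
    model = ipex.optimize(model, dtype=torch.bfloat16)
    with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
        return model(example_input)
```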
Hello,
I recently came across this repo, and I think it's cool. I'm interested in running SadTalker inference at much faster speeds, and I'm curious about the speedups this repo provides.
I'm trying the inference on an Intel Xeon E5-2690 v4 (Broadwell) CPU, but I'm getting the error below.
Here's the general output after running the inference command:
```
using safetensor as default
start to generate video...
1694444593.596219
device========= cpu
---------device----------- cpu
0000: Audio2Coeff 0.9642581939697266
Traceback (most recent call last):
  File "inference.py", line 217, in <module>
    main(args)
  File "inference.py", line 51, in main
    animate_from_coeff = AnimateFromCoeff(sadtalker_paths, device)
  File "/home/demo/xtalker/src/facerender/animate.py", line 78, in __init__
    self.generator = ipex.optimize(self.generator, dtype=torch.bfloat16)
  File "/home/demo/.local/lib/python3.8/site-packages/intel_extension_for_pytorch/frontend.py", line 526, in optimize
    assert core.onednn_has_bf16_support(), \
AssertionError: BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq, please set dtype to torch.float or set weights_prepack to False.
```
Is the BF16 weight prepack important for the xtalker speedups?
If yes, does this mean xtalker can't run effectively on the Xeon E5-2690 v4 (Broadwell) CPU? What CPUs can it run on (in addition to the Xeon Sapphire Rapids CPU)?
I'm running xtalker on Azure VMs.
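
In case it is useful for anyone on a similar Azure VM, here is a quick way to check ahead of time whether the CPU exposes the AVX-512 features the assertion asks for (generic Linux check, not part of xtalker):

```python
# Read the CPU feature flags and report the three features named in the assertion.
# On a Broadwell E5-2690 v4 (AVX2 only) all three will be missing.
with open("/proc/cpuinfo") as f:
    flags = set(next(line for line in f if line.startswith("flags")).split())

for feature in ("avx512bw", "avx512vl", "avx512dq"):
    print(f"{feature}: {'present' if feature in flags else 'missing'}")
```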