Spycsh / xtalker

Faster Talking Face Animation on Xeon CPU
MIT License
120 stars 9 forks

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU. #3

Open diabolo98 opened 1 year ago

diabolo98 commented 1 year ago

Hello, I wanted to try your fork of SadTalker in Colab, but I keep getting this error:

--------device----------- cpu
Traceback (most recent call last):
  File "inference.py", line 217, in <module>
    main(args)
  File "inference.py", line 43, in main
    preprocess_model = CropAndExtract(sadtalker_paths, device)
  File "/content/xtalker/src/utils/preprocess.py", line 49, in __init__
    self.propress = Preprocesser(device)
  File "/content/xtalker/src/utils/croper.py", line 22, in __init__
    self.predictor = KeypointExtractor(device)
  File "/content/xtalker/src/face3d/extract_kp_videos_safe.py", line 28, in __init__
    self.detector = init_alignment_model('awing_fan', device=device, model_rootpath=root_path)
  File "/usr/local/lib/python3.8/dist-packages/facexlib/alignment/__init__.py", line 19, in init_alignment_model
    model.load_state_dict(torch.load(model_path)['state_dict'], strict=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 930, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 876, in persistent_load
    wrap_storage=restore_location(obj, location),
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 152, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 136, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I followed the instructions and tried both the int8 branch and the main branch. I spent hours trying to make it work, with and without a GPU, but had no success. I also tried adding map_location=torch.device('cpu') to every torch.load call in inference.py.

I changed the installed Python version from 3.10 to 3.8 because that is what SadTalker requires. I believe I also tried with Python 3.10, but I doubt the problem is the Python version. Thanks in advance.

Spycsh commented 1 year ago

Hi @diabolo98, this error comes from the facexlib dependency, not from inference.py. In /usr/local/lib/python3.8/dist-packages/facexlib/alignment/__init__.py, try changing the line

model.load_state_dict(torch.load(model_path)['state_dict'], strict=True)

to

model.load_state_dict(torch.load(model_path, map_location=device)['state_dict'], strict=True)
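
If you would rather not edit the installed package by hand on Colab, a cell like the following applies the same one-line patch programmatically. This is just a sketch: the site-packages path and the exact source string are taken from your traceback and may differ in other facexlib versions.

    # Patch the installed facexlib so the alignment weights are loaded with map_location.
    # Path and replaced string are assumptions based on the traceback above.
    path = "/usr/local/lib/python3.8/dist-packages/facexlib/alignment/__init__.py"
    with open(path) as f:
        src = f.read()
    src = src.replace(
        "torch.load(model_path)",
        "torch.load(model_path, map_location=device)",
    )
    with open(path, "w") as f:
        f.write(src)
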
diabolo98 commented 1 year ago

Thanks, I would never have thought of changing the init file. It fixed this issue, but xtalker still wouldn't work.

After the problem above was solved, I hit another issue related to intel_extension_for_pytorch. I checked the installation guide and found that its version needs to match the torch version, so I installed the matching one with !python3.8 -m pip install intel_extension_for_pytorch==2.0.0 -f https://developer.intel.com/ipex-whl-stable-cpu, but then I got an error saying the CPU doesn't support AVX2, which it does. Apparently this is a known problem that was only fixed in intel_extension_for_pytorch 2.0.1. I was able to bypass the check by editing /usr/local/lib/python3.8/dist-packages/intel_extension_for_pytorch/cpu/_cpu_isa.py.

Then I got an error about the CPU not having AVX512 and friends, which I then "fixed?" by changing the dtype from bfloat16 to float on line 78 of /content/xtalker/src/facerender/animate.py. Now I get the following, and honestly I have no idea what it's about:

 device========= cpu
---------device----------- cpu
0000: Audio2Coeff
0.9539964199066162
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
0001: AnimateFromCoeff
5.883184432983398
3DMM Extraction for source image
landmark Det:: 100% 1/1 [00:04<00:00,  4.24s/it]
3DMM Extraction In Video:: 100% 1/1 [00:00<00:00,  5.08it/s]
0002: preprocess_model generate
11.044325828552246
eyeblick? pose?
None
None
mel:: 100% 1452/1452 [00:00<00:00, 32997.74it/s]
audio2exp:: 100% 146/146 [00:08<00:00, 17.53it/s]
0003: audio_to_coeff generate...
23.92700743675232
/content/2023_08_21_21.06.47/00963-1453378001##output_1.mat
rank, p_num: 0, 1
[kp_detector]:
1.2990715503692627
[mapping]:
0.021234512329101562
0.014824390411376953
Face Renderer::   0% 0/726 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 217, in <module>
    main(args)
  File "inference.py", line 152, in main
    result = animate_from_coeff.generate(data, save_dir, pic_path, crop_info, \
  File "/content/xtalker/src/facerender/animate.py", line 178, in generate
    predictions_video = make_animation(source_image, source_semantics, target_semantics,
  File "/content/xtalker/src/facerender/modules/make_animation.py", line 141, in make_animation
    out = generator(source_image, kp_source=kp_source, kp_driving=kp_norm)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/xtalker/src/facerender/modules/generator.py", line 212, in forward
    out = self.first(source_image)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/xtalker/src/facerender/modules/util.py", line 255, in forward
    out = self.conv(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/intel_extension_for_pytorch/nn/utils/_weight_prepack.py", line 127, in forward
    return torch.ops.torch_ipex.convolution_forward(x, self.weight, self.bias, self.ctx.get_data_handle(), self.weight_size, self.padding, self.stride, self.dilation)
  File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: could not create a primitive descriptor for a convolution forward propagation primitive

diabolo98 commented 1 year ago

Here's the colab : https://colab.research.google.com/gist/diabolo98/dbffa21b297174058ca903763d45db8e/test-11111.ipynb

Spycsh commented 1 year ago

Hi @diabolo98, I just noticed that you are running the experiment on Colab. I have never tested it on Colab with CPU only, and Colab's CPUs are limited. I took a quick look at the Colab CPU: there is only 1 physical core with hyperthreading, which means the IOMP parallel optimization in xtalker should not be enabled, because there is nothing to parallelize on a single physical core. Of course, you are not enabling it anyway, since I can see rank, p_num: 0, 1 in your log, so that part doesn't matter.

However, the BFloat16 optimization alone can still bring some performance improvement (1~2x). You mentioned that intel_extension_for_pytorch==2.0.0 did not work; have you tried installing the latest intel_extension_for_pytorch==2.0.100? Please do not change bfloat16 to float, because bfloat16 is what provides the acceleration.

Regarding the last error, RuntimeError: could not create a primitive descriptor for a convolution forward propagation primitive, that is a oneDNN error. It is probably caused by changing bfloat16 to float, which puts the model on an invalid path that can conflict with the oneDNN kernel size or stride. I recommend trying the 2.0.100 ipex version first, keeping bfloat16, and seeing whether the error persists.
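For context, the bfloat16 path follows the standard IPEX pattern, roughly like the sketch below (a toy model, not the exact code in animate.py): ipex.optimize prepacks the weights for the chosen dtype, so changing only one side to float can leave the prepacked kernels and the activations out of sync.

    import torch
    import intel_extension_for_pytorch as ipex

    # Toy stand-in for the face renderer; xtalker loads the real generator elsewhere.
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
    model = ipex.optimize(model, dtype=torch.bfloat16)   # prepack weights for bf16 oneDNN kernels

    x = torch.randn(1, 3, 64, 64)
    with torch.no_grad(), torch.cpu.amp.autocast():      # run conv/linear in bfloat16
        out = model(x)
    print(out.dtype)                                     # torch.bfloat16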

diabolo98 commented 1 year ago

I completely forgot to mention that I tried with intel_extension_for_pytorch==2.0.100 too, but it hit the same AVX512-not-found error. I guess there is no way to make it work on Colab, since the CPU Google provides doesn't have AVX512. I was originally just interested in the int8 conversion and didn't realize everything was so tied together.
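A quick way to confirm what the Colab CPU actually reports is to read /proc/cpuinfo, for example with something like this (a generic check, nothing specific to xtalker):

    # List which of the relevant ISA extensions the CPU advertises.
    with open("/proc/cpuinfo") as f:
        flags = f.read().lower()
    for isa in ("avx2", "avx512f", "avx512_bf16", "amx_bf16"):
        print(isa, "yes" if isa in flags else "no")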

Your help was deeply appreciated. Thanks again.

Spycsh commented 1 year ago

The int8 optimization itself does not depend on intel_extension_for_pytorch here, so it will not raise the above errors. But again, I have not tested it in the Colab environment yet; I will do some experiments at the weekend.

diabolo98 commented 1 year ago

I think the Colab CPU is too limited to be of any use here. I tried Tortoise TTS, which advertises a huge speed-up with half precision, DeepSpeed, and KV-cache optimization, but on the Colab CPU it estimates 4 hours for a few minutes of TTS, so no matter how much you optimize xtalker it will probably still be unbearably slow. Nonetheless, I have a few questions:

Spycsh commented 1 year ago

Colab is indeed limited on the CPU side, because it expects you to use a GPU or TPU for compute-intensive work. My optimization is tested on Xeon Sapphire Rapids (check the README), so I fully understand why you might think it makes no sense to do this on Colab.

I do not think the converted int8 model will work on GPU. My int8 optimization is based on https://github.com/intel/neural-compressor, which is mostly tested on Intel Xeon CPUs. I recommend trying TensorRT if you really want to do int8 optimization on an NVIDIA GPU.

I think Colab is enough to do the int8 conversion on CPU, although I guess it will be slow. You can share the converted model on HF if you want, but you should follow the MIT license, which the original SadTalker has.
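If you want to try the conversion yourself, it is roughly the standard post-training quantization flow from neural-compressor. The sketch below uses a toy model and random calibration data just to show the shape of the API; it is not the actual conversion script in this repo.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from neural_compressor import PostTrainingQuantConfig, quantization

    # Toy model and calibration data; the real target is the face renderer generator.
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
    calib = DataLoader(TensorDataset(torch.randn(8, 3, 64, 64), torch.zeros(8)), batch_size=2)

    q_model = quantization.fit(model=model,
                               conf=PostTrainingQuantConfig(),   # static int8 post-training quantization
                               calib_dataloader=calib)
    q_model.save("./int8_model")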

By the way, xtalker has nothing to do with TTS; the slow speed of a TTS component is no reason to assume xtalker will also be slow. I have tested Tortoise TTS before, and it is, as its name implies, SUPER slow. If you want to pipeline TTS + xtalker, try other TTS alternatives :)

diabolo98 commented 1 year ago

I usually try each component individually, so it is not a problem of using multiple models at once.

I know xtalker has nothing to do with TTS. I gave it as an example because it advertised a huge speed boost on CPU, and so did Bark TTS (or was it just Bark? I tried both), yet they are still extremely slow on the Colab CPU. On the other hand, Tortoise TTS recently became relatively fast on the Colab GPU and is now very usable. I had hoped int8 would improve the speed on GPU, but as you said it wouldn't work. Also, my knowledge of AI is sadly quite limited, so I don't think I can use TensorRT to do the int8 conversion for GPU unless it is close to a drag-and-drop process that only requires simple adaptation and a lot of documentation reading. I think we have diverged a fair bit from the main issue and the purpose of this repo, and I apologize for that.

Spycsh commented 1 year ago

No problem, your feedback is welcome :)