fudan-generative-vision / champ

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
https://fudan-generative-vision.github.io/champ/

Some doubts after testing Champ #72

Open ZZfive opened 7 months ago

ZZfive commented 7 months ago

Without data preprocessing, a random picture was used as the ref_image together with the provided motion_6 for inference. The result is shown below. The consistency of the character's movements is very good, but the character's face is badly damaged. This is probably because, without preprocessing, the human body information in the ref_image and the figure in the motion are not aligned.

https://github.com/fudan-generative-vision/champ/assets/57706634/c1e0a4f4-e7df-4147-9a0e-6e751ed97399

Because the paper mentions that Champ was tested on the UBC fashion dataset, the following video from the UBC fashion dataset was selected as the guidance motion in order to test the data preprocessing pipeline.

https://github.com/fudan-generative-vision/champ/assets/57706634/40b1f05c-53df-4be1-9b73-1d00f8faca1f

Following the data preprocessing doc, after completing the environment setup, the required depth, normal, semantic_map, and dwpose features can be successfully extracted from the motion guidance video. But I ran into a problem: the semantic_map output was missing two frames for some reason. Have you encountered this during data preprocessing? Since the 14 s motion guidance video has 422 frames in total, the difference between adjacent frames is small, so the two missing semantic_map frames were filled by directly copying the previous frame.
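For anyone who hits the same gap, a minimal sketch of that stopgap fix (the directory and zero-padded frame-naming scheme below are assumptions, not Champ's documented layout):

```python
import os
import shutil

# Hypothetical layout: semantic_map frames saved as zero-padded PNGs,
# e.g. transferd_result/semantic_map/0000.png ... 0421.png (422 frames).
frame_dir = "transferd_result/semantic_map"
for i in range(1, 422):
    path = os.path.join(frame_dir, f"{i:04d}.png")
    if not os.path.exists(path):
        # Adjacent frames differ very little in a 14 s / 422-frame clip,
        # so copying the previous frame is an acceptable stopgap.
        shutil.copy(os.path.join(frame_dir, f"{i - 1:04d}.png"), path)
```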

In the figure below, the left side is the first frame of the guidance motion video (960×1254), the right side is the reference image (451×677), and the middle is the depth map of that first frame after data preprocessing. You can see that the depth map has been resized to match the reference image's 451×677, and the human body parts are also more closely aligned.

However, running inference with the data preprocessed from the above reference image and guidance motion video gives very bad results, as shown below: the video jitters heavily, and the characters' faces and bodies are severely distorted.

https://github.com/fudan-generative-vision/champ/assets/57706634/ff679937-b90c-4a5f-b24e-17f59ce04f37

Can somebody tell me the reason for the poor performance, or offer some suggestions for improvement? Thanks.

zhou-linpeng commented 7 months ago

Can you show your grid_wguidance.mp4 results? I think it's flickering in your condition maps that causes results like this.

ZZfive commented 7 months ago

> Can you show your grid_wguidance.mp4 results? I think it's flickering in your condition maps that causes results like this.

I hadn't noticed that the reference image was an RGBA image, so a size-mismatch error occurred when saving grid_wguidance.mp4, which is why it couldn't be provided above; I only just discovered this. After converting the reference image to RGB, I got grid_wguidance.mp4, shown below. As you guessed, grid_wguidance.mp4 has serious flickering. I followed the doc for the data preprocessing steps. What problems could cause the flickering in the condition maps?

https://github.com/fudan-generative-vision/champ/assets/57706634/a8a2a6eb-4df9-4c49-b3bd-ce86364982d8
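For reference, the RGB conversion that fixes this kind of save error can be as simple as the following sketch (the file path is a placeholder, and a PIL-based pipeline is assumed):

```python
from PIL import Image

# An RGBA reference image has 4 channels, which breaks stacking it into
# the 3-channel grid video; dropping the alpha channel fixes the mismatch.
ref = Image.open("ref_image.png")  # hypothetical path
if ref.mode != "RGB":
    ref = ref.convert("RGB")
ref.save("ref_image.png")
```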

zhanghongyong123456 commented 7 months ago

My results also flicker particularly badly. This is my grid.mp4:

https://github.com/fudan-generative-vision/champ/assets/48466610/3d20754d-5a06-4c98-8765-4bd5736cce91

faiimea commented 7 months ago

I followed each step of the data_process pipeline, and both the background flicker and the facial distortion appeared in my generated video. I also ran video generation using the transferd_result produced by data_process together with the reference image provided in the source code, and the same problems occurred. I suspect the alignment between the video and the image may be causing the problem.

I want to know if there is any way to solve the facial distortion and the background flicker. Also, what images are stored under the 'champ/transferd_result/visualized_imgs' path? What I currently observe is a superposition of the normal image and the reference image, but I don't know what that means. Please let me know if I did something wrong that caused the visualized_imgs error.

zhou-linpeng commented 7 months ago

> What problems could cause the flickering in the condition maps?

https://github.com/fudan-generative-vision/champ/assets/147801386/925d5f23-f6a4-4171-8e2e-c7076f1ec809

Here is my result. You can apply some deflicker methods to your condition maps.
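One common deflicker approach is simple temporal smoothing of the condition-map frames. Below is a minimal sketch using an exponential moving average; the directory layout is an assumption, and stronger methods (optical-flow-guided or neural deflickering) also exist:

```python
import glob

import cv2
import numpy as np

# Hypothetical layout: condition maps stored as numbered PNG frames.
frames = sorted(glob.glob("transferd_result/depth/*.png"))
alpha = 0.6  # weight of the current frame; lower = stronger smoothing
prev = None
for path in frames:
    cur = cv2.imread(path).astype(np.float32)
    # Blending each frame with the smoothed history damps high-frequency
    # frame-to-frame flicker, at the cost of slight temporal lag.
    smoothed = cur if prev is None else alpha * cur + (1 - alpha) * prev
    prev = smoothed
    cv2.imwrite(path, smoothed.astype(np.uint8))
```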

ZZfive commented 7 months ago

> Here is my result. You can apply some deflicker methods to your condition maps.

Which deflicker methods can I try? Can you tell me?

faiimea commented 7 months ago

The first video uses ref-07.png and motion-02; the second video uses ref-07.png and the processed video.

https://github.com/fudan-generative-vision/champ/assets/87272252/f153a3fe-4181-40b7-8d36-cd47d3d0e122

https://github.com/fudan-generative-vision/champ/assets/87272252/85313a6f-bf2e-4b4b-b006-a07fcf669dea

And the face distortion looks like this:

Screenshot 2024-04-16 15:14

subazinga commented 7 months ago

We will release a SMPL smoothing feature soon, maybe this week, to solve the flicker problem.
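Until that lands, here is a rough, hypothetical illustration of what temporal SMPL smoothing usually means: low-pass filtering the per-frame SMPL parameters before the condition maps are rendered. This is a sketch of the general idea, not the feature the team will release, and naively filtering axis-angle poses is only a first approximation; proper smoothing would operate on rotations (e.g., quaternions).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_smpl_sequence(poses: np.ndarray, betas: np.ndarray, sigma: float = 2.0):
    """Hypothetical smoothing of a tracked SMPL sequence.

    poses: (T, 72) per-frame axis-angle pose parameters.
    betas: (T, 10) per-frame shape parameters.
    """
    # Low-pass filter the pose trajectory over time to suppress jitter.
    poses_smooth = gaussian_filter1d(poses, sigma=sigma, axis=0)
    # Body shape should not change across frames; pin it to the mean.
    betas_fixed = np.broadcast_to(betas.mean(axis=0), betas.shape).copy()
    return poses_smooth, betas_fixed
```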