facebookresearch / sapiens

High-resolution models for human tasks.
https://about.meta.com/realitylabs/codecavatars/sapiens/

Noise in the output depth map #151

Closed seagochen closed 1 month ago

seagochen commented 1 month ago

Hi, guys.

Thank you for your outstanding contribution to this project. I've noticed a problem with the depth maps: a regular pattern of white noise appears in the output, and I need your help to resolve it.

First, here is the simple test code I used:

import torch
import torch.nn.functional as F
import numpy as np
import cv2
import matplotlib.pyplot as plt

def load_model(checkpoint, use_torchscript=True, device='cuda:0'):
    """Load the TorchScript (or regular) checkpoint and set it to eval mode."""
    if use_torchscript:
        model = torch.jit.load(checkpoint, map_location=device)
    else:
        model = torch.load(checkpoint, map_location=device).module()
    model.eval()
    return model

# Define the device
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Define the model path
# model_path = "checkpoints/depth/sapiens_0.3b_render_people_epoch_100_torchscript.pt2"
model_path = "checkpoints/depth/sapiens_0.6b_render_people_epoch_70_torchscript.pt2"

# Load the model
model = load_model(model_path, device=device)

# Dummy input with random values; shape (1, 3, 1024, 768) = (N, C, H, W) for this model
dummy_input = torch.randn(1, 3, 1024, 768).to(device)

# Perform a forward pass with the dummy input
with torch.no_grad():
    output = model(dummy_input)  # the output is a single depth tensor, not a tuple

# Print the length and type of the output
print(f"Output length: {len(output)}, type: {type(output)}")

# Print the input and output shapes
print(f"Input shape: {dummy_input.shape} dtype: {dummy_input.dtype}")
print(f"Output shape: {output.shape} dtype: {output.dtype}")

# Normalize to [0, 255], move to the CPU, and convert to uint8 for display
output = (output - output.min()) / (output.max() - output.min()) * 255
output = output.squeeze().cpu().numpy().astype(np.uint8)

# Display the output
plt.imshow(output, cmap='gray')
plt.axis('off')
plt.show()

And here is the output:

[Screenshot 2024-10-24 095923]

When I use this model to estimate depth on a real image, I get output like this:

[Screenshot 2024-10-24 100139]

Whatever picture I use, noise is always generated around the person. Your demo on Hugging Face looks normal, so I think I must have made a mistake somewhere.

kke19 commented 1 month ago

Hi, is this because the image was resized to 768×1024 on input and then resized back to the original resolution after the prediction was completed? In the current implementation, I've noticed that regardless of the input image size, it is forcibly resized to 768×1024.
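
For reference, here is a rough sketch of the resize-and-restore flow I mean (this is my guess, not the repo's exact preprocessing; normalization is omitted for brevity):

import cv2
import torch
import torch.nn.functional as F

def predict_depth(model, image_bgr, device='cuda:0'):
    """Hypothetical helper: resize to the fixed 768x1024 input, then restore."""
    h, w = image_bgr.shape[:2]
    # The model expects a fixed 1024x768 (H x W) input regardless of image size.
    resized = cv2.resize(image_bgr, (768, 1024), interpolation=cv2.INTER_LINEAR)
    x = torch.from_numpy(resized).float().permute(2, 0, 1).unsqueeze(0).to(device)
    with torch.no_grad():
        depth = model(x)  # likely (1, 1, 1024, 768) for the depth head
    # Interpolate the prediction back to the original resolution.
    depth = F.interpolate(depth, size=(h, w), mode='bilinear', align_corners=False)
    return depth.squeeze().cpu().numpy()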

seagochen commented 1 month ago

> Hi, is this because the image was resized to 768×1024 on input and then resized back to the original resolution after the prediction was completed? In the current implementation, I've noticed that regardless of the input image size, it is forcibly resized to 768×1024.

Thank you for your response. I don't think it has anything to do with resizing; if you check my test code, you will see that the input shape is indeed (1, 3, 1024, 768). I have also changed the size to (1, 3, 768, 1024), and the noise still exists (just like in the first picture I pasted here).

Since I used a dummy input with random values, the test code pasted above should be easy to read. I want to know whether my procedure is correct.
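
For completeness, this is roughly how I preprocess a real image before feeding it to the model (the mean/std values and the "person.jpg" path below are just my assumptions for illustration; I am not sure they match what the official demo uses):

import cv2
import numpy as np
import torch

def preprocess(image_path, device='cuda:0'):
    """Hypothetical preprocessing: BGR->RGB, resize to 768x1024, normalize, NCHW."""
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB).astype(np.float32)
    image = cv2.resize(image, (768, 1024), interpolation=cv2.INTER_LINEAR)
    # Assumed normalization constants -- please correct me if the demo uses others.
    mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
    image = (image - mean) / std
    return torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0).to(device)

with torch.no_grad():
    output = model(preprocess("person.jpg"))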

rawalkhirodkar commented 1 month ago

@seagochen The Sapiens depth and normal estimators are supervised only on human pixels. For non-human pixels, the network predictions can be arbitrary. Although we have seen the models generalize to backgrounds in a few cases, this is not consistent. In your case, running inference on random noise can therefore produce grid artifacts due to the deconvolution operations.
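
As a rough sketch (assuming you already have a binary person mask for the image, e.g. from a segmentation model; the helper below is illustrative, not part of this repo), you can suppress the arbitrary background predictions before visualizing:

import numpy as np
import matplotlib.pyplot as plt

def visualize_masked_depth(depth, mask):
    """Show depth only on human pixels; `mask` is a binary (H, W) array."""
    depth = depth.astype(np.float32)
    person = depth[mask > 0]
    # Normalize with human-pixel statistics so the arbitrary background
    # values do not dominate the displayed range.
    vis = (depth - person.min()) / (person.max() - person.min() + 1e-8)
    vis = np.clip(vis, 0.0, 1.0)
    vis[mask == 0] = 0.0  # zero out the background predictions
    plt.imshow(vis, cmap='gray')
    plt.axis('off')
    plt.show()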