Open patricklabatut opened 1 year ago
Can you give an expected timeframe on when depth estimation will be available?
I'd also be interested to hear.
I'd also be happy if you could share the semantic segmentation heads. The one that produces the results on the web demo. Thx!
Would be excellent to obtain depth estimation output per image. Supportive of this enhancement!
segmentation head similar to the demo please
Also interested in acquiring depth info per image, really cool!
Also very interested to have the depth estimation head model documentation (and model/weights if possible).
@patricklabatut Thank you so much for the main code. Would you please update us about the timeline of delivering the depth-estimation code as well. Please let us know if any help is needed.
Could you please release the segmentation part?
Very interested and waiting for your release!
Cool!
very interested in releasing the depth estimation head
Interested in depth estimation head as well (or any documentation on how to reproduce the results using provided models)
Interested in the depth part also!
@patricklabatut could you maybe shed some light on the decision not to release the depth estimation parts immediately? I'm not much into deep learning research, but if you trained and tested it, is it a lot of effort to just publish it? Or am I too naive?
@patricklabatut amazing work! any approximate timeline on if/when a trained depth estimation head could be released?
I would love to learn the news about the depth
I would also appreciate an example code for depth estimation. Can't do much with the model's output embeddings yet. Thanks!
Very interested in the depth estimation code! I tried to add a linear head, but I don't know how to convert the (batch_size, num_of_tokens, feature_dim) tensor to (batch_size, 256, image_width, image_height) to get the paper's result on SUNRGBD.
Would appreciate greatly if your pre-trained depth estimator/optical flow model is released! Can't wait to try it on my videos!
Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!
Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).
It would be awesome if someone trained a depth estimation head on top of the provided backbone (dinov2_vitl14_pretrain.pth). Any thoughts on who/how, and an ETA?
I would also like to request an estimated release date for the depth estimation pre-train head. Thank you.
Two questions about the "DPT decoder" mentioned in the depth estimation part of 7.3 Dense Recognition Tasks. I searched the DPT source code: does the "DPT decoder" refer to its refinenet? If yes, I'm curious why you chose this decoder. Thank you!
@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper
Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is clearly lower than that in the paper.
Hi how much RMSE did you get for depth estimation with DPT decoder? For NYUv2 or SUNRGBD? I am really interested in the results. Thank you very much! @emojilearning
Hi @patricklabatut, thanks for releasing the code and starting this issue to track progress on depth estimation.
I have tried to re-implement this but have not been successful (I was unable to achieve an RMSE below 0.52 for ViT-B/14). My re-implementation is based on the following quoted part from Sec. 7.4. There are many details missing that I filled in, but I cannot seem to get the performance reported. I hope that this can help others who also seem to be struggling with reproducing this number, and perhaps make it easy for the authors to highlight the key difference that would help us reproduce the depth probe.
I am basing my experiments on the part describing the simplest setup, lin. 1, for ViT-B/14, which requires training a single linear layer on top of the frozen final layer's output:
lin. 1: we extract the last layer of the frozen transformer and concatenate the [CLS] token to each patch token. Then we bi-linearly upsample the tokens by a factor of 4 to increase the resolution. Finally we train a simple linear layer using a classification loss by dividing the depth prediction range in 256 uniformly distributed bins and use a linear normalization following Bhat et al. (2021).
Below I detail my attempt based on the details provided in the paper:
Image extraction. I assumed that you were training at a resolution similar to NYU (480x640). I went down to 462x616, since both dimensions are multiples of 14 and the aspect ratio is preserved. Depending on the setup, we might have augmentations or not. In the case of extracting dense features once and training a layer on them, there might be no augmentations. Alternatively, we can keep the backbone frozen and train with image augmentations. I tried both. For augmentations, I used ColorJitter, RandomResizedCrop, RandomRotation (<= 10 degrees), and RandomHorizontalFlip. With the exception of jitter, those augmentations were applied to both images and depth.
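The 462x616 choice can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the thread: the helper name `largest_patch_aligned_size` is hypothetical, and it assumes the goal is the largest size no bigger than the original that keeps the exact aspect ratio with both sides divisible by the patch size.

```python
from math import gcd


def largest_patch_aligned_size(height, width, patch=14):
    """Largest (h, w) <= (height, width) with the same aspect ratio
    and both sides divisible by `patch`. Hypothetical helper."""
    g = gcd(height, width)
    rh, rw = height // g, width // g  # reduced aspect ratio, e.g. 3:4
    # The common multiplier m must make rh * m and rw * m divisible by patch.
    step_h = patch // gcd(patch, rh)
    step_w = patch // gcd(patch, rw)
    step = step_h * step_w // gcd(step_h, step_w)  # least common multiple
    m = (g // step) * step
    if m == 0:
        raise ValueError("image is too small for this patch size and aspect ratio")
    return rh * m, rw * m


h, w = largest_patch_aligned_size(480, 640)
print(h, w, h // 14, w // 14)  # 462 616 33 44
```

For NYU-sized inputs this gives a 33x44 token grid, which matches the feature shape quoted in the next step.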
Feature extraction. The output tokens form a grid that is 14x smaller than the full image. You can get the patch tokens and the CLS token from the output of DINOv2 and then reshape them into the correct shape, as seen below. This results in an output of batch x 1536 x 33 x 44.
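import torch
import einops as E

# Load the ViT-B/14 backbone from torch.hub; it stays frozen.
vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").cuda().eval()

# image: float tensor of shape (batch, 3, H, W) on the GPU, with H and W multiples of 14.
with torch.no_grad():
    ret = vit.forward_features(image)
patch_tok = ret["x_norm_patchtokens"]  # (batch, h * w, 768)
cls_tok = ret["x_norm_clstoken"]  # (batch, 768)

_, _, img_h, img_w = image.shape
patch_h, patch_w = img_h // 14, img_w // 14  # integer division: grid sizes must be ints

# Reshape the token sequence into a spatial grid, then broadcast the CLS token over it.
patch_tok = E.rearrange(patch_tok, "b (h w) c -> b c h w", h=patch_h, w=patch_w)
cls_tok = cls_tok[:, :, None, None].repeat(1, 1, patch_h, patch_w)
output = torch.cat((patch_tok, cls_tok), dim=1)  # (batch, 1536, patch_h, patch_w)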
Depth estimation. The paper states that they bilinearly upsample the features by a factor of 4 and then apply a linear layer. Since each token covers a 14x14 patch, this leaves a resolution discrepancy of 3.5x (14 / 4) relative to the input. I tackled this by simply upsampling again to match the depth resolution. The linear layer is a simple 1x1 convolution applied to the grid that maps the features to a 256-dimensional vector representing the probabilities for each of the depth bins. I then apply the AdaBins uniform-bin baseline, which assigns one depth value to each of the 256 bins. The inner product of those two vectors is the output depth. It is worth noting that both AdaBins and BinsFormer use adaptive bins for some minor performance gain; however, the difference in performance caused by bin choice is much smaller than the gap between my results and those reported in the paper.
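The head described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `UniformBinDepthHead` is hypothetical, a softmax is used to turn logits into bin probabilities (the paper mentions a linear normalization instead), the second upsampling is applied to the logits rather than to the features, and bin centers are taken as midpoints of uniform bins over 0-10 m.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniformBinDepthHead(nn.Module):
    """Hypothetical linear probe: per-pixel distribution over uniform bins -> expected depth."""

    def __init__(self, in_channels=1536, n_bins=256, min_depth=0.0, max_depth=10.0):
        super().__init__()
        # A 1x1 convolution is a linear layer applied independently at every grid position.
        self.classifier = nn.Conv2d(in_channels, n_bins, kernel_size=1)
        edges = torch.linspace(min_depth, max_depth, n_bins + 1)
        self.register_buffer("bin_centers", 0.5 * (edges[:-1] + edges[1:]))

    def forward(self, feats, out_size):
        # Step 1: bilinear upsampling of the token grid by 4, as stated in the paper.
        feats = F.interpolate(feats, scale_factor=4, mode="bilinear", align_corners=False)
        logits = self.classifier(feats)
        # Step 2 (assumption): close the remaining 3.5x gap by resizing the logits.
        logits = F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
        probs = logits.softmax(dim=1)  # assumption: softmax, not the paper's linear norm
        # Expected depth: inner product of bin probabilities and bin centers.
        depth = torch.einsum("bkhw,k->bhw", probs, self.bin_centers)
        return depth.unsqueeze(1)


# Toy channel and bin counts keep the example light; the thread uses 1536 and 256.
head = UniformBinDepthHead(in_channels=32, n_bins=16)
with torch.no_grad():
    depth = head(torch.randn(1, 32, 33, 44), out_size=(462, 616))
print(depth.shape)  # torch.Size([1, 1, 462, 616])
```

Because the output is a convex combination of bin centers, predictions always stay inside the chosen depth range, which is one reason the range (here 0-10 m) matters.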
Loss. This is where things get a bit confusing. The paper seems to suggest that they use BinsFormer with uniform bin size and 256 bins, as noted above. This is typically trained with the scale-invariant depth loss: the model estimates depth and the loss is then applied to that estimate. Using a classification loss, while possible, seemed like an odd choice. In that case, one would discretize the depth into 256 bins (I used a range of 0-10 m) and then apply a cross-entropy loss. I tried both losses, and the scale-invariant loss does better.
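The two options can be written down compactly. The sketch below is an assumption-laden illustration rather than the setup used in the paper or in the thread: `silog_loss` follows the common scale-invariant log form sqrt(mean(d^2) - lam * mean(d)^2) with d = log(pred) - log(target), `lam=0.85` is only an illustrative default, and `depth_to_bin_index` is a hypothetical helper for the classification variant.

```python
import torch
import torch.nn.functional as F


def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss over pixels with valid (positive) ground truth."""
    mask = target > eps
    d = torch.log(pred[mask].clamp_min(eps)) - torch.log(target[mask])
    var = (d ** 2).mean() - lam * d.mean() ** 2
    return torch.sqrt(var.clamp_min(0.0) + eps)


def depth_to_bin_index(depth, n_bins=256, min_depth=0.0, max_depth=10.0):
    """Discretize metric depth into uniform bin indices for a cross-entropy loss."""
    scaled = (depth - min_depth) / (max_depth - min_depth)
    return (scaled * n_bins).long().clamp(0, n_bins - 1)


target = torch.rand(2, 1, 8, 8) * 9.0 + 0.5  # synthetic depth in [0.5, 9.5] m
pred = target * 1.1                          # synthetic prediction, 10% too far
logits = torch.randn(2, 256, 8, 8)           # synthetic per-bin logits

regression_loss = silog_loss(pred, target)
classification_loss = F.cross_entropy(logits, depth_to_bin_index(target.squeeze(1)))
```

With `lam=1.0` the first loss ignores a global scale error entirely; smaller values penalize it partially. The cross-entropy variant treats neighboring bins as unrelated classes, which may be part of why it feels like an odd fit for depth.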
Optimization. I used AdamW (default parameters) with a cosine schedule for learning rate decay. I split the training data randomly at the level of room types with a train-val split of 0.7:0.3. I trained for 20 epochs. Training for 100 epochs didn't seem to help much.
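A group-level split like the one described can be sketched as below. The function name `split_by_group`, the `(room_type, frame_id)` sample layout, and the seed are assumptions made for illustration; note that the 0.7:0.3 ratio is applied to the number of groups here, not to the number of samples.

```python
import random
from collections import defaultdict


def split_by_group(samples, group_of, train_frac=0.7, seed=0):
    """Split samples so that every group lands entirely in train or in val."""
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)
    names = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_train = max(1, round(train_frac * len(names)))
    train = [s for name in names[:n_train] for s in groups[name]]
    val = [s for name in names[n_train:] for s in groups[name]]
    return train, val


# Hypothetical samples: (room_type, frame_id) pairs.
samples = [(f"room_{r}", i) for r in range(10) for i in range(5)]
train, val = split_by_group(samples, group_of=lambda s: s[0])
print(len(train), len(val))  # 35 15
```

Splitting by group rather than by frame avoids near-duplicate frames from the same room appearing on both sides of the split, which would otherwise make validation numbers look better than they are.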
As I noted, I have tried several different variants and none of them could achieve the performance reported in the paper. I would greatly appreciate any feedback from the authors with either their implementation or suggesting what might be different between the setup I described above and the setup used in the paper. Thank you!
Hi @mbanani, thanks for sharing the research details. I am also working on depth estimation with the DINOv2 backbone and obtained an unexpected result.
For the simplest setup, lin. 1, stated in the paper, I first used the KITTI dataset. For data preprocessing, I just slightly resized the original RGB image to satisfy "height (or width) % 14 == 0", while the dense depth ground truth was resized using 'nearest' mode.
I totally agree with the step of Feature Extraction you described.
For depth estimation, I think the vision transformer backbone used in DINOv2 naturally provides a spatially low-resolution feature map, but with more embedding dimensions. I was also unsure whether there are any operations that rescale the features to the original image size instead of directly upsampling by 4 and then by 3.5. I tried a U-Net decoder structure (without skip concatenation in my case) that upsamples successively by 2, 2, 2 and 1.75. Between each pair of upsampling blocks, a conv2d was used to extract features and change the embedding dimension. Finally, the linear head was trained as a regression task using the scale-invariant loss.
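A decoder of this kind might look roughly like the sketch below. It is not the commenter's code: the class name `ProgressiveUpsampleDecoder`, the channel widths, the 3x3 kernels, and the final Softplus (used so that a log-based loss is defined on positive outputs) are all assumptions. Only the upsampling factors 2, 2, 2 and 1.75, which multiply to 14, come from the description.

```python
import torch
import torch.nn as nn


class ProgressiveUpsampleDecoder(nn.Module):
    """Hypothetical decoder: upsample by 2, 2, 2, 1.75 (14x in total), no skip connections."""

    def __init__(self, in_channels=1536, widths=(256, 128, 64, 32), scales=(2, 2, 2, 1.75)):
        super().__init__()
        blocks = []
        channels = in_channels
        for width, scale in zip(widths, scales):
            blocks += [
                nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
                nn.Conv2d(channels, width, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            channels = width
        self.blocks = nn.Sequential(*blocks)
        # 1x1 convolution as the final linear regression head.
        self.head = nn.Conv2d(channels, 1, kernel_size=1)
        self.positive = nn.Softplus()  # assumption: keeps depth strictly positive

    def forward(self, feats):
        return self.positive(self.head(self.blocks(feats)))


decoder = ProgressiveUpsampleDecoder()
with torch.no_grad():
    depth = decoder(torch.randn(1, 1536, 33, 44))
print(depth.shape)  # torch.Size([1, 1, 462, 616])
```

Unlike the lin. 1 probe, this adds several trainable convolutions, so results from such a decoder are not directly comparable to the linear-probe numbers in the paper.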
However, at inference time, the estimated depth (on an image also taken from KITTI) was unexpected, especially for scenes where many cars are parked along the side of the road.
Above is my experience and opinion, thank you
When could a trained depth estimation head be released?
Same quest here. I would really appreciate it if a depth estimation head is available.
same here.
Hi folks,
Just added support for DPT + DINOv2 in 🤗 Transformers: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DPT/DPT_inference_notebook_(depth_estimation).ipynb.
We've extended the DPT model (which is one of the best depth estimation decoders) to now also leverage DINOv2 as backbone. It can be created as follows:
from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation
backbone_config = Dinov2Config.from_pretrained("facebook/dinov2-base", out_features=["stage1", "stage2", "stage3", "stage4"])
config = DPTConfig(backbone_config=backbone_config)
model = DPTForDepthEstimation(config)
Transferred all checkpoints to the hub: https://huggingface.co/models?pipeline_tag=depth-estimation&other=dinov2&sort=trending.
@NielsRogge thanks for the support!
Question ~ if I already have DINOv2 embeddings extracted, is there a way for me to run them through the depth estimation portion only?
Hi @palol, yes that's possible, you could do it as follows:
from transformers import DPTForDepthEstimation
model = DPTForDepthEstimation.from_pretrained("facebook/dpt-dinov2-small-kitti")
# note: we need to set a certain height and width (this is normally the height and width of the image passed to the model)
height = width = 518
patch_size = model.config.backbone_config.patch_size
patch_height = height // patch_size
patch_width = width // patch_size
hidden_states = model.neck(dino_features, patch_height, patch_width)
predicted_depth = model.head(hidden_states)
Note that the dino_features here need to be a list of 4 feature maps extracted from a DINOv2-small model in this case (as we're loading facebook/dpt-dinov2-small-kitti from the hub), across the 4 stages that correspond to the small one (which is stage [3, 6, 9, 12]). This is because the DPT head uses feature maps/embeddings from 4 different layers of DINOv2.
@NielsRogge thanks for the solution. So this means that enough of the backbone has to be preserved to follow the "lin. 4" protocol. Do you have any support for the "lin. 1" protocol, that only uses the last layer of the frozen transformer?
Related issues: #6, #14, #46, #97