alex-moon / vc

DPT depth model #33

Closed by alex-moon 2 years ago

alex-moon commented 2 years ago

At present, the DPT hybrid model gives us these errors:

ipdb> self.patch_embed.backbone(x)
*** RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

ipdb> self.patch_embed.proj(x).flatten(2).transpose(1, 2)
*** RuntimeError: Given groups=1, weight of size [768, 1024, 1, 1], expected input[384, 384, 1, 3] to have 1024 channels, but got 384 channels instead
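
The first of those failures is reproducible in isolation: .view() requires strides compatible with the requested shape, while .reshape() copies the data when they are not. A minimal sketch (the shapes are illustrative, not taken from the model):

import torch

x = torch.randn(2, 3, 4).transpose(1, 2)  # transpose leaves the tensor non-contiguous
x.reshape(2, 12)                          # works: reshape copies when a view is impossible
x.view(2, 12)                             # RuntimeError: view size is not compatible ...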

Both calls above are in vc/service/helper/midas/midas/vit.py

More details:


ipdb> self.__class__
<class 'timm.models.vision_transformer.VisionTransformer'>
ipdb> self.patch_embed.__class__
<class 'timm.models.vision_transformer_hybrid.HybridEmbed'>
ipdb> self.patch_embed.backbone.__class__
<class 'timm.models.resnetv2.ResNetV2'>
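
So the patch embedding of the hybrid model runs the image through a ResNetV2 trunk and then projects the resulting feature map to token width with a 1x1 convolution. Roughly, this is what timm's HybridEmbed.forward does (a sketch; the shapes are assumptions for the 384x384 dpt_hybrid configuration):

feat = self.patch_embed.backbone(x)         # ResNetV2: [B, 3, 384, 384] -> [B, 1024, 24, 24]
tokens = self.patch_embed.proj(feat)        # 1x1 Conv2d(1024, 768): -> [B, 768, 24, 24]
tokens = tokens.flatten(2).transpose(1, 2)  # -> [B, 576, 768], one token per image patch

Note the 1x1 projection expects the backbone's 1024-channel features, not the raw 3-channel image, which is why calling self.patch_embed.proj(x) directly fails.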

This is certainly a problem with the torch.nn.Conv2d calls that are defined in DPTDepthModel.__init__
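
The input shape in the second error, [384, 384, 1, 3], looks like the 3-channel image arriving at the projection with its dimensions permuted, so Conv2d reads dimension 1 (size 384) as the channel count. A reproduction of that exact message (purely illustrative):

import torch

proj = torch.nn.Conv2d(1024, 768, kernel_size=1)      # weight is [768, 1024, 1, 1]
x = torch.randn(1, 3, 384, 384).permute(2, 3, 0, 1)   # shape becomes [384, 384, 1, 3]
proj(x)  # RuntimeError: ... expected input[384, 384, 1, 3] to have 1024 channels, but got 384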

Can confirm we get exactly the same error if we use the DPT codebase itself:

https://github.com/isl-org/DPT

I reckon this is liable to be a torch version problem.

alex-moon commented 2 years ago

For completeness: if we modify vc/service/helper/midas/run.py to add channels_last=True to the instantiation of DPTDepthModel for dpt_hybrid, we get the same error.
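
For reference, the change tried was along these lines (a sketch: channels_last is a keyword argument accepted by DPT's base class, and the other arguments are assumed from DPT's own run_monodepth.py rather than copied from vc):

model = DPTDepthModel(
    path=model_path,
    backbone="vitb_rn50_384",
    non_negative=True,
    channels_last=True,  # the flag added for this experiment
)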

alex-moon commented 2 years ago

OK, got it: as per https://github.com/isl-org/DPT/issues/37, the solution is to downgrade timm. Fine.
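
In practice that means pinning an older release in the project's requirements, e.g. (the exact version is an assumption on my part; the linked issue only says to downgrade, and older 0.4.x releases are the ones reported to work):

pip install timm==0.4.12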