Our model is trained to predict from exactly two images, but in practice we find that duplicating a single-view image and feeding both copies into the network works well. This is what the demo does when you upload one image, and I believe the DUSt3R/MASt3R authors handle the single-view case similarly.
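For reference, the duplication trick can be sketched in a few lines. This is a minimal sketch assuming the image is a NumPy array and the (hypothetical) two-view model consumes a stacked pair of views; the actual input format depends on the model's preprocessing.

```python
import numpy as np

def make_single_view_pair(image: np.ndarray) -> np.ndarray:
    """Build a two-view input from one image by duplicating it.

    image: (H, W, 3) array for a single view.
    Returns a (2, H, W, 3) batch where both views are identical,
    which a pairwise model can consume as-is.
    """
    return np.stack([image, image], axis=0)

# Usage: feed `pair` to the two-view network in place of a real pair.
image = np.zeros((256, 256, 3), dtype=np.float32)
pair = make_single_view_pair(image)
```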
If you are particularly interested in the single-view case, you might want to have a look at Flash3D, which is trained for single-view inference and is based on monocular depth.