I'm trying to reimplement the "Text to Robot Pose with CLIP" experiment from the paper but haven't been able to match its results, despite trying to reproduce the conditions it describes. I also wrote a script for aligning CLIP features using the 🤗 openai/clip-vit-base-patch32 encoder.
The initial pose comes from get_canonical_pose in utils.mujoco_utils, and I optimize it with Adam.
The loss is the negative dot product between the text and image embeddings:
loss = -torch.matmul(image_embedding, text_features.T.detach())
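For reference, here is a minimal, runnable sketch of the loop I'm describing. The real pipeline renders the hand in MuJoCo and encodes the frame with CLIP; since that part isn't shown here, a fixed random linear map stands in for render+encode (an assumption purely for illustration), so only the mechanics — Adam on the pose parameters, negative-dot-product loss — are demonstrated:

```python
# Sketch of the CLIP-alignment loop. `render_and_encode` and `text_features`
# are stand-ins (assumptions), NOT the paper's code: in the real setup the
# image embedding comes from CLIP applied to a MuJoCo render of the pose.
import torch

torch.manual_seed(0)

EMBED_DIM = 512   # CLIP ViT-B/32 embedding size
POSE_DIM = 24     # hypothetical number of hand joint parameters

render_and_encode = torch.nn.Linear(POSE_DIM, EMBED_DIM)  # stand-in for render + CLIP image encoder
text_features = torch.randn(1, EMBED_DIM)                 # stand-in for CLIP text embedding

pose = torch.zeros(1, POSE_DIM, requires_grad=True)       # e.g. from get_canonical_pose(...)
optimizer = torch.optim.Adam([pose], lr=1e-2)

losses = []
for step in range(200):
    optimizer.zero_grad()
    image_embedding = render_and_encode(pose)
    # Normalizing both sides turns the dot product into a cosine similarity,
    # which keeps the loss scale comparable across prompts and runs.
    image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
    text_embedding = text_features / text_features.norm(dim=-1, keepdim=True)
    loss = -torch.matmul(image_embedding, text_embedding.T.detach()).mean()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

With normalized embeddings the loss is bounded in [-1, 0-ish], so an absolute value like -24 or -30 suggests the paper's numbers come from unnormalized embeddings (possibly scaled by CLIP's logit scale) — which is part of why I'm unsure my setup matches.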
I've noticed some oddities. First, the loss starts much lower than the value reported in the paper: the project webpage shows an initial value around -24, but my reproduction starts below -30. I suspect the prompts are responsible. Second, the optimization struggles to reach the desired pose.
Could you share more implementation details for this part, such as the optimizer settings and any additional tricks you used?
hi qrcat! thank you for your interest in our paper
centering the hand in the image matters a lot. huggingface's clip preprocessing automatically resizes images to a square, so also make sure you aren't passing in a long rectangle
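One way to follow that advice is to pad the render to a square before handing it to the processor, so the resize can't distort or crop away the hand. A small PIL-only sketch (the 640×240 "render" is just a placeholder frame):

```python
# Pad a (possibly wide) render to a centered square canvas, so CLIP's
# square resize doesn't stretch the image or push the hand off-center.
from PIL import Image

def pad_to_square(img: Image.Image, fill=(0, 0, 0)) -> Image.Image:
    """Paste the image onto a centered square canvas without resampling."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas

frame = Image.new("RGB", (640, 240), (30, 30, 30))  # placeholder for a wide MuJoCo render
square = pad_to_square(frame)
```

The padded square can then go straight into the CLIP image processor.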
we found the optimization process to be a bit sensitive to learning rates, so i suggest playing around with those as well
yes, the prompt matters, i suggest starting out with what we included in the paper
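Since the prompt shifts the loss value noticeably, it can be cheap to score a few phrasings and keep the best. The templates below are illustrative guesses, not the ones from the paper:

```python
# Hypothetical prompt templates (not from the paper) — score each and keep
# whichever gives the best-behaved optimization.
templates = [
    "a photo of a robot hand {}",
    "a rendering of a shadow hand {}",
    "a robotic hand {}",
]

def build_prompts(description: str) -> list[str]:
    return [t.format(description) for t in templates]

prompts = build_prompts("making a thumbs up")
```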