IDEA-Research / HumanTOMATO

[ICML 2024] 🍅HumanTOMATO: Text-aligned Whole-body Motion Generation
https://lhchen.top/HumanTOMATO

text-motion alignment pre-trained model #12

Open Wenretium opened 2 months ago

Wenretium commented 2 months ago

Hi! I am very interested in your work, especially the text-motion alignment pre-trained model. I hope to see your model and code released soon.

Wenretium commented 2 months ago

Or please allow me to ask some questions about this part. Reading your paper, I found that your model design is very similar to TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis. Can I say that your model is based on it, adding support for hand motions and replacing MPNet with sBERT? Did you train on TMR's code framework?

LinghaoChan commented 2 months ago

@Wenretium Thanks for your interest! Your comment is really in-depth and insightful.

Most of the answers are right. However, we did not train our model in TMR's framework; we implemented the model ourselves before they released their code. We plan to release our code in about 2 weeks. You can try the demo first.

Best,

Ling-Hao CHEN

Wenretium commented 2 months ago

I get it. Thanks for your quick reply!

LinghaoChan commented 1 month ago

Hi @Wenretium !

We have released the TMR training code in ./OpenTMA. Please check it out!

[Note]: Since the research target of this project is to clarify how to use text-motion alignment, TMR is renamed as TMA in the ICML-24 version.

Wenretium commented 1 month ago

Thank you very much! You provided very detailed code documentation.

LinghaoChan commented 1 month ago


@Wenretium You're welcome. If you have any questions, feel free to discuss!

Wenretium commented 1 month ago

Hello! I have another question. Since you didn't provide a full demo of the text-motion alignment loss, I implemented it based on my own understanding.

# Load text and motion data
import torch
import torch.nn.functional as F
import numpy as np
from os.path import join as pjoin
from transformers import AutoTokenizer, AutoModel
from tma.models.architectures.temos.textencoder.distillbert_actor import DistilbertActorAgnosticEncoder
from tma.models.architectures.temos.motionencoder.actor import ActorAgnosticEncoder
from sentence_transformers import SentenceTransformer
from collections import OrderedDict

modelpath = 'distilbert-base-uncased'

textencoder = DistilbertActorAgnosticEncoder(modelpath, num_layers=4)
motionencoder = ActorAgnosticEncoder(nfeats=263, vae=True, num_layers=4)

"""
Load the checkpoint here.
You need to normalize the motion data with mean and std.
For Motion-X, they are stored in './deps/t2m/motionx/vector_623/Comp_v6_KLD01/meta/*.npy'
"""

# load the state dict from the checkpoint
state_dict = torch.load('humanml3d.ckpt', map_location="cpu")["state_dict"]

textencoder_dict = OrderedDict()
for k, v in state_dict.items():
    if k.split(".")[0] == "textencoder":
        name = k.replace("textencoder.", "")
        textencoder_dict[name] = v
textencoder.load_state_dict(textencoder_dict, strict=True)

motionencoder_dict = OrderedDict()
for k, v in state_dict.items():
    if k.split(".")[0] == "motionencoder":
        name = k.replace("motionencoder.", "")
        motionencoder_dict[name] = v
motionencoder.load_state_dict(motionencoder_dict, strict=True)

text = ["a person wonders in an oval path and ends where he started"]
motion = np.load('/path/to/HumanML3D/new_joint_vecs/000014.npy')
mean = np.load(pjoin('/path/to/Comp_v6_KLD01/meta', 'mean.npy'))
std = np.load(pjoin('/path/to/Comp_v6_KLD01/meta', 'std.npy'))
motion = (motion - mean) / std
motion = torch.Tensor(motion).unsqueeze(0)
lengths = [motion.shape[1]]
text_emb = textencoder(text).loc
motion_emb = motionencoder(motion, lengths).loc
# print(text_emb)
# print(motion_emb)
print(torch.mean(text_emb - motion_emb))
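
For comparison, the cosine similarity between the two embeddings can also be printed (just a sketch of one common alternative, reusing text_emb and motion_emb from above):

# cosine similarity between the text and motion embeddings;
# values closer to 1 mean better text-motion alignment
cos_sim = torch.nn.functional.cosine_similarity(text_emb, motion_emb, dim=1)
print(cos_sim)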

My question is: Did I load the pretrained model correctly? In HumanTOMATO, did you calculate the text-motion alignment loss by 'torch.mean(text_emb - motion_emb)'?

LinghaoChan commented 1 month ago

@Wenretium Thanks for the reminder. Here are the implementation details:

infoloss = InfoNCE(0.1)
filter_model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')
# if TMA supervision is enabled
if args.supervision:
    # generated motions
    all_supervise_motion = torch.cat(gen_supervise_tensor_list, dim=0).cuda()
    # motion length = token length * 4 (the upsampling rate is 4)
    full_m_tokens_len = (m_tokens_len.detach() * 4).tolist()
    # get TMR_motion_embedding
    TMR_motion_embedding = t2m_TMR_motionencoder(all_supervise_motion, full_m_tokens_len).loc
    # get TMR_text_embedding
    TMR_text_embedding = t2m_TMR_textencoder(texts).loc
    with torch.no_grad():
        text_embedding = filter_model.encode(texts)
        text_embedding = torch.tensor(text_embedding).cuda()
        normalized = F.normalize(text_embedding, p=2, dim=1)
        # pairwise cosine similarity between text embeddings
        emb_dist = normalized.matmul(normalized.T)
    loss_infonce = infoloss((TMR_motion_embedding, TMR_text_embedding), emb_dist)

    all_loss = loss_cls + args.lambdainfo * loss_infonce
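
The InfoNCE class used above is not defined in the snippet. Below is a minimal sketch of a symmetric InfoNCE loss with TMR-style filtering of likely false negatives that matches the call infoloss((TMR_motion_embedding, TMR_text_embedding), emb_dist); the interpretation of the constructor argument as a temperature and the 0.8 filtering threshold are assumptions, not necessarily the repository's exact implementation.

import torch
import torch.nn.functional as F

class InfoNCE(torch.nn.Module):
    # Sketch of a symmetric InfoNCE loss. emb_dist holds text-text cosine
    # similarities from an external sentence encoder and is used to mask out
    # likely false negatives (near-duplicate captions). The 0.8 threshold is
    # an assumption.
    def __init__(self, temperature=0.1, filter_threshold=0.8):
        super().__init__()
        self.temperature = temperature
        self.filter_threshold = filter_threshold

    def forward(self, embeddings, emb_dist=None):
        motion_emb, text_emb = embeddings
        motion_emb = F.normalize(motion_emb, p=2, dim=1)
        text_emb = F.normalize(text_emb, p=2, dim=1)

        # (B, B) similarity matrix: entry (i, j) compares motion i with text j
        logits = motion_emb @ text_emb.T / self.temperature
        batch_size = logits.shape[0]

        if emb_dist is not None:
            # never mask the diagonal, which holds the true positive pairs
            off_diag = ~torch.eye(batch_size, dtype=torch.bool, device=logits.device)
            mask = (emb_dist > self.filter_threshold) & off_diag
            logits = logits.masked_fill(mask, float('-inf'))

        labels = torch.arange(batch_size, device=logits.device)
        # symmetric cross-entropy: motion-to-text and text-to-motion retrieval
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))

With this sketch, infoloss = InfoNCE(0.1) and loss_infonce = infoloss((TMR_motion_embedding, TMR_text_embedding), emb_dist) behave as in the snippet above.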

Any questions are welcome!