Mael-zys / T2M-GPT

(CVPR 2023) PyTorch implementation of “T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations”
https://mael-zys.github.io/T2M-GPT/
Apache License 2.0

It is very difficult to generate an FBX, mesh, or SMPL file from the colab #8

Open · poof420 opened this issue 1 year ago

poof420 commented 1 year ago

Is it really possible to generate it? It doesn't seem to be working at all with the code provided. Thank you!

Mael-zys commented 1 year ago

Hello, here is another colab demo which can generate the SMPL file: https://colab.research.google.com/drive/1DGSHYtiWy8zDdyiQSgldq8VkxMIAu4Ql?usp=sharing

But it takes some time to set up the environment. I will keep working on the rendering part of the colab demo.
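
For turning the generated SMPL parameters into an actual mesh, something along the lines of the sketch below may help. This is not the demo's official export path: the file name `smpl_params.npy`, its layout, and the local SMPL model path are assumptions, and the `smplx` and `trimesh` packages would need to be installed separately.

```python
# Minimal sketch (assumptions, not the official pipeline): the demo is assumed to
# dump per-frame SMPL parameters to 'smpl_params.npy' as a dict with
# 'global_orient' (T, 3), 'body_pose' (T, 69) and 'transl' (T, 3),
# and a SMPL model file (e.g. SMPL_NEUTRAL.pkl) is assumed to be available locally.
import numpy as np
import torch
import smplx    # pip install smplx
import trimesh  # pip install trimesh

params = np.load('smpl_params.npy', allow_pickle=True).item()  # hypothetical output file
num_frames = params['body_pose'].shape[0]

# Build a neutral SMPL body model; 'path/to/smpl_models' is a placeholder.
model = smplx.create('path/to/smpl_models', model_type='smpl',
                     gender='neutral', batch_size=num_frames)

output = model(
    global_orient=torch.from_numpy(params['global_orient']).float(),
    body_pose=torch.from_numpy(params['body_pose']).float(),
    transl=torch.from_numpy(params['transl']).float(),
)

# Export the first frame as an OBJ mesh; loop over frames for a full sequence.
verts = output.vertices[0].detach().cpu().numpy()
trimesh.Trimesh(vertices=verts, faces=model.faces, process=False).export('frame_000.obj')
```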

poof420 commented 1 year ago

Okay, I will try it. Thank you. If possible, it would be great to be able to do this more easily in the official demo.

poof420 commented 1 year ago

Ah, I tried the colab demo you shared, but it fails partway through the run:

CalledProcessError: the following command returned non-zero exit status 1:

```
source activate VQTrans

python
import sys
sys.argv = ['GPT_eval_multi.py']
import options.option_transformer as option_trans
args = option_trans.get_args_parser()

args.dataname = 't2m'
args.resume_pth = 'pretrained/VQVAE/net_last.pth'
args.resume_trans = 'pretrained/VQTransformer_corruption05/net_best_fid.pth'
args.down_t = 2
args.depth = 3
args.block_size = 51
import clip
import torch
import numpy as np
import models.vqvae as vqvae
import models.t2m_trans as trans
import warnings
warnings.filterwarnings('ignore')

## load clip model and datasets
clip_model, clip_preprocess = clip.load("ViT-B/32", device=torch.device('cuda'), jit=False, download_root='./')  # Must set jit=False for training
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad = False

net = vqvae.HumanVQVAE(args,  ## use args to define different parameters in different quantizers
                       args.nb_code,
                       args.code_dim,
                       args.output_emb_width,
                       args.down_t,
                       args.stride_t,
                       args.width,
                       args.depth,
                       args.dilation_growth_rate)

trans_encoder = trans.Text2Motion_Transformer(num_vq=args.nb_code,
                                              embed_dim=1024,
                                              clip_dim=args.clip_dim,
                                              block_size=args.block_size,
                                              num_layers=9,
                                              n_head=16,
                                              drop_out_rate=args.drop_out_rate,
                                              fc_rate=args.ff_rate)

print('loading checkpoint from {}'.format(args.resume_pth))
ckpt = torch.load(args.resume_pth, map_location='cpu')
net.load_state_dict(ckpt['net'], strict=True)
net.eval()
net.cuda()

print('loading transformer checkpoint from {}'.format(args.resume_trans))
ckpt = torch.load(args.resume_trans, map_location='cpu')
trans_encoder.load_state_dict(ckpt['trans'], strict=True)
trans_encoder.eval()
trans_encoder.cuda()

mean = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy')).cuda()
std = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy')).cuda()

# change the text here
clip_text = ["a person runs in a circle and flails their arms"]

text = clip.tokenize(clip_text, truncate=True).cuda()
feat_clip_text = clip_model.encode_text(text).float()
index_motion = trans_encoder.sample(feat_clip_text[0:1], False)
pred_pose = net.forward_decoder(index_motion)

from utils.motion_process import recover_from_ric
pred_xyz = recover_from_ric((pred_pose*std+mean).float(), 22)
xyz = pred_xyz.reshape(1, -1, 22, 3)

np.save('motion.npy', xyz.detach().cpu().numpy())

import visualization.plot_3d_global as plot_3d
pose_vis = plot_3d.draw_to_batch(xyz.detach().cpu().numpy(), clip_text, ['example.gif'])
```
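
A CalledProcessError only reports the exit status; the underlying Python traceback is swallowed by the shell wrapper. One way to surface the real error is to rerun the cell's command with its output captured. This is a sketch only: it assumes the cell shells out through `subprocess`, and the command string below is a placeholder to be replaced with whatever the failing cell actually runs.

```python
# Sketch for debugging: capture stdout/stderr so the real Python error is visible.
import subprocess

cmd = "source activate VQTrans && python your_script.py"  # placeholder: substitute the cell's actual command
result = subprocess.run(["bash", "-lc", cmd], capture_output=True, text=True)

print("return code:", result.returncode)
print(result.stdout)
print(result.stderr)  # the actual traceback usually shows up here
```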

Mael-zys commented 1 year ago

I've double-checked the script, and it works for me. You can try it again, or try the Hugging Face Space demo: https://huggingface.co/spaces/vumichien/generate_human_motion

clearsitedesigns commented 1 year ago

It looks like we all had the same question. I have also been looking at several ways to generate motion-based FBX files (from a natural-language approach, video recognition, etc.), and this methodology would be a huge time saver. I looked at the colab, but I'm not very familiar with it. The Hugging Face Space is a bit friendlier to understand, but the mesh is not an exportable output.
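
Until mesh/FBX export is supported in the demos, the inference script above already saves the raw joint trajectories to motion.npy with shape (1, T, 22, 3). A small sketch like the one below (assuming that file layout, and that only joint positions rather than a skinned mesh are needed) dumps each frame as a tiny OBJ point cloud so the motion can at least be inspected in a DCC tool; rigging and FBX export would still be a separate step (for example via Blender's bpy and bpy.ops.export_scene.fbx).

```python
# Rough starting point, not an FBX exporter: load the saved joint positions and
# write one OBJ point cloud per frame (22 HumanML3D joints in xyz).
import numpy as np

motion = np.load('motion.npy')   # expected shape: (1, T, 22, 3)
joints = motion[0]               # (T, 22, 3)

for t, frame in enumerate(joints):
    with open(f'frame_{t:04d}.obj', 'w') as f:
        for x, y, z in frame:
            f.write(f'v {x} {y} {z}\n')
```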