Agreed. In order to effectively simulate these components, particularly the non-rigid ones, it is necessary to first decompose them from ECON's reconstruction and then reparameterize them with appropriate 3D representations, such as strands for haircuts and open surfaces for garments.
I mainly focus on real people (specific real individuals like YuLiang, not just generic humans), as I posted.
Yesterday I tried your ECON for 3D reconstruction of a 2D anime game avatar, where the user inputs 1~4 images (front, back, and the two sides), and compared it against other methods (force mesh fitting, Shap-E, ...). YES, your ECON is the best one so far; at least I got a basic human form. My test sample is clearly outside your training data distribution, but ECON still reconstructed a basic human form. Very good.
But all of us are still far from a commercially usable algorithm.
For hair, there is a hard-core algorithm from NVIDIA (keywords: nvidia hair interactive admm). It looks very silky.
A 'real' human reconstruction, with the body and its key components (finely articulated hands, silky hair, eyeballs, ...) plus clothes/shoes/watch/glasses/rings/..., is still very difficult.
I wanted to build it step by step, component by component, but today I think that is an endless road of no return.
Maybe the more practical way is still data-driven: implicit representations, end-to-end differentiable, including differentiable rendering.
The ultimate task definition is the most important thing, my bitter lesson; the task definition must be: e2e for anything, not only rigid objects, but also non-rigid faces, muscles, even hair, smoke, water/waterfalls, clouds, ...
Dynamic (not only reconstructing a dynamic object, but also inferring a dynamic object from a timestep input, as K-Planes does for NeRF) and interactive (a relatively blank research area) ... these should also be included in the ultimate task definition.
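To make "implicit + time-conditioned + end-to-end differentiable" concrete, here is a minimal sketch, assuming a toy density+color MLP conditioned on time and a basic volume-rendering quadrature; `TimeConditionedField` and `render_ray` are my own placeholder names, not code from ECON or K-Planes.

```python
import torch
import torch.nn as nn

class TimeConditionedField(nn.Module):
    """Toy implicit field f(x, y, z, t) -> (density, rgb)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),   # input: (x, y, z, t)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # output: (density, r, g, b)
        )

    def forward(self, xyz, t):
        h = self.mlp(torch.cat([xyz, t], dim=-1))
        density = torch.relu(h[..., :1])       # non-negative density
        rgb = torch.sigmoid(h[..., 1:])        # colors in [0, 1]
        return density, rgb

def render_ray(field, origin, direction, t, n_samples=64, near=0.1, far=4.0):
    """Differentiable volume rendering of a single ray at time t."""
    z = torch.linspace(near, far, n_samples)
    pts = origin + z[:, None] * direction                   # (n_samples, 3)
    density, rgb = field(pts, t.expand(n_samples, 1))
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-density.squeeze(-1) * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                  # per-sample contribution
    return (weights[:, None] * rgb).sum(dim=0)               # composited pixel color

# Training is then just photometric supervision through the renderer:
# loss = (render_ray(field, o, d, t) - gt_pixel).pow(2).mean(); loss.backward()
```

Because the renderer is differentiable, a plain photometric loss on rendered pixels can supervise geometry, appearance, and dynamics at once, which is what "e2e for anything" would require.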
The road ahead is long and winding. You dared to go to Germany for a PhD, which I hear is very hard, so I hope your searching high and low produces a super-vision: unified, reconstruct anything. Yes, I am on that road, too.
I summarized the essence of ECON. Is this correct?
My bitter lesson: there is no reconstruction, only conditional generation. If we want to 3D-construct anything, we must use priors BY DEFAULT, whether explicit (mesh, ...) or implicit (NeRF, implicit functions, ...). Otherwise we are on an endless road.
Every day I tell myself: only for photorealistic real humans may I allow some special hand-crafted steps; for any other kind of object, I must not hand-craft things case by case. I restrict myself to that one short-term shortcut just to get some results out.
Suppose we need to "construct" (whether the algorithm reconstructs or generates) a running Swiss watch from a few photos of the opened case, some descriptive text (mentioning complications such as a minute repeater), or other ControlNet-style conditioning. The final output must be the watch's static structure (every individual part and how the parts mesh with each other), and, fed second-by-second timestamps, it must run correctly (with globally correct physical constraints and correct part-to-part interactions). An algorithm that automatic, with hardly any manual steps, would be the kind of super-vision method I mean: unified, fully automatic, strongly constrained.
Presumably that last step would internally contain an iterative loop, and it would have to exploit priors already learned (even the simplest watch has at least a few gear wheels) and constraints already learned in various settings (for example the 60x gear ratios). Right now everyone mostly builds static objects with relatively simple appearance; it is hard to imagine fitting a fine mechanical structure well using only silhouette, rendered RGB, and regularizer losses such as edge, normal, and laplacian_smoothing, which are themselves essentially priors already.
@YuliangXiu, here is my summary of what I have understood about hair and garments after all my tinkering so far:
hair:
The classic approaches, the undemanding or lower-quality ones, mostly use a fixed shell such as a hemisphere and carve out the face via occupancy; it is a relatively static, mid-range way of doing hair. The input image yields strand information, followed by a refinement step that fools the eye in the final rendering. The number of strands such methods can actually support is not large, since more strands means much more computation, and even with strands (say 10 to 100 segments per strand, thousands to tens of thousands of strands) the expressiveness is still not enough. The best result I have seen is NVIDIA's interactive hair ADMM work; there is only a short paper and no open-source code, and much of NVIDIA's recent work is no longer open-sourced.
I like NVIDIA's result, and it is interactive, but I still do not think it is the best possible. Personally I believe a dynamic, interactive, implicit representation (NeRF, or some hair-specific implicit function?) could be better: good final visual quality, interactive, and dynamic (e.g., interacting with the scalp and with wind).
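For reference, this is roughly what the strand representation above boils down to; a toy sketch with hypothetical names and a hemisphere standing in for the scalp, not NVIDIA's ADMM solver:

```python
import numpy as np

def sample_hair_strands(num_strands=5000, points_per_strand=32,
                        scalp_center=(0.0, 1.6, 0.0), scalp_radius=0.1,
                        strand_length=0.25, seed=0):
    """Build a toy haircut: an array of polylines rooted on a crude 'scalp'."""
    rng = np.random.default_rng(seed)
    # Roots: random points on the upper half of a sphere standing in for the scalp.
    theta = rng.uniform(0.0, 2.0 * np.pi, num_strands)
    phi = rng.uniform(0.0, 0.5 * np.pi, num_strands)
    roots = np.stack([np.sin(phi) * np.cos(theta),
                      np.cos(phi),
                      np.sin(phi) * np.sin(theta)], axis=-1) * scalp_radius
    roots = roots + np.asarray(scalp_center)
    # Grow each strand downward in short segments with a little random curl.
    seg = strand_length / (points_per_strand - 1)
    dirs = np.tile(np.array([0.0, -1.0, 0.0]), (num_strands, 1))
    strands = np.zeros((num_strands, points_per_strand, 3))
    strands[:, 0] = roots
    for i in range(1, points_per_strand):
        dirs = dirs + rng.normal(scale=0.15, size=dirs.shape)   # random curl
        dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
        strands[:, i] = strands[:, i - 1] + seg * dirs
    return strands   # shape: (num_strands, points_per_strand, 3)
```

5,000 strands x 32 points is already 160k vertices, which is why the strand count stays limited in practice and an image-space refinement pass is usually added on top.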
garment: So far I have seen three kinds of representation. The first reuses SMPL, pulling its vertices outward to serve as clothing, possibly with extra vertices for unusually loose garments. The second wraps a standard shell around the body and uses that separate mesh shell as the clothing; its expressiveness is limited. The third, the one I finally chose, gives each garment a truly independent mesh; MGN does this too. With the third approach, during subsequent motion you drive SMPL and then have to compute how the garment follows. If that computation goes wrong, you get the body poking through the clothes as in my screenshot above. When the pose is extreme, the mostly normal-based computation of how to move the garment is genuinely challenging. If all we want from one photo is a static 3D figure of a person with clothes, say for a figurine or a crystal sculpture, then not separating body and clothing is fine. But what we really need is what good 3D games already have: body and clothing interacting, even watching a character take clothes off and put them on, so I feel the original task definition has to separate the two.
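As a concrete (and deliberately naive) illustration of why the third option breaks under extreme poses, here is a sketch of the simplest possible garment follow-up: bind each garment vertex to its nearest rest-pose body vertex and copy its displacement. The function names are mine; MGN and real pipelines are considerably more careful than this.

```python
import numpy as np

def bind_garment_to_body(garment_verts_rest, body_verts_rest):
    """For each garment vertex, index of its nearest rest-pose body vertex.
    Brute-force pairwise distances; fine for a sketch, use a KD-tree in practice."""
    d = np.linalg.norm(garment_verts_rest[:, None, :] - body_verts_rest[None, :, :], axis=-1)
    return np.argmin(d, axis=1)                       # (n_garment,)

def follow_body(garment_verts_rest, body_verts_rest, body_verts_posed, binding):
    """Move each garment vertex by the displacement of its bound body vertex."""
    displacement = body_verts_posed[binding] - body_verts_rest[binding]
    return garment_verts_rest + displacement

# Usage sketch:
# binding = bind_garment_to_body(shirt_rest, smpl_rest)                  # once
# shirt_posed = follow_body(shirt_rest, smpl_rest, smpl_posed, binding)  # per frame
```

Nothing here knows about collisions or cloth stretch, so a large rotation of the hip or shoulder immediately drags the cloth into the body, which is exactly the "poking through" failure in my capture.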
Of course, a bigger question beyond these specific technical points: looking back 10 or 30 years from now, can we really solve the construction, let alone the manipulation, of such a vast visual world object by object, component by component, step by step?
My goal is super-AI, including unifying symbol & vision. I have read several hundred of the mainstream mesh/NeRF papers from the last five years, worked through and analyzed dozens of open-source projects one by one, and, with very manual methods, built a system that clones a real person from photos and drives motion from text or video (for example: Liu Yifei dancing in Lujiazui); in realism and overall functional completeness it is quite advanced. So while building, I keep proposing and then rejecting my own choices of technical route.
What should that very unified super-vision actually look like?
Personally, I think that for that ultimate vision algorithm the explicit route does not work. Exporting a mesh should only be the very last step, for downstream consumers such as UE, for fast rendering and rigging.
In fact, especially once time dependence and interaction are involved, it is hardly possible to handle complex objects and motion with an explicit, precomputed mesh + motion representation. The way I see it, the system should answer on demand: under the observer's current camera pose and the current scene (with its interacting objects), what should this object's shape and texture look like right now, and then produce a 2D image for that viewpoint as close to real time as possible. It is like a quantum wave function: when the observer looks, I collapse and show myself. If you insist on something discrete like a mesh, even with several levels of detail as in UE, you can never reach the unlimited detail that an implicit, continuous, dynamic representation can give.
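In interface terms, the "collapse when observed" idea might look like the sketch below; `SceneState` and `ImplicitActor` are hypothetical names of mine, just to show that the object exposes a query rather than a precomputed mesh + motion:

```python
from dataclasses import dataclass, field
from typing import Dict
import numpy as np

@dataclass
class SceneState:
    """Hypothetical container for the current scene context: time plus interaction signals."""
    time: float
    interactions: Dict[str, np.ndarray] = field(default_factory=dict)  # e.g. wind field, contacts

class ImplicitActor:
    """Hypothetical base class: an object that only materializes when observed."""

    def render(self, camera_pose: np.ndarray, state: SceneState) -> np.ndarray:
        """Return an H x W x 3 image of this object for the given view and scene state."""
        raise NotImplementedError

    def export_mesh(self, state: SceneState):
        """Optional discretization, only as a final export for downstream engines (UE, etc.)."""
        raise NotImplementedError
```

Under this view, a mesh export is an optional, lossy snapshot of the field for one particular state, rather than the primary representation.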
We recently released a new work, TeCH, which can produce a detailed back-side surface with texture that is consistent with the front side.
Homepage: https://huangyangyi.github.io/TeCH/ Code: https://github.com/huangyangyi/TeCH
Overall:
my pipeline (not yet my e2e unified one): nude body with skin, upper/under garments, hair, head-move follow-up
total logic (see the sketch after this list): image -> face-shape+texture / body-shape -> body/face/hand reconstruction -> cloth pick-and-dress & hair-style match-and-generate -> text/video-guided motion generation for the human model -> cloth/hair follow-up -> dynamic model and video
video capture: garment follow-up is not handled well yet; the body pokes through the clothes.
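To make the stage list above explicit, here is the data flow as plain Python stubs; every function name is a placeholder of mine standing in for a separate model or tool, nothing here is end to end:

```python
def estimate_shape_and_texture(images):
    # image -> face shape + texture, body shape
    return {"face_shape": None, "face_texture": None, "body_shape": None}

def reconstruct_body_face_hands(priors):
    # parametric body + face + hand reconstruction
    return {"body_mesh": None, **priors}

def dress_and_style(avatar, images):
    # cloth pick-and-dress, hairstyle match-and-generate
    avatar.update({"garments": [], "hair": None})
    return avatar

def generate_motion(prompt_or_video):
    # text/video-guided motion generation (toy 3-frame motion here)
    return [{"pose": t} for t in range(3)]

def follow_up(avatar, pose):
    # cloth/hair follow-up: the weakest step today, where the body pokes through
    return {"avatar": avatar, "pose": pose}

def run_pipeline(images, prompt):
    avatar = reconstruct_body_face_hands(estimate_shape_and_texture(images))
    avatar = dress_and_style(avatar, images)
    return [follow_up(avatar, p) for p in generate_motion(prompt)]  # -> frames / video

print(len(run_pipeline(images=[], prompt="a person dancing")))      # 3
```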