YuliangXiu / ECON

[CVPR'23, Highlight] ECON: Explicit Clothed humans Optimized via Normal integration
https://xiuyuliang.cn/econ

I went the other two ways: 1) unified e2e, data-driven, with priors as constraints; 2) a pipeline, but with separate clothes, even hair, etc. #70

Closed. yuedajiong closed this issue 1 year ago.

yuedajiong commented 1 year ago

Overall:

  1. Although the results are OK so far, how far can this heavy, manual route really go?
  2. The separation of the different components is not considered: human (body, face, hand, foot, eyeball, ...) vs. accessories (clothes, hair, shoes, ...). That makes downstream tasks very difficult, especially dynamic and interactive ones, even dynamic hair, ...
  3. From a distant perspective on the technology choice, the most important thing about ECON compared with conditional generation (image-as-condition) is the introduction of SMPL as a human prior. From a close perspective, it uses normals for the clothes and fits SMPL for the human shape (sketch below).
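
To make point 3 concrete, here is a toy sketch of the normal-integration idea (recovering a clothed depth surface from a predicted normal map) as a simple orthographic least-squares problem in NumPy/SciPy. It only illustrates the principle; it is not the solver ECON actually uses.

```python
# Toy depth-from-normals integration on a regular grid (orthographic camera).
# Illustration of the principle only, NOT ECON's actual normal-integration solver.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def integrate_normals(normals):
    """normals: (H, W, 3) unit normal map; returns (H, W) depth up to a constant."""
    H, W, _ = normals.shape
    # Target gradients of depth z(x, y): dz/dx = -nx/nz, dz/dy = -ny/nz.
    nz = np.clip(normals[..., 2], 1e-3, None)
    gx = -normals[..., 0] / nz
    gy = -normals[..., 1] / nz

    n_pix = H * W
    idx = lambda y, x: y * W + x
    A = lil_matrix((2 * n_pix, n_pix))
    b = np.zeros(2 * n_pix)
    row = 0
    for y in range(H):
        for x in range(W):
            if x + 1 < W:                       # forward difference in x
                A[row, idx(y, x + 1)] = 1.0
                A[row, idx(y, x)] = -1.0
                b[row] = gx[y, x]
                row += 1
            if y + 1 < H:                       # forward difference in y
                A[row, idx(y + 1, x)] = 1.0
                A[row, idx(y, x)] = -1.0
                b[row] = gy[y, x]
                row += 1
    z = lsqr(A.tocsr()[:row], b[:row])[0]       # least-squares depth
    return z.reshape(H, W)
```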

my pipeline (not my unified e2e one): nude body with skin, upper/under garments, hair, head-movement follow-up

total logic: image -> face shape + texture / body shape -> body/face/hand reconstruction -> cloth-pick-and-dress & hair-style-match-and-generate -> text/video-guided motion generation for the human model -> cloth/hair follow-up -> dynamic model and video
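
A skeletal outline of that stage chain, just to make the boundaries between stages explicit. Everything here is a hypothetical placeholder of my own (stages injected as callables), not an existing API:

```python
# Hypothetical pipeline skeleton mirroring the stages above. Each stage is an
# injected callable, so the orchestration logic is complete even though the
# concrete models behind the stages are placeholders.
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class AvatarPipeline:
    estimate_face: Callable[[List[Any]], Any]     # image(s) -> face shape + texture
    estimate_body: Callable[[List[Any]], Any]     # image(s) -> body shape (e.g. SMPL params)
    reconstruct: Callable[[Any, Any], Any]        # shapes -> body/face/hand model
    dress: Callable[[Any, List[Any]], Any]        # cloth-pick-and-dress
    grow_hair: Callable[[Any, List[Any]], Any]    # hair-style match and generate
    generate_motion: Callable[[Any], List[Any]]   # text/video -> pose sequence
    drive: Callable[[Any, Any], Any]              # pose -> posed body
    follow_up: Callable[[Any, Any], Any]          # garment/hair follows the posed body
    render: Callable[..., Any]                    # posed assets -> frame

    def build(self, images):
        face = self.estimate_face(images)
        body_shape = self.estimate_body(images)
        body = self.reconstruct(body_shape, face)
        return body, self.dress(body, images), self.grow_hair(body, images)

    def animate(self, body, garments, hair, prompt_or_video):
        frames = []
        for pose in self.generate_motion(prompt_or_video):
            posed = self.drive(body, pose)
            frames.append(self.render(posed,
                                      self.follow_up(garments, posed),
                                      self.follow_up(hair, posed)))
        return frames                              # dynamic model -> video
```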

video capture: the garment follow-up is not handled well yet; the skin shows through.

[video capture screenshot]

YuliangXiu commented 1 year ago

Agreed. In order to effectively simulate these components, particularly the non-rigid ones, it is necessary to first decompose them from ECON's reconstruction and then reparameterize them with appropriate 3D representations, such as strands for haircuts and open surfaces for garments.

[images: strands for haircuts; open surfaces for garments]
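
A minimal sketch of what those two reparameterizations could look like as plain data structures, assuming strands stored as fixed-length polylines and garments as open (non-watertight) triangle meshes; the class and function names are illustrative only:

```python
# Toy containers for the two representations mentioned above:
# hair as polyline strands, garments as open (boundary-carrying) triangle meshes.
from dataclasses import dataclass
import numpy as np

@dataclass
class HairStrands:
    points: np.ndarray     # (num_strands, samples_per_strand, 3) polyline vertices
    roots: np.ndarray      # (num_strands, 3) attachment points on the scalp

@dataclass
class OpenSurfaceGarment:
    vertices: np.ndarray   # (V, 3)
    faces: np.ndarray      # (F, 3) triangle indices; the surface is NOT watertight
    boundary_loops: list   # vertex-index loops for necklines, cuffs, hems, ...

def strand_lengths(hair: HairStrands) -> np.ndarray:
    """Per-strand arc length, summed over polyline segments."""
    seg = np.diff(hair.points, axis=1)               # (N, S-1, 3)
    return np.linalg.norm(seg, axis=-1).sum(axis=1)  # (N,)
```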
yuedajiong commented 1 year ago

I mainly focus on real people (real individuals like YuLiang, not just generic real humans), as I posted.

Yesterday, I tried your ECON on 3D reconstruction of a two-dimensional anime game avatar. The user input is 1~4 images: front, back, and the two sides. Compared with the others I tried (forced mesh fitting, Shap-E, ...), your ECON is the best so far; at least I got a basic human form. I think my test sample is outside your training data distribution, but ECON still reconstructed a basic human form. Very good.

But all of us are still far from a commercially usable algorithm.

For hair, there is an impressive algorithm from NVIDIA; keywords: NVIDIA hair interactive ADMM. It is very silky.

A 'real' human reconstruction, with the body and its key components (finely articulated hands, silky hair, eyeballs, ...), plus clothes/shoes/watch/glasses/rings/..., is still very difficult.

I want to construct it step by step, component by component, but today I think this is a road of no return, and an endless one.
Maybe data-driven, implicit representation, e2e differentiable (including differentiable rendering) is the more practicable way.

The ultimate task definition is the most important thing; that is my bitter lesson. The task definition must be: e2e for anything, not only rigid objects, but also non-rigid faces, muscle, even hair, smoke, water/waterfalls, clouds, ...

Dynamic (not only reconstructing a dynamic object, but also inferring a dynamic object from a timestep input, as in NeRF variants like K-Planes), interactive (a relatively blank research area), ... these should all be included in the ultimate task definition.
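
As a reference point for the "timestep input" idea, here is a toy query in the spirit of K-Planes: factor 4D space-time into six 2D feature planes and combine the sampled features by element-wise product before decoding. All sizes and names are illustrative assumptions; this is not the actual K-Planes code.

```python
# Toy K-Planes-style query: six 2D feature planes over (x,y),(x,z),(y,z),(x,t),(y,t),(z,t),
# bilinear sampling, Hadamard combination, small MLP decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyKPlanes(nn.Module):
    PAIRS = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]

    def __init__(self, feat_dim=16, res=64):
        super().__init__()
        self.planes = nn.ParameterList(
            [nn.Parameter(torch.randn(1, feat_dim, res, res) * 0.1) for _ in self.PAIRS]
        )
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 4))          # (density, rgb)

    def forward(self, xyzt):                                     # xyzt: (N, 4) in [-1, 1]
        feat = None
        for plane, (a, b) in zip(self.planes, self.PAIRS):
            grid = xyzt[:, [a, b]].view(1, -1, 1, 2)             # (1, N, 1, 2)
            f = F.grid_sample(plane, grid, align_corners=True)   # (1, C, N, 1)
            f = f[0, :, :, 0].t()                                # (N, C)
            feat = f if feat is None else feat * f               # Hadamard combination
        return self.decoder(feat)                                # (N, 4)
```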

The road ahead is long and arduous. You dared to go to Germany for a PhD, which I hear is very hard, so I hope your searching high and low will produce a super-vision: unified reconstruction of anything. Yes, I am on that road too.

yuedajiong commented 1 year ago

I summarized the essence of ECON; is this correct?

[image: diagram summarizing ECON]

yuedajiong commented 1 year ago

My bitter lesson: there is no reconstruction, only conditional generation. If we want to 3D-construct anything, we must use priors BY DEFAULT, whether explicit (mesh, ...) or implicit (NeRF, implicit functions, ...). Otherwise we are on an endless road.

I tell myself every day: only for photorealistic real humans may I do some special manual work; any other object must not be handled case by case by hand. I limit myself to that one short-term shortcut just to get some results.

yuedajiong commented 1 year ago

Suppose we need to "construct" (whether the algorithm is reconstruction or generation) a running Swiss watch from a few photos of the opened case, some descriptive text (including complications such as a minute repeater), or other ControlNet-style conditioning constraints. The final output must be the watch's static structure (every individual part and the relative meshing relationships between them), and, given timestamps second by second, it must run correctly (with globally correct physical constraints and correct interactions between parts). Such a relatively automatic algorithm that needs few manual steps might be the kind of super-vision I mean: unified, fully automatic, strongly constrained.

One can imagine that this final step would contain an iterative loop internally, and it would have to use learned priors (even the simplest watch has at least a few gear wheels) and the various constraints already learned elsewhere (for example, the factor-of-60 relationships). Right now everyone mostly constructs static objects with relatively simple appearance; it is hard to imagine fitting a fine mechanical structure well using only silhouette, rendered RGB, and loss regularizers such as edge, normal, and laplacian_smoothing, which are themselves essentially embodiments of priors.
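
For concreteness, here is what such a fitting loop typically looks like, with exactly the loss terms named above (silhouette, rendered RGB, edge, normal, Laplacian smoothing). The render_rgb / render_silhouette callables are hypothetical hooks for a differentiable renderer; the regularizers are generic mesh losses that, as far as I know, PyTorch3D provides under these names. The point is that all of these terms are generic smoothness priors; none encodes watch-specific structure.

```python
# Generic differentiable mesh fitting with silhouette/RGB data terms and
# edge/normal/Laplacian regularizers. Sketch only; the two render_* callables
# are placeholders for a differentiable renderer.
import torch
from pytorch3d.structures import Meshes
from pytorch3d.loss import (mesh_edge_loss, mesh_normal_consistency,
                            mesh_laplacian_smoothing)

def fit_mesh(verts, faces, target_rgb, target_sil, render_rgb, render_silhouette,
             iters=500, lr=1e-3):
    offsets = torch.zeros_like(verts, requires_grad=True)   # per-vertex displacement
    opt = torch.optim.Adam([offsets], lr=lr)
    for _ in range(iters):
        mesh = Meshes(verts=[verts + offsets], faces=[faces])
        loss = (
            torch.nn.functional.mse_loss(render_rgb(mesh), target_rgb)           # appearance
            + torch.nn.functional.mse_loss(render_silhouette(mesh), target_sil)  # silhouette
            + 1.0 * mesh_edge_loss(mesh)           # keep edges short and uniform
            + 0.1 * mesh_normal_consistency(mesh)  # smooth normals across faces
            + 0.1 * mesh_laplacian_smoothing(mesh) # Laplacian smoothness prior
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return verts + offsets.detach()
```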

yuedajiong commented 1 year ago

@YuliangXiu here is my summary of what I understand about hair and garments after all my tinkering so far:

hair:
The classic approaches today, for low-requirement or lower-quality results, often use a fixed shape such as a hemisphere and expose the face via occupancy. The mid-range, relatively static approach takes an input image, extracts strand information, and then adds a refinement step that fools the eye in the final rendering. The number of strands that can be supported is actually not large, since too many becomes very expensive; even with strands (say 10 to 100 segments per strand, thousands to tens of thousands of strands), the expressiveness is still not enough. The best result I have seen is NVIDIA's interactive hair ADMM work. There is only a short paper and no open-source code; much of NVIDIA's recent work is closed-source. I like NVIDIA's result, and it is also interactive. But I still do not think that is the best direction; personally I believe a dynamic, interactive, implicit representation (NeRF, or some hair-specific implicit function?) might be better, ultimately achieving good visual quality while being interactive and dynamic (e.g., interacting with the scalp and with wind).
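
To make the per-strand cost concrete, here is a toy per-frame strand update: Verlet integration plus a follow-the-leader length projection down each strand. This is the cheap classic trick, not NVIDIA's ADMM solver; with ~10k strands of ~100 segments, even this simple pass is on the order of a million constraint projections per step.

```python
# Toy strand dynamics: Verlet step + follow-the-leader inextensibility projection.
import numpy as np

def step_strands(pos, prev_pos, roots, seg_len, dt=1 / 60, gravity=(0, -9.8, 0)):
    """pos, prev_pos: (N, S, 3) strand vertices; roots: (N, 3); seg_len: rest segment length."""
    g = np.asarray(gravity)
    new = 2 * pos - prev_pos + g * dt * dt      # Verlet step for every vertex
    new[:, 0] = roots                           # pin the root vertex to the scalp
    for i in range(1, new.shape[1]):            # follow-the-leader down each strand
        d = new[:, i] - new[:, i - 1]
        dist = np.linalg.norm(d, axis=-1, keepdims=True) + 1e-9
        new[:, i] = new[:, i - 1] + d / dist * seg_len   # re-impose segment length
    return new, pos                             # (positions, previous positions)
```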

garment:
I have seen three kinds of representations so far. One uses SMPL itself, pulling its vertices outward to act as clothing; extra vertices can of course be added to represent very loose or unusual garments. Another wraps a standard shell around the body and uses that separate mesh shell to represent the clothing; its representational power is actually limited. The last kind, which is what I finally chose, gives each garment a truly independent mesh; the MGN algorithm does this too. With the third approach, during subsequent motion you drive SMPL and then compute how the garment should follow. When that computation goes wrong, you get the skin showing through, as in my screenshot above. When poses are extreme, garment motion computed mainly from normals is genuinely challenging. If we only want a static 3D structure of a person with clothing from one photo, say to make a figurine or a crystal sculpture, then I think not separating body and clothing is fine. But what we need more is what the real world, or at least today's good 3D games, requires: body and clothing interacting, even watching a character take clothes off and put them on; so I feel the original task definition must separate the two.
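
A minimal sketch of the follow-up step in that third approach: bind each garment vertex to its nearest rest-pose body vertex, then transfer that vertex's rotation when the body is posed. The function names are mine, and the per-vertex rotations are assumed to come from the body model's skinning; this crude nearest-vertex binding is exactly the kind of computation that fails with skin clipping in extreme poses.

```python
# Nearest-vertex binding of a garment mesh to a body mesh, plus transform transfer.
import numpy as np
from scipy.spatial import cKDTree

def bind_garment(garment_rest, body_rest):
    """For each garment vertex, store the nearest rest-pose body vertex and the offset to it."""
    idx = cKDTree(body_rest).query(garment_rest)[1]   # (G,) nearest body-vertex indices
    return idx, garment_rest - body_rest[idx]         # (G,), (G, 3)

def follow_up(binding, body_posed, body_vertex_rotations):
    """body_vertex_rotations: (B, 3, 3) per-vertex rotations from rest to posed
    (assumed to come from the body model's skinning)."""
    idx, offsets = binding
    R = body_vertex_rotations[idx]                     # (G, 3, 3)
    rotated = np.einsum('gij,gj->gi', R, offsets)      # rotate each rest-pose offset
    return body_posed[idx] + rotated                   # posed garment vertices
```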

Of course, a bigger question beyond these concrete technical points is: looking back 10 or 30 years from now, can we really solve the construction, or even manipulation, of such a vast visual world object by object, component by component, step by step?

My goal is super-AI, including unifying symbol & vision. I have read several hundred mainstream papers from the last five years on meshes, NeRF, and so on, and worked through and analyzed dozens of open-source projects one by one. Then, with very manual methods, I built photo-based cloning of a real person and text/video-driven motion (for example, Liu Yifei dancing in Lujiazui); in realism and overall functional completeness it is quite competitive. So while building, I keep proposing and then rejecting my own ideas about which technical route to take.

What should that kind of super-vision, truly unified, actually look like?

Personally, I think the explicit route is a dead end if we are aiming at that ultimate vision algorithm. Mesh-style export should only be the final step, when handing the result to downstream engines like UE for fast rendering and driving.

In fact, especially when time dependence and interaction are involved, it is unlikely that complex objects and motion can be handled with an explicit, precomputed representation like mesh + motion. My understanding is that it should work on demand: given the observer's camera pose and the current scene (interactive objects), compute what shape and texture my current object should have, and render a 2D image for that viewpoint as close to real time as possible. It is like a quantum wave function: it only collapses and shows itself when the observer looks. If you insist on a discrete form like a mesh, even with several levels of detail as in UE, you can never reach the infinity of an implicit, continuous, dynamic representation.
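
A sketch of that "collapse on observation" interface: nothing is stored as a mesh; a 2D image is produced on demand from (camera pose, timestep, scene state) by querying an implicit field along rays. The field signature and the crude compositing are illustrative assumptions only.

```python
# On-demand rendering of an implicit, time- and scene-conditioned field.
import numpy as np

def render_on_demand(field, cam_origin, ray_dirs, t, scene_state,
                     near=0.5, far=3.0, n_samples=64):
    """field(points, t, scene_state) -> (density, rgb); ray_dirs: (H, W, 3) unit rays."""
    H, W, _ = ray_dirs.shape
    depths = np.linspace(near, far, n_samples)                    # (S,)
    pts = cam_origin + ray_dirs[..., None, :] * depths[:, None]   # (H, W, S, 3)
    density, rgb = field(pts.reshape(-1, 3), t, scene_state)      # query the implicit field
    density = density.reshape(H, W, n_samples, 1)
    rgb = rgb.reshape(H, W, n_samples, 3)

    # Crude front-to-back alpha compositing along each ray.
    delta = (far - near) / n_samples
    alpha = 1.0 - np.exp(-density * delta)                        # (H, W, S, 1)
    trans = np.cumprod(np.concatenate([np.ones_like(alpha[..., :1, :]),
                                       1.0 - alpha[..., :-1, :]], axis=2), axis=2)
    weights = alpha * trans                                       # (H, W, S, 1)
    return (weights * rgb).sum(axis=2)                            # (H, W, 3) image
```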

YuliangXiu commented 1 year ago

We recently released a new work, TeCH. It can produce a detailed, textured back-side surface that is consistent with the front side.

Homepage: https://huangyangyi.github.io/TeCH/
Code: https://github.com/huangyangyi/TeCH