facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

A few thoughts on JEPA: task-goal / task-definition #61

Open · yuedajiong opened this issue 1 month ago

yuedajiong commented 1 month ago

(1) LeCun and his collaborators and doctoral students are all experts, and I greatly admire them. (2) Technically, my own understanding may be incorrect.

Just technical thoughts!!!

I always feel that JEPA is not quite suitable, or not expansive/leapfrogging enough, in terms of its task goal / task definition, which means the JEPA family of algorithms will not be enough even if they are optimized well.

(1) Even within vision alone, JEPA does not explain what representations it learns internally, nor what representations such as world models actually are. Are they still just the distributed weights of an ordinary neural network, are there special network structures such as latent representations, or are there explicit 3D/4D representations? Without delving into the details of JEPA, looking at this network in a general way does not reveal a significant difference, nor does it offer the kind of task-definition leap needed for stronger AI such as AGI/ASI. Even at the forefront represented by JEPA, I believe there has not been a fundamental breakthrough, even when focusing solely on pure vision tasks. I don't think an ideal, powerful vision system should be a one-way, one-shot, one-train-many-infer system similar to LLMs. Each act of visual processing should involve multiple visual recognitions occurring in parallel, alternating and iterating repeatedly before producing the final output.
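To make the "iterate before output" point concrete, here is a minimal sketch of what a multi-pass visual inference loop could look like, as opposed to a single forward shot. It is only an illustration: the module names and the GRU-based refinement step are my own assumptions, not anything from the JEPA codebase.

```python
import torch
import torch.nn as nn

class IterativeVisualInference(nn.Module):
    """Hypothetical sketch: revisit the same evidence several times,
    refining a latent state before the final readout."""
    def __init__(self, feat_dim=256, state_dim=256, num_steps=4):
        super().__init__()
        self.encoder = nn.Sequential(                   # per-image feature extractor
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.refine = nn.GRUCell(feat_dim, state_dim)   # iterative refinement step
        self.readout = nn.Linear(state_dim, state_dim)  # final representation
        self.num_steps = num_steps

    def forward(self, frames):
        # frames: (B, 3, H, W); the latent state is updated num_steps times
        # rather than being produced in one pass.
        feats = self.encoder(frames)
        state = torch.zeros(frames.size(0), self.refine.hidden_size,
                            device=frames.device)
        for _ in range(self.num_steps):
            state = self.refine(feats, state)
        return self.readout(state)

z = IterativeVisualInference()(torch.randn(2, 3, 224, 224))  # (2, 256) refined latent
```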

(2) From the perspective of the ultimate form of the vision task, I personally believe it should, like humans, be able to construct a 3-dimensional world from a single image, a pair of images (with left/right disparity), or a video, without needing the camera/observation position, and even a dynamic 4-dimensional world (in most cases, without requiring physical-level precision). The result could be a latent representation, but it would be better to have an explicit representation (such as a point cloud, surface mesh, Gaussian splatting, ...).
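As a sketch of what I mean by an explicit (rather than purely latent) 4D representation, even something as simple as a time-indexed point cloud would qualify. The structure below is purely illustrative and is not defined by JEPA or this repository.

```python
from dataclasses import dataclass
import torch

@dataclass
class Explicit4DScene:
    """Illustrative explicit representation: per-frame point clouds over time."""
    points: list                # points[t]: (N_t, 3) world-space xyz at frame t
    colors: list                # colors[t]: (N_t, 3) RGB aligned with points[t]
    timestamps: torch.Tensor    # (T,) time of each frame in seconds

    def at_time(self, t: float) -> torch.Tensor:
        # return the point cloud of the frame closest to time t
        idx = torch.argmin(torch.abs(self.timestamps - t)).item()
        return self.points[idx]

# toy usage: two frames of a 1000-point dynamic scene
scene = Explicit4DScene(
    points=[torch.randn(1000, 3), torch.randn(1000, 3)],
    colors=[torch.rand(1000, 3), torch.rand(1000, 3)],
    timestamps=torch.tensor([0.0, 0.04]),
)
print(scene.at_time(0.03).shape)   # torch.Size([1000, 3])
```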

(3) In order to support various high-level applications such as differentiable prediction/inference/planning based on vision, this latent or explicit representation must be usable by neural networks (for example, estimating how a moving object maneuvers around a building on the road). Depending on the requirements of further applications, this 3D/4D representation may also need estimated distances and semantic labels (human, building; stone, swamp; water, fog, glass, ...; old or new; soft or hard; ...).
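To illustrate the "differentiable planning on top of the representation" point (e.g. the object-around-a-building example), here is a toy sketch: if the scene is available as labeled points, a candidate trajectory can be optimized against it with ordinary gradients. The clearance objective and all numbers are assumptions for illustration, not an API of this repository.

```python
import torch

# explicit scene: points carrying the semantic label "building", plus start/goal
building_pts = torch.randn(500, 3) * 2.0
start, goal = torch.tensor([-5.0, 0.0, 0.0]), torch.tensor([5.0, 0.0, 0.0])

# trajectory: 20 free waypoints, initialized on the straight line start -> goal
alphas = torch.linspace(0, 1, 20).unsqueeze(1)
traj = (start + alphas * (goal - start)).clone().requires_grad_(True)

opt = torch.optim.Adam([traj], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    # clearance: penalize waypoints closer than 1.0 to any "building" point
    d = torch.cdist(traj, building_pts)                  # (20, 500) distances
    clearance = torch.relu(1.0 - d.min(dim=1).values).pow(2).sum()
    # smoothness + endpoint terms keep the path short and anchored
    smooth = (traj[1:] - traj[:-1]).pow(2).sum()
    anchor = (traj[0] - start).pow(2).sum() + (traj[-1] - goal).pow(2).sum()
    loss = clearance + 0.1 * smooth + 10.0 * anchor
    loss.backward()
    opt.step()
```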

(4) I want to append some samples (below) about vision AI, not limited to vision only.

Not questioning the great minds, just pondering: what is the ultimate definition of the visual task? Once JEPA is done well, how far away from it are we?

yuedajiong commented 1 month ago

The diversity of visual tasks, another example among hundreds of tasks: from dynamic vision objects to symbolic functions.

superi-cv-vision-to-symbol

https://github.com/yuedajiong/super-ai-objective-world

IMPORTANT: the functions are a black box; even the symbolic abstraction is forced and passive, not planned/designed by the algorithm.
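A minimal sketch of my reading of the "dynamic vision objects to symbolic functions" example (not code from the linked repo): given an observed 1-D trajectory of an object, a compact symbolic form can be recovered by sparse regression over a small library of candidate terms. The candidate library and pruning threshold are illustrative assumptions.

```python
import numpy as np

t = np.linspace(0, 4, 200)
y = 3.0 * np.sin(2.0 * t) + 0.5 * t                  # hidden "black box" trajectory

# candidate symbolic terms the fit is allowed to use
library = {
    "1": np.ones_like(t),
    "t": t,
    "t^2": t ** 2,
    "sin(2t)": np.sin(2.0 * t),
    "cos(2t)": np.cos(2.0 * t),
}
Phi = np.stack(list(library.values()), axis=1)       # (200, 5) design matrix
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # least-squares fit
coef[np.abs(coef) < 1e-3] = 0.0                      # prune negligible terms

terms = [f"{c:.2f}*{name}" for name, c in zip(library, coef) if c != 0.0]
print("y(t) ~ " + " + ".join(terms))   # e.g. "y(t) ~ 0.50*t + 3.00*sin(2t)"
```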

yuedajiong commented 1 month ago

The diversity of visual tasks, another example among hundreds of tasks: strong dependence on positional/orientation information (more tasks: dependence on equivariance (not invariance); reliance on raw 2D spatial information; ...).

direction-1

direction-2
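To make the equivariance-vs-invariance distinction concrete, here is a small self-contained sketch (not tied to the JEPA architecture): a convolution's feature map shifts when the input shifts (equivariance), while globally pooled features do not change (invariance) and therefore discard exactly the positional information such tasks depend on.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 16, 16)
x[0, 0, 4, 4] = 1.0                                    # a single "object" at (4, 4)
x_shift = torch.roll(x, shifts=(3, 3), dims=(2, 3))    # same object moved to (7, 7)

f, f_shift = conv(x), conv(x_shift)

# equivariance: shifting the input shifts the feature map by the same amount
print(torch.allclose(torch.roll(f, (3, 3), dims=(2, 3)), f_shift, atol=1e-6))  # True

# invariance: pooled (position-free) features cannot tell the two inputs apart
pooled, pooled_shift = f.mean(dim=(2, 3)), f_shift.mean(dim=(2, 3))
print(torch.allclose(pooled, pooled_shift, atol=1e-6))                          # True
```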