facebookresearch / ijepa

Official codebase for I-JEPA, the Image-based Joint-Embedding Predictive Architecture. First outlined in the CVPR paper, "Self-supervised learning from images with a joint-embedding predictive architecture."

yuedajiong-question #06: what is the better pathway to create a unified vision for super-AI? #11

Closed. yuedajiong closed this issue 1 year ago.

yuedajiong commented 1 year ago

I-JEPA?

I think a unified 3D task is better (conditional generation/reconstruction for remembering priors, with implicit and explicit representations).

Image(s)/Video -> f_cond_gen_as_recon(timestep, ...) -> [implicit-representation-by-object] #here123 + camera-information (origin + direction) -> f_differentiable_render_not_only_nerf(cam_info, scene-representation, timestep, implicit-representation-by-object, ...) -> Image(s)/Video

scene-representation: for multi-object interaction. timestep: for dynamics, not only training/reconstruction but also inference/generation.

here123: if an explicit representation is needed, we can add a mapping branch that transforms the per-object implicit representation into an explicit one, like NeRF to mesh.

This is end-to-end (E2E) differentiable.
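Below is a minimal PyTorch sketch of this pipeline, only to make the wiring concrete. All module internals, shapes, and the class names (`CondGenAsRecon`, `DifferentiableRenderer`) are illustrative assumptions; only the overall chain follows the description above: images -> conditional generator/reconstructor -> per-object implicit codes -> differentiable renderer conditioned on camera information -> images, trained end-to-end with a reconstruction loss.

```python
# Sketch only: placeholder modules, not a proposed implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondGenAsRecon(nn.Module):
    """f_cond_gen_as_recon: frames (+ timestep) -> per-object implicit codes."""
    def __init__(self, num_objects=4, code_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_codes = nn.Linear(128 + 1, num_objects * code_dim)
        self.num_objects, self.code_dim = num_objects, code_dim

    def forward(self, frames, timestep):
        feat = self.backbone(frames)                        # (B, 128)
        t = timestep.view(-1, 1).float()                    # (B, 1)
        codes = self.to_codes(torch.cat([feat, t], dim=-1))
        return codes.view(-1, self.num_objects, self.code_dim)  # implicit-representation-by-object

class DifferentiableRenderer(nn.Module):
    """f_differentiable_render: object codes + camera info (+ scene code) -> rendered frame."""
    def __init__(self, code_dim=256, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.decode = nn.Sequential(
            nn.Linear(code_dim + 6, 512), nn.ReLU(),
            nn.Linear(512, 3 * image_size * image_size),
        )

    def forward(self, object_codes, cam_info, scene_code=None):
        # Mean-pool object codes as a stand-in for multi-object scene composition.
        scene = object_codes.mean(dim=1)
        if scene_code is not None:
            scene = scene + scene_code
        x = torch.cat([scene, cam_info], dim=-1)            # cam_info: origin (3) + direction (3)
        img = self.decode(x).view(-1, 3, self.image_size, self.image_size)
        return torch.sigmoid(img)

# End-to-end differentiable reconstruction step.
encoder, renderer = CondGenAsRecon(), DifferentiableRenderer()
frames = torch.rand(2, 3, 64, 64)
timestep = torch.tensor([0, 1])
cam_info = torch.rand(2, 6)
codes = encoder(frames, timestep)
recon = renderer(codes, cam_info)
loss = F.mse_loss(recon, frames)
loss.backward()                                             # gradients flow through the whole chain
```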

--- We can train this unified vision system; after training:

  1. we can use the per-object implicit representations, and the latent tensors inside f_cond_gen_as_recon, as inputs to an upper-level, LeCun-style total-JEPA.
  2. we have an explicit, vision-oriented world model for upper-level thinking; the system can imagine, for example, a car going off a cliff. --- We can also train this vision system independently.

just from my view:

  1. Task definition is very important. Most research work will be of little value, because the algorithms only serve that limited, local task definition.
  2. What is the better unified vision task? (If we have a chance to rethink it before super-AI arrives.)
     P0: satisfy physical reality as much as possible: 3D in the physical world --> 2x2D in human perception --> reconstructed 3D.
     P1: multitask, so that the necessary information is kept in the network.

And this is not a joke. It is a tragedy:
most researchers and entrepreneurs focus on a LAAAAAAAAAAARGE model, instead of first building an AAAAAAAAAAAdvanced small model and scaling it up once the small one is smart enough.

Under the large-model benchmark leaderboard, not a single blade of grass grows.

MidoAssran commented 1 year ago

This is certainly an interesting proposition. In my view, the value of the approach you're describing is better modeling of depth, temporal, and general physical uncertainty in a given scene. I think going beyond 2D spatial uncertainty and modeling other notions of uncertainty is highly valuable and should be considered as a next step for research exploration.

In terms of the implementation, it is not clear to me whether this modeling has to be generative, i.e., whether you need to go back to making predictions in input space. In I-JEPA, we demonstrated that one can model spatial uncertainty in an image by simply doing prediction in a learned latent space. I believe a similar principle will be true for modeling the aforementioned types of uncertainty as well.
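To make the distinction concrete, here is a rough schematic, not the repository's actual code: the modules, shapes, and loss choices below are placeholder assumptions. Latent-space prediction compares predicted and target latents, while a generative variant decodes back to input space and compares pixels.

```python
# Toy contrast between latent-space prediction and pixel-space reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, patch_dim = 128, 768
context_encoder = nn.Linear(patch_dim, embed_dim)  # encodes visible (context) patches
target_encoder = nn.Linear(patch_dim, embed_dim)   # encodes target patches (held fixed, e.g. an EMA copy)
predictor = nn.Linear(embed_dim, embed_dim)        # predicts target latents from context latents

context_patches = torch.rand(8, patch_dim)         # flattened context patches (toy data)
target_patches = torch.rand(8, patch_dim)          # flattened masked/target patches (toy data)

# Latent-space prediction: targets live in a learned latent space, no pixel decoding.
with torch.no_grad():
    target_latents = target_encoder(target_patches)
pred_latents = predictor(context_encoder(context_patches))
latent_loss = F.smooth_l1_loss(pred_latents, target_latents)

# Generative alternative: map predictions back to input space and compare pixels.
decoder = nn.Linear(embed_dim, patch_dim)
pixel_loss = F.mse_loss(decoder(pred_latents), target_patches)
```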

yuedajiong commented 1 year ago

THANKS @MidoAssran!!! I have read all of your replies, line by line, character by character. You are so kind.

You: "whether you need to go back to making predictions in input space."

Me: of course, for part of our tasks we can just use a highly abstract representation to make a decision; that is enough.

But, just from my personal, limited understanding, as I mentioned in previous post(s): most research work will be of little value because the task definition is too limited, too local, ...

What is a proper task definition today (from 2023 through the next ten years, as computing power and AI algorithms develop)? Try to match the complexity of the tasks that ordinary humans currently handle; this applies especially to AGI/super-AI researchers.

So I made a mini world including vision and symbol/language, a very, very mini one.

I want to give two examples:

  1. Vision, autonomous driving: 'A car is driving forward, a piece of glass is falling from the sky, a bucket of water is being splashed onto the car, and a steel rod is being inserted into the car.' (translated by ChatGPT) What is the scene in the next second?
  2. Symbol, ChatGPT-like: 'Please repeat the fifth (5) letter of the fourth (4) word in the third (3) line of the second (2) paragraph of this article six (6) times to emphasize it.'

I just want to say:
sometimes we need to do operations in the original space, not only for vision but also for symbols. What is the philosophy behind this? We can make decisions in an abstract space, using the shortcut paradigm; yes, that is AI. But more often we need to go back to the original space for 'Validation' and 'Operation', and even 'Visualization'. For validation, the information requires strict equality; for operation, the original space is our operation target; visualization is to align with humans and help them understand.

Even if we strictly follow LeCun's world-model route, in a concrete implementation I think we should also have the ability to compute in the original space (vision imaging and evaluation); too many tasks are vision-related.

In fact, the same is necessary in symbolic or other abstract spaces. An example with ChatGPT:
Me: "What is the middle char in the word super-intelligence?"
ChatGPT: "The word "super-intelligence" has an even number of characters. In such cases, there is no single middle character. However, we can identify the two middle characters. In this case, the two middle characters are "li.""
It is difficult to do precise operations on the original space this way.
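For contrast, the same operation done directly in the original (character) space is a one-line computation; a small Python illustration, not from any model:

```python
# Precise operation in the original character space: take the middle of the string.
word = "super-intelligence"
n = len(word)                                    # 18 characters (even)
middle = word[n // 2 - 1 : n // 2 + 1] if n % 2 == 0 else word[n // 2]
print(middle)                                    # prints "te"
```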

We can make 'Decisions' in an abstract space as a shortcut, BUT we need to go back to the original space for 'Validation' and 'Operation'. This is my understanding of super-AI: yes, if we want 'Super', the task definition must be proper. Maybe this sounds ridiculous. :-(