Closed bharathraja closed 11 months ago
GPT can't generate text anywhere near fast enough to react to a real-time environment, and if it could then we'd be broke from the token costs of generating detailed actions 60 times a second.
This isn't the worst idea; it could be done if the Gradio tools plug-in is fixed up a bit.
I have embeddings fever and it's exciting to see that we got a latent space for motions going, but multimodal, spatiotemporal autoencoding and decoding isn't cheap.
This should probably be renamed to something like "multi-modality"?
This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.
This issue was closed automatically because it has been stale for 10 days with no activity.
Duplicates
Summary
Can Auto-GPT be expanded to take multimodal input such as images, audio, and touch, and to act through a humanoid robot body? Modularizing image object recognition, audio-to-text processing, and touch-based input tokenization into a text format would integrate all the senses. This could enable truly autonomous humanoid robots, and it could be tested initially in simulated environments such as OpenAI Gym.
There is literature that has expanded GPT's ability to human action sequences, for example: https://actiongpt.github.io/
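A minimal sketch of the modularization idea above: each sensory channel is converted to text independently, then the results are fused into one observation string for the language model. All function and parameter names here are hypothetical placeholders, not real Auto-GPT or plug-in APIs; real perception models would replace the stub logic.

```python
# Hypothetical multimodal-to-text pipeline. Each "tokenizer" below stands
# in for a real model (image captioner, speech-to-text, touch encoder).

def describe_image(objects):
    # Placeholder for an object-recognition model whose labels are
    # serialized as text.
    return "sees: " + ", ".join(sorted(objects))

def transcribe_audio(words):
    # Placeholder for a speech-to-text model.
    return "hears: " + " ".join(words)

def tokenize_touch(readings):
    # Placeholder for mapping pressure-sensor readings to words.
    return "feels: " + ", ".join(
        f"{sensor}={'contact' if value > 0.5 else 'no contact'}"
        for sensor, value in sorted(readings.items())
    )

def build_observation(objects, words, readings):
    """Fuse all modalities into one text observation for the LLM loop."""
    return "\n".join([
        describe_image(objects),
        transcribe_audio(words),
        tokenize_touch(readings),
    ])

observation = build_observation(
    objects={"cup", "table"},
    words=["pick", "up", "the", "cup"],
    readings={"left_gripper": 0.9, "right_gripper": 0.1},
)
print(observation)
```

The design choice here is that fusion happens at the text level, so the language model itself needs no architectural changes; the trade-off, as noted in the comments below, is latency and token cost per perception step.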
Examples
No response
Motivation
i-Robot movie