The symbiotic relationship between humans and technology has always been about transcending limitations. From the wheel to the printing press, to the internet, and now AI – each innovation has expanded human capabilities. Using AI to extend one's cognitive reach is just the latest chapter in this story.
The dream is to achieve a harmonious balance where technology feels like a natural extension of oneself, enhancing capabilities without overshadowing the human essence.
As technology evolves, and as we understand more about the human psyche and neurology, the pathways to achieve this balance will become clearer. The journey towards this future is filled with challenges, but also with immense possibilities.
Here, we imagine the potential of a symbiotic relationship between human and artificial intelligence, what it may look like, and how it can be achieved.
> [!NOTE]
> This document will be updated as we refine it from a dream into a plan. We are currently working on an Exocortex prototype here that you can inspect and try.
The Exocortex: An Echo of Experiences
Definition: The Exocortex is a remembrance agent, a generative agent specialized in remembering a narrative of observed events.
Vision: A timestamped and vectorized timeline of experiences and notes. Key to this vision is the concept of the memory stream, a dynamic entity that evolves with new experiences. It's designed to provide a sense of continuity, using time-ranked retrieval of related memories to give context to new experiences.
Enhancing communication: The immediate application, and likely testing ground, is a basic prototype built around a timeline of real events and notes in a specific domain, with the aim of identifying and eliminating communication bottlenecks. Through this, constructs can collaborate or exchange knowledge and experiences autonomously.
Privacy: First priority, never to be undermined. All computation should be performed on-device, and all communication should be done using peer-to-peer technologies like IPFS to keep data ownership in the hands of users. The exact AI model used should be swappable by the end user. Users should be able to create separate exocortexes for work and personal life, with the option to merge them.
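To make these constraints a little more concrete, a per-profile configuration might look roughly like the sketch below; every name and default is a placeholder, not a committed format:

```python
from dataclasses import dataclass

@dataclass
class ExocortexConfig:
    """Hypothetical per-profile settings reflecting the privacy goals above."""
    profile: str = "personal"                  # separate stores for work and personal life
    model_path: str = "models/local-llm.gguf"  # user-swappable, on-device language model
    embedding_model: str = "models/embed.onnx"
    storage_dir: str = "~/.exocortex/personal"
    p2p_transport: str = "ipfs"                # peer-to-peer sync; no central server
    allow_cloud_inference: bool = False        # privacy first: default to fully on-device

# Two profiles that could later be merged at the user's request.
personal = ExocortexConfig()
work = ExocortexConfig(profile="work", storage_dir="~/.exocortex/work")
```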
Memory Stream Architecture
Initial Memory Values:
Recency: Each new memory is timestamped upon creation. Recency decays over time, using an exponential decay function (e.g., decay factor of 0.995) based on the number of hours since the memory was last retrieved.
Importance: Assigned an initial score when created, indicating the poignancy of the memory on a scale from 1 (mundane) to 10 (extremely poignant).
Relevance: Initialized based on the context in which the memory was created, and updated based on the similarity between the memory’s embedding vector and the current context’s embedding vector.
Final Retrieval Score: The retrieval function scores all memories as a weighted combination of recency, importance, and relevance. Scores are normalized to the range of [0, 1] using min-max scaling. In the current implementation, all weights (𝛼) are set to 1. The top-ranked memories that fit within the language model’s context window are included in the prompt.
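A minimal sketch of this scoring scheme (exponential recency decay of 0.995 per hour, min-max normalization, equal weights), assuming a simple in-memory record type; names and structure are illustrative, not the final implementation:

```python
import time
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Memory:
    text: str
    embedding: np.ndarray  # vector for the memory's description
    importance: float      # 1 (mundane) .. 10 (extremely poignant)
    created_at: float = field(default_factory=time.time)
    last_retrieved: float = field(default_factory=time.time)

def _minmax(xs):
    """Normalize a list of scores to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def retrieve(memories, query_embedding, now=None, decay=0.995, alpha=(1.0, 1.0, 1.0), top_k=5):
    """Rank memories by recency, importance, and relevance; return the top_k."""
    now = now or time.time()
    recency = [decay ** ((now - m.last_retrieved) / 3600.0) for m in memories]
    importance = [m.importance for m in memories]
    relevance = [float(np.dot(m.embedding, query_embedding)
                       / (np.linalg.norm(m.embedding) * np.linalg.norm(query_embedding)))
                 for m in memories]  # cosine similarity to the current context
    a, b, c = alpha
    scores = [a * r + b * i + c * v
              for r, i, v in zip(_minmax(recency), _minmax(importance), _minmax(relevance))]
    ranked = sorted(zip(scores, memories), key=lambda p: p[0], reverse=True)
    return [m for _, m in ranked[:top_k]]
```

In practice, `top_k` would be replaced by however many top-ranked memories fit in the language model's remaining context window.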
Continuity Across Memories: As new events or memory entries occur, the system retrieves past memories based on factors like time, relevance, and importance. This provides context for the new memory entry. Objective observations become personalized to past experiences, especially the immediate past.
Meta-Observations: These remove the need to include the full memory transcript in the context window.
By treating a memory recollection as a new observation, the system can add new context to old memories without overwriting them, and the recalled memories become more likely to be retrieved again in the near future, mimicking an organic working memory.
Doing this also changes how things are remembered through the lens of the active context.
This is reminiscent of how human memory works: recalling a memory can change how it is remembered, and the act of remembering can itself become a new memory.
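Continuing the retrieval sketch above, recall itself could be recorded as a new observation; `embed` stands in for whatever embedding function the exocortex uses, and `retrieve` and `Memory` come from the earlier sketch:

```python
import time

def recall(memories, query_embedding, context_text, embed, now=None):
    """Retrieve memories and record the act of recalling them as a fresh observation."""
    now = now or time.time()
    retrieved = retrieve(memories, query_embedding, now=now)
    for m in retrieved:
        m.last_retrieved = now  # boosts recency, keeping these memories 'warm' for a while
    # The recollection becomes a new memory that re-describes the old ones through
    # the lens of the current context, without overwriting them.
    summary = f"While {context_text}, recalled: " + "; ".join(m.text for m in retrieved)
    memories.append(Memory(text=summary, embedding=embed(summary), importance=3.0))
    return retrieved
```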
Automated Memory Ingestion
Time-synchronized first-person observations of events are ideal for incorporating into the memory stream.
Audio-based Memories: Utilizing the WhisperX library for audio transcription. It includes speaker diarization, but not speaker identification; at minimum, we'll need to know when the user is the speaker.
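A sketch of that step, following the workflow in the WhisperX README; exact function names and arguments may differ between versions, and the diarization model needs a Hugging Face token:

```python
import whisperx

device = "cuda"  # or "cpu" for slower, fully on-device use
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("day_recording.wav")
result = model.transcribe(audio, batch_size=16)

# Word-level alignment, then diarization to tag segments with anonymous speaker labels.
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
result = whisperx.assign_word_speakers(diarize_model(audio), result)

# Diarization only distinguishes SPEAKER_00, SPEAKER_01, ...; mapping one of those
# labels to "the user" (speaker identification) is the piece we still need.
for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```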
Image-based Memories: Employ LLaVA for image description. Multimodal LLMs can produce rich textual descriptions, though not always with the greatest accuracy. We're prototyping, so we'll roll with it while the field keeps moving; human visual memory isn't perfect either.
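One way to run this step is through the Hugging Face transformers port of LLaVA; the checkpoint name and prompt format below are assumptions, and the original LLaVA repo exposes its own interface:

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint of LLaVA 1.5
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("snapshot.jpg")
prompt = "USER: <image>\nDescribe this scene as a first-person observation. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))  # text of a new image-based memory
```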
Future: Transition to hardware solutions, such as smart glasses with a camera and microphone, or a body cam.
Additionally, as noted in the paper "Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction"[^2], structured scene descriptions allow models to predict what might happen in-between or in subsequent frames. This has significant potential in various applications that can be explored later.
Memory Consolidation (SOM Sleep)
A Self-Organizing Map (SOM) is a type of artificial neural network, trained with unsupervised learning, that maps high-dimensional inputs onto a low-dimensional grid while preserving their topology.
Mimicking Human Sleep Cycles: During 'SOM Sleep', the Exocortex would use SOMs to reorganize and consolidate memories, akin to how human brains are believed to consolidate memories during sleep. This process could involve strengthening important connections between memories and weakening or pruning less important ones.
Contextual Integration: During SOM Sleep, the Exocortex could use SOMs to integrate recent memories with old ones, updating and recontextualizing short-term memories into long-term memories based on new information and reflections.
Memory Pruning and Cleanup: SOM Sleep could also involve cleaning up the memory storage, identifying and safely removing redundant or irrelevant memories, similar to how our brains are believed to prune unnecessary connections during sleep.
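A rough sketch of what a consolidation pass could look like, using the MiniSom library to cluster the embeddings of the earlier Memory records; the keep/prune policy here is a placeholder for the real one:

```python
import numpy as np
from minisom import MiniSom

def som_sleep(memories, grid=(10, 10), iterations=5000, prune_importance=2.0):
    """Cluster memory embeddings on a small SOM, then consolidate cell by cell."""
    data = np.array([m.embedding for m in memories])
    som = MiniSom(grid[0], grid[1], data.shape[1], sigma=1.0, learning_rate=0.5)
    som.random_weights_init(data)
    som.train_random(data, iterations)

    # Group memories by their best-matching unit (cell) on the map.
    cells = {}
    for memory, vector in zip(memories, data):
        cells.setdefault(som.winner(vector), []).append(memory)

    kept = []
    for cell_memories in cells.values():
        cell_memories.sort(key=lambda m: m.importance, reverse=True)
        kept.append(cell_memories[0])  # always keep the most poignant memory in a cell
        kept.extend(m for m in cell_memories[1:] if m.importance >= prune_importance)
        # A fuller pass would also write a 'reflection' memory summarizing each cell.
    return kept
```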
OMNIA: An Echo of Environments
Definition: O.M.N.I.A. (Operating Matrix of Networked Intelligent Avatars) is a digital environment constructed from natural language observations of the local environment. In the original paper, "Generative Agents: Interactive Simulacra of Human Behavior"[^1], the town of Smallville was crafted by hand. By extracting environmental data, or synthesizing it, for one or more exocortexes, we can roughly simulate how a construct might interact with its surroundings.
This concept opens the door for constructs to act within a real or simulated space, and respond to it in meaningful ways, even engaging in proactive assistance personalized to the user's needs.
Normally, when interacting in this world, entities other than yourself are driven by generative agents. Substituting those agents with the constructs of real people (each inhabiting a description of their own local space) enables intelligent exchange of information in a highly dynamic and meaningful way, without violating the rights and privacy of the parties involved.
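To make the idea concrete, here is a toy sketch of an environment assembled from natural language observations instead of hand-crafted assets; the structure and phrasing are invented purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Place:
    name: str
    observations: list[str] = field(default_factory=list)  # natural language, e.g. drawn from an exocortex
    contains: list["Place"] = field(default_factory=list)

def describe(place: Place, depth: int = 0) -> str:
    """Render the environment tree into prose a construct can perceive and act on."""
    lines = [("  " * depth) + f"{place.name}: " + " ".join(place.observations)]
    for child in place.contains:
        lines.append(describe(child, depth + 1))
    return "\n".join(lines)

# A tiny synthesized environment instead of a hand-crafted Smallville.
desk = Place("desk", ["A laptop sits open to an unfinished draft."])
kitchen = Place("kitchen", ["The kettle has just boiled."])
home = Place("home office", ["It is late evening."], contains=[desk, kitchen])
print(describe(home))
```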
Conclusion
The foundations are emerging technologies, and realizing this vision will require extensive research and advances in machine learning. Luckily, all the pieces are there, ready to be picked up.
[^1]: Joon Sung Park and Joseph C. O'Brien and Carrie J. Cai and Meredith Ringel Morris and Percy Liang and Michael S. Bernstein (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv preprint arXiv:2304.03442.
[^2]: Vaishnavi Himakunthala and Andy Ouyang and Daniel Rose and Ryan He and Alex Mei and Yujie Lu and Chinmay Sonar and Michael Saxon and William Yang Wang (2023). Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction. arXiv preprint arXiv:2305.13903.