facebookresearch / fairo

A modular embodied agent architecture and platform for building embodied agents

Modularized droidlet: refactored design #337

Open soumith opened 3 years ago

soumith commented 3 years ago

This is a fairly involved rearchitecting of droidlet, so I'll provide as much context as possible. I've discussed this with @kavyasrinet and @aszlam who both approved this direction.

Motivation

There are several components of droidlet that are useful by themselves. Perception, memory, low-level control primitives and dialog models are components that are independently useful in many projects without having to build a monolithic agent.

Making droidlet a set of independent components that compose well together can extract a lot more value for the user.

There are several projects in active learning, unsupervised learning, AR/VR, and dialog assistants that can be formulated with only a subset of the components available in droidlet. Yet, as of today, we need to create monolithic agents to do research on any of these topics.

The current state of the codebase doesn't encode that independence assumption. Almost every component of droidlet is dependent on another, and all of them are dependent on having an opaque access pattern via self.agent.

Here is what the dependency graph of droidlet looks like as of today:

[Figure: current droidlet dependency graph, with nearly every component depending on the others]

This pattern is almost universally present. Here are some examples:

./base_agent/nsp_dialogue_manager.py:        ProgramNode.create(self.agent.memory, d)
./locobot/agent/perception/perception.py:                "input": InputHandler(self.agent, read_from_camera=True),
./craftassist/agent/low_level_perception.py:            BlockObjectNode.create(self.agent.memory, [(xyz, idm)])

The second issue with this monolithic state is that testing and debugging have become fairly hard because of action-at-a-distance effects. From our unit testing to our end-to-end testing via oncalls, we see the effects of interdependence; @kavyasrinet mentions that this is common during minor refactors as well.

The benefit of removing action-at-a-distance interactions almost universally outweighs the debugging and testing complexity of keeping them.

The Proposal

[Figure: proposed modular dependency structure]

The proposal is simple in nature, though the work involved is substantial: a lot of mechanical refactoring, plus redesigning several APIs to be nicer in this new world.

perception, memory and dialog are independent components that have no dependencies on other components in the project.

Here is the folder structure for them in the root directory.

perception/
memory/
dialog/
lowlevel/
    minecraft/
    locobot/
    franka/
    hellorobot/
    habitat/

They will install as droidlet.perception, droidlet.memory, droidlet.dialog, ...

interpreter will depend on memory. We will completely clean out the notion of a controller, which was often confused with / used interchangeably with interpreter, and instead just have interpreter.

interpreter/
    base/                     # level 0
    robot/                    # level 1
    minecraft/                # level 1
    franka/                   # level 2
    hello/                    # level 2

The interpreter tests need memory fills, i.e. entries in memory, to test functionality correctly. So interpreter unit tests will need some small agents (not agents in the old monolithic-class sense, but just a quick function that creates memory entries).
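For illustration, here is a minimal sketch of such a memory-fill helper; the push() keys and payload shape are assumptions, not the final API:

from droidlet.memory import Memory

def make_test_memory():
    # a "quick agent" for interpreter unit tests: just enough memory
    # entries for the interpreter to query against
    mem = Memory(schemas=["droidlet.schema"])
    mem.push("reference_objects", {"name": "chair", "xyz": (0.0, 1.0, 2.0)})
    return mem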

Lastly, there will be an agents folder that has one file per agent of a certain characteristic.

agents/
    robot_assistant.py
    minecraft_assistant.py
    active_learner.py
    object_grasping.py

The agents will be a single file each, and can be "unstructured" or "structured". More on this later in this proposal.

Memory schema and Interpreter Dictionary

There are three monolithic memory schemas that exist right now: the droidlet base schema, the minecraft schema, and the locobot schema.

Either the minecraft or the locobot Level-1 schema can be composed on top of the droidlet base schema, which we call the Level-0 schema.

While we could try to over-modularize the memory-schemas into "perception schema", "arm schema", etc., we aren't aiming to do so in the current proposal.

The end state we'd like to see from this refactor is to keep this composable leveling pattern and, if needed, formalize the "leveling" or "composability", so that one knows, either from documentation or via a runtime error, which schemas are composable on top of others. For example, it should be encoded that the minecraft schema depends on / composes on top of the droidlet schema.
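One hypothetical way to formalize that composability at load time (the dependency table and error message here are illustrative, not a settled API):

# each schema declares what it composes on top of; Memory could run this
# check at load time and raise if a dependency is missing
SCHEMA_DEPENDENCIES = {
    "droidlet.schema": [],                    # Level-0 base
    "minecraft.schema": ["droidlet.schema"],  # Level-1
    "locobot.schema": ["droidlet.schema"],    # Level-1
}

def check_schema_composition(schemas):
    loaded = set()
    for schema in schemas:
        missing = [dep for dep in SCHEMA_DEPENDENCIES[schema] if dep not in loaded]
        if missing:
            raise RuntimeError(f"{schema} composes on top of {missing}; load those first")
        loaded.add(schema)

check_schema_composition(["droidlet.schema", "locobot.schema"])  # ok
check_schema_composition(["locobot.schema"])                     # raises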

Similarly, we have dictionaries in the canonical interpreter representation, where some base words are universal, but some specific words are related to capabilities of a given environment or agent.

For example, move is a common word in all agents (but doesn't have to be), but grasp or pinch might be related to specific agent capabilities. Some environments can have new capabilities, for example swim.

We will have a base interpreter dictionary, and then have subsequent composable levels. This is again not too far from the status quo, so I won't detail it here. There is no plan to break the vocabulary down into further hierarchical composable vocabularies than the current scheme we have.
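To make the leveling concrete, an illustrative sketch; the vocabulary contents and the merge rule are assumptions:

# base words are universal; higher levels add (or override) words tied to
# the capabilities of a specific environment or agent
BASE_VOCAB = {"move": "handle_move"}
ROBOT_VOCAB = {"grasp": "handle_grasp", "pinch": "handle_pinch"}
SWIM_ENV_VOCAB = {"swim": "handle_swim"}

def compose_vocab(*levels):
    vocab = {}
    for level in levels:
        vocab.update(level)
    return vocab

robot_vocab = compose_vocab(BASE_VOCAB, ROBOT_VOCAB, SWIM_ENV_VOCAB)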

Pseudo-code

Individual Components

Once we get rid of self.agent and self.memory in almost all the other components, this means two things:

  1. These components no longer need to carry state, and hence can be purely functional
  2. The exchange of state is explicitly what the agent enables

For example, using the object detector would simply be:

from droidlet.perception import ObjectDetector

object_detector = ObjectDetector(classes="coco")  # loads the pre-trained detector weights, etc.
objects = object_detector(input_image)

To use something stateful like the object tracker, one does something like this:

from droidlet.perception import ObjectDetector, ObjectTracker

object_detector = ObjectDetector(classes="coco")
tracker = ObjectTracker()
all_objects = []

objects1 = object_detector(input_image_1)
unique_objects1, all_objects = tracker(objects1, all_objects)

objects2 = object_detector(input_image_2)
unique_objects2, all_objects = tracker(objects2, all_objects)

Similarly, the dialogue manager won't store the previous conversational state within itself, and will be purely functional:

from droidlet.dialog import DialogModel
# here, DialogModel is the one that includes pre-processing and post-processing

model = DialogModel(vocab=["droidlet", "locobot"])

previous_state = []
text = "go to the chair"

dialog_object = model(text)

previous_state += [text, dialog_object]

Agent code

Simple functional agent

This results in a full-scale agent doing mostly state-passing and event-loop processing.

A simple agent would be a giant ~200 line function that creates all necessary components of the agent and defines how they interact with each other.

from droidlet.perception import ObjectDetector, ObjectTracker, HumanPose
from droidlet.memory import Memory
from droidlet.lowlevel import Locobot
from droidlet.interpreter.robot import Interpreter
from droidlet.dialog import DialogModel
from droidlet.utils import BackgroundTask
from dldashboard.utils import get_text

def agent():
    # Initialize components

    # perception
    detector = ObjectDetector(...)
    tracker = ObjectTracker(...)
    human_pose = HumanPose(...)

    def perceive(rgbd, prev_state):
        objects = detector(rgbd)
        unique_objects, next_state = tracker(objects, prev_state)
        humans = human_pose(rgbd)
        return unique_objects, humans, next_state
    perceive_background = BackgroundTask(perceive)

    locobot = Locobot(ip="IP_ADDRESS")
    mem = Memory(schemas=["droidlet.schema", "locobot.schema"])
    vocab = ["droidlet", "locobot"]
    dialog_model = DialogModel(vocab=vocab)
    interpreter = Interpreter(vocab=vocab, memory=mem)

    # start the task loop
    while True:
        try:
            text = get_text()
            dobj = dialog_model(text)
            mem.push("dialog_state", [text, dobj])
            interpreter.push(dobj)
        except TextNotAvailableException:
            pass

        rgbd = locobot.get_rgb_depth()
        prev_objects = mem.get_all("perception_objects")
        perceive_background.process(rgbd, prev_objects)
        try:
            unique_objects, humans, next_state = perceive_background.dequeue()
            mem.push("perception_objects", unique_objects)
            mem.push("perception_humans", humans)
        except NotReadyException:
            pass

        actions = interpreter.get_actions()
        for action in actions:
            locobot.execute(action)

The only state that is being pushed or pulled outside of this code block is when the interpreter directly interacts with memory. All other state is explicitly carried through the agent implementation here.

This might be slightly more verbose, but it does benefit from two productivity boosters:

  1. The code is much easier to read, for people not familiar with the codebase and even generally
  2. Debugging becomes much easier, because you are really only debugging this function, assuming that all the imported components are fairly well tested

Pub-Sub agent

I haven't fully fleshed out the design for this style of agent, but, as an opt-in, you can use this style when dealing with complex projects. In some future distribution, we could even replace a trivial but not-as-performant pub-sub implementation with a much tighter C++-optimized one.

Final design here still TBD.

from droidlet.perception import ObjectDetector, ObjectTracker, HumanPose
from droidlet.memory import Memory
from droidlet.lowlevel import Locobot
from droidlet.interpreter.robot import Interpreter
from droidlet.dialog import DialogModel
from droidlet.utils import BackgroundTask
from dldashboard.utils import get_text, save_to_mem

def agent():
    # Initialize components

    # perception
    detector = ObjectDetector(...)
    tracker = ObjectTracker(...)
    human_pose = HumanPose(...)

    def perceive(rgbd, prev_state):
        objects = detector(rgbd)
        unique_objects, next_state = tracker(objects, prev_state)
        humans = human_pose(rgbd)
        return unique_objects, humans, next_state
    perceive_background = BackgroundTask(perceive)

    locobot = Locobot(ip="IP_ADDRESS")
    mem = Memory(schemas=["droidlet.schema", "locobot.schema"])
    vocab = ["droidlet", "locobot"]
    dialog_model = DialogModel(vocab=vocab)
    interpreter = Interpreter(vocab=vocab, memory=mem)

    # dlevent below is the proposed pub-sub API (final design TBD)
    @dlevent.publish("rgb")
    def rgb_pub():
        return locobot.get_rgb_depth()

    dlevent.publish("text", get_text())
    dlevent.trigger("rgb", perceive_background, publish="detections")

    dlevent.trigger(["rgb", "detections", "text", "dialog"], save_to_mem)

    dlevent.trigger("dialog", lambda ir: interpreter.push(ir))

    @dlevent.subscribe("interpreter_get_action")
    def take_action(actions):
        for action in actions:
            locobot.execute(action)

This pub-sub model doesn't make a ton of sense in this toy example, but it could start making sense in fairly complex projects where you do need to break many parts of this logic into hundreds of lines or even separate files.

Breakdown of tasks

I haven't fully mapped the order in which PRs will appear, but I will keep them incremental and hopefully easy to review.

aszlam commented 3 years ago

this is really great.

some comments:

kavyasrinet commented 3 years ago

Thanks so much for putting this together - this looks great.

On a high level, I really like pushing everything through the agent, but one thing that comes to mind: should we design it so that the interpreter also communicates with memory through the agent? This is just me thinking out loud, but it would get us closer to keeping these components self-sufficient, because right now the interpreter and memory are closely tied: the interpreter needs to know the schema of memory in order to return an action at the end. What I am proposing is, for example:

  1. text comes in -> "go to the chair"
  2. We get the logical form from the DialogModel
  3. Send this logical form to the interpreter -> the interpreter reads the dictionary and sends back requests to the agent to fill in missing information from memory (in this case, the coordinates of "chair")
  4. Agent fetches info from perception and updates memory
  5. Agent returns the interpreter's request
  6. Interpreter now sends back a task/action as in the snippet above.
  7. Agent executes the task in its environment using low-level capabilities
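A rough sketch of what that loop could look like; every method name here is hypothetical:

# the interpreter never touches memory directly; it emits requests that
# the agent fulfills on its behalf
def handle_chat(text, dialog_model, interpreter, mem, locobot):
    logical_form = dialog_model(text)          # steps 1-2
    request = interpreter.start(logical_form)  # step 3: may ask for missing info
    while request is not None:
        info = mem.get(request)                # step 4: agent queries memory
        request = interpreter.provide(info)    # step 5: agent returns the answer
    task = interpreter.get_task()              # step 6: an executable task/action
    locobot.execute(task)                      # step 7: run it in the environment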
soumith commented 3 years ago

The only reason I think routing the Interpreter through the Agent is not needed is that it fails the "independently useful" test (from our discussions). The interpreter doesn't seem to be useful without an attached memory, right?

But if it is generally valuable to do this decoupling to reduce action-at-a-distance effects, then I am supportive.

kavyasrinet commented 3 years ago

Right, so the main job of the interpreter right now is to read the logical form (which is a partially executable program), to fill in missing pieces of the logical form by querying memory, and then to push a task onto the task stack. A task here is something that's executable in the environment, for example: Move to (0, 0, 0).

The reason the interpreter (for example, the "put_memory_handler") can cause an action-at-a-distance effect is, for example, a case like "tag that as x", where the interpreter can directly write a triple to memory.

But yeah, trying to think of whether the interpreter could be useful just by itself if we remove the memory I/O. I think we still need a component that can construct a task from a logical form, but it isn't clear to me yet whether it passes the independently-useful test. What do you all think?

aszlam commented 3 years ago

even if interpreter is not independently useful, I feel like @kavyasrinet's suggestion is quite nice: once we have decided that modules route through the agent, then imo the interpreter should route memory requests through the agent too.

soumith commented 3 years ago

okay, let's do it. I'll disentangle memory from the interpreter as well.

kavyasrinet commented 3 years ago

Where should artifacts like models, datasets, etc. live in the new refactor? Can I propose a droidlet/artifacts/ folder? This can also be a step towards our model and dataset zoo. If we do it at the root level, it would look like:

artifacts/
    base/
        models/
           semantic_parser/
        datasets/
           semantic_parsing_datasets/
    craftassist/
        models/ (any craftassist models here)
        datasets/
    locobot/
        models/
        datasets/
perception/
memory/
dialog/
lowlevel/

the agent process will instantiate these and pass them down to the respective modules (for example: passing the dialog model to the dialog module), as sketched below
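For instance, a sketch of how the agent might wire an artifact into a module under this layout; the model_dir keyword is hypothetical, not an agreed API:

import os
from droidlet.dialog import DialogModel

parser_dir = os.path.join("artifacts", "base", "models", "semantic_parser")
dialog_model = DialogModel(vocab=["droidlet", "locobot"], model_dir=parser_dir)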

and if we do it at the component level, it will look like:

perception/
    base/
        artifacts/
          models/
          datasets/
    craftassist/
        artifacts/
          models/
          datasets/
    robots/
        artifacts/
          models/
          datasets/
memory/
dialog/
    base/
        artifacts/
          models/
          datasets/
lowlevel/

I have a slight preference for the first one but we can discuss both (or any other way of organizing these)

soumith commented 3 years ago

the droidlet/ folder is generally installed as a package onto the system via python setup.py install, so a droidlet/artifacts directory is generally not writeable without sudo because it will be installed to something like /usr/local/lib/site-packages/...

Right now, we are using python setup.py develop as a temporary workflow, so the local droidlet is the same as the installed package.

So, putting artifacts inside the droidlet folder is not a good idea.

We have two options:

  1. Keep the artifacts somewhere in the root of the project, for example an artifacts/ folder at the root, alongside droidlet and agents.
  2. Keep the artifacts (which are almost always downloaded) in this order:
    • Where the user set the artifact path in code, via droidlet.set_artifact_dir(...)
    • $DROIDLET_HOME env variable if available
    • $XDG_CACHE_HOME env variable if available
    • ~/.cache/droidlet as the default path

Instead of passing the path to artifacts through all function levels, we can have a droidlet.get_artifact_dir(...) that walks through these prioritized resolutions and returns the path to the artifacts dir.

Reference: https://pytorch.org/docs/stable/hub.html#where-are-my-downloaded-models-saved
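A minimal sketch of option 2's resolver, assuming a module-level variable behind set_artifact_dir:

import os

_artifact_dir = None  # set via droidlet.set_artifact_dir(...)

def set_artifact_dir(path):
    global _artifact_dir
    _artifact_dir = path

def get_artifact_dir():
    # walk the prioritized resolutions above, in order
    if _artifact_dir is not None:
        return _artifact_dir
    if "DROIDLET_HOME" in os.environ:
        return os.environ["DROIDLET_HOME"]
    if "XDG_CACHE_HOME" in os.environ:
        return os.path.join(os.environ["XDG_CACHE_HOME"], "droidlet")
    return os.path.expanduser("~/.cache/droidlet")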

what do you think?

kavyasrinet commented 3 years ago

Oh yes, that's a good point about keeping this outside of the package directory. I have a slight preference for the first option because it's simpler, but the second one looks neater. In the spirit of the refactor, in both cases we also need some hierarchical directory structure like:

artifacts/
    base/
        models/
           semantic_parser/
        datasets/
           semantic_parsing_datasets/
    craftassist/
        models/ (any craftassist models here)
        datasets/
    locobot/
        models/
        datasets/

Tangential, but that said, I think we should also refactor our data and model download scripts after this big refactor. Thinking of this as a user who, for example, wants to write a new agent: we should only force-download some must-have artifacts (for example: the parser) but not all, or even make downloads completely optional.

soumith commented 3 years ago

I just re-thought this; (1) is not feasible.

Basically, if you run the unit tests or the agent from elsewhere (or worse, someone builds an agent in their own repository), then determining "the root directory" is neither obvious nor feasible. So we'll have to do (2).

soumith commented 3 years ago

Here are some guidelines for Phase 2:

  1. The only state that components in droidlet/ can keep is state that is externally loaded and remains static after the initial load for the lifetime of the agent:
class Foo(...):

    def __init__(self, *args, **kwargs):

        # these are okay
        self.state = load_state_from_disk(...)
        self.sql_db = sqlite.connect(...)

        # these are not okay
        self.history_of_conversations = []  # to be updated every time a new conversation is processed by Foo

        # somewhere in between, where you have to make a judgement call
        # here, you would rather pass update_frequency as an argument
        # into the processing function, e.g. Foo.process_dialog(text, update_frequency)
        self.update_frequency = args[0]

Strive to make components purely functional (or static-functional).

If the component doesn't need any static state, make it a function rather than a class.

  2. As you make components static-functional or pure functions, it gets much easier to write unit tests for them.

Let's expand unit test coverage for every part we make purely functional.

  3. The natural consequence is that all state now has to be routed via the agent in the agents/ folder. In this phase, don't worry about making the agents ugly in code. We will spend phase 3 restructuring agents.

  4. Dependencies: as a natural consequence of (1) and (3), the dependency diagram above will start emerging (the only change being that tasks.py can depend on lowlevel). Enforce the dependency structure via unit tests and imports, i.e. memory and perception depend on nothing else, for example, and the imports in implementation files and unit tests should reflect that; a sketch of such a test follows.
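As one illustration of enforcing the structure via unit tests (the forbidden-import list and paths are assumptions):

# fail if droidlet.memory imports from components it must not depend on;
# run the same check for perception, dialog, etc.
import ast
import pathlib
import unittest

FORBIDDEN = ["droidlet.perception", "droidlet.dialog", "droidlet.interpreter"]

class TestMemoryDependencies(unittest.TestCase):
    def test_no_forbidden_imports(self):
        for path in pathlib.Path("memory").rglob("*.py"):
            tree = ast.parse(path.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    names = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom) and node.module:
                    names = [node.module]
                else:
                    continue
                for name in names:
                    for bad in FORBIDDEN:
                        self.assertFalse(name.startswith(bad),
                                         f"{path} imports {name}")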

Lastly, I'll take Perception. @kavyasrinet wanted to take Memory. Yuxuan, I'll let you determine what you want to take.

kavyasrinet commented 3 years ago

Posting a high-level workflow from interpreter -> tasks here, just as an FYI, based on a discussion with @aszlam this morning. Example chat: "spawn a pig"

  1. Agent calls DialogManager:
     input: "spawn a pig"
     output:
     {
       "dialogue_type": "HUMAN_GIVE_COMMAND",
       "action_sequence": [
         {
           "action_type": "SPAWN",
           "reference_object": {
             "filters": {
               "triples": [{"pred_text": "has_name", "obj_text": [0, [2, 2]]}]
             }
           }
         }
       ]
     }
  2. With the logical form output above, the agent now decides to call the interpreter with the logical form, so:
     input:
     {
       "dialogue_type": "HUMAN_GIVE_COMMAND",
       "action_sequence": [
         {
           "action_type": "SPAWN",
           "reference_object": {
             "filters": {
               "triples": [{"pred_text": "has_name", "obj_text": [0, [2, 2]]}]
             }
           }
         }
       ]
     }
     output: TaskNode

The way the interpreter does this: it first reads the logical form and then calls handle_spawn (note that this mapping of logical form -> handler call is done here). The handler, in turn, reads through the subdict and returns maybe_task_list_to_control_block. In this case, since we just want to spawn one pig (in other words, there are no repeats, e.g. "spawn 5 pigs", or composite actions, e.g. "spawn a pig and then come here"), the function returns a single task with task_list, added here.

Note that this is a Spawn task, defined here, which returns a MemoryNode associated with this task.

TaskNodes are maintained in memory and here is how the agent steps through the tasks.
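A rough sketch of what that stepping loop might look like; the Memory and task method names here are assumptions, not the actual API:

# each tick, pull the next unfinished TaskNode from memory, step it, and
# execute whatever low-level action the step yields
def step_tasks(mem, locobot):
    task = mem.get_next_task()
    if task is None:
        return
    action = task.step()
    if action is not None:
        locobot.execute(action)
    if task.finished():
        mem.mark_task_done(task)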