soumith opened 3 years ago
this is really great.
some comments:
Thanks so much for putting this together - this looks great.
On a high level, I really like pushing everything through the agent, but one thing that comes to mind: should we design this so that the interpreter also communicates with memory through the agent? This is just me thinking out loud, but it would get us closer to keeping these components self-sufficient, because right now the interpreter and memory are closely tied: the interpreter needs to know the schema of memory in order to return an action at the end. What I am proposing, for example:
DialogModel
"chair"
The only reason I think routing the `Interpreter` directly through the `Agent` is not needed is that it fails the "independently useful" test (from our discussions). The interpreter doesn't seem to be useful without an attached memory, right?
But if it is generally valuable to do this decoupling to reduce action-at-a-distance effects, then I am supportive.
Right, so the main job of the interpreter right now is to read the logical form (which is a partially executable program), to fill in missing pieces in the logical form by querying memory, and then to push a task onto the task stack. A task here is something that's executable in the environment, for example: `Move to (0, 0, 0)`.
The reason the interpreter (for example `put_memory_handler`) can cause an action-at-a-distance effect is, for example, in this case: "tag that as x", where the interpreter can directly write a triple to memory.
But yeah, I'm trying to think of whether the interpreter could be useful by itself if we remove the memory I/O. I think we still need a component that can construct a task from a logical form, but it isn't clear to me yet whether it passes the independently-useful test. What do you all think?
even if the interpreter is not independently useful, I feel like @kavyasrinet's suggestion is quite nice: once we have decided that modules route through the agent, then imo the interpreter should route memory requests through the agent too.
okay, let's do it. I'll disentangle memory from the interpreter as well.
Where should artifacts like models, datasets etc. live in the new refactor?
Can I propose a `droidlet/artifacts/` folder? This can also be a step towards our model and dataset zoo.
So if we do it at the root level, it would look like:

```
artifacts/
    base/
        models/
            semantic_parser/
        datasets/
            semantic_parsing_datasets/
    craftassist/
        models/       # any craftassist models here
        datasets/
    locobot/
        models/
        datasets/
perception/
memory/
dialog/
lowlevel/
```
the agent process will instantiate these and pass them down to the respective modules (for example: passing the dialog model to the dialog module)
and if we do it at the component level, it will look like:

```
perception/
    base/
        artifacts/
            models/
            datasets/
    craftassist/
        artifacts/
            models/
            datasets/
    robots/
        artifacts/
            models/
            datasets/
memory/
dialog/
    base/
        artifacts/
            models/
            datasets/
lowlevel/
```
I have a slight preference for the first one but we can discuss both (or any other way of organizing these)
the `droidlet/` folder is generally installed as a package onto the system via `python setup.py install`, so a `droidlet/artifacts` directory is generally not writeable without sudo, because it will be installed to something like `/usr/local/lib/site-packages/...`
Right now, we are using `python setup.py develop` as a temporary workflow, so the local droidlet is the same as the installed package.
So, putting artifacts inside the droidlet folder is not a good idea.
We have two options:
1. an `artifacts/` folder alongside `droidlet` and `agents`
2. a prioritized resolution of the artifacts path:
   - a path set via `droidlet.set_artifact_dir(...)`
   - the `$DROIDLET_HOME` env variable, if available
   - the `$XDG_CACHE_HOME` env variable, if available
   - `~/.cache/droidlet` as the default path

Instead of passing the path to `artifacts` through all function levels, we can have a `droidlet.get_artifact_dir(...)` that walks through these prioritized resolutions of the path and returns the path to the artifacts dir.
Reference: https://pytorch.org/docs/stable/hub.html#where-are-my-downloaded-models-saved
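A resolver along these lines might look like the following sketch (function and variable names here are assumptions, not a settled droidlet API; whether `$DROIDLET_HOME` points at the artifacts dir itself or its parent is also assumed):

```python
import os
from pathlib import Path

# Module-level override, set via set_artifact_dir(...); highest priority.
_artifact_dir = None

def set_artifact_dir(path):
    global _artifact_dir
    _artifact_dir = Path(path)

def get_artifact_dir():
    """Walk the prioritized resolutions described above and return the path."""
    if _artifact_dir is not None:
        return _artifact_dir
    if "DROIDLET_HOME" in os.environ:
        return Path(os.environ["DROIDLET_HOME"]) / "artifacts"
    if "XDG_CACHE_HOME" in os.environ:
        return Path(os.environ["XDG_CACHE_HOME"]) / "droidlet"
    return Path.home() / ".cache" / "droidlet"
```

Callers anywhere in the stack would then just call `get_artifact_dir()` instead of threading a path argument through every function level.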
what do you think?
Oh yes, that's a good point about moving this outside of the droidlet directory. I have a slight preference for the first option because it's simpler, but the second one looks neater. In the spirit of the refactor, in both cases we also need some hierarchical directory structure like:
```
artifacts/
    base/
        models/
            semantic_parser/
        datasets/
            semantic_parsing_datasets/
    craftassist/
        models/       # any craftassist models here
        datasets/
    locobot/
        models/
        datasets/
```
Tangential, but that said: I think we should also refactor our data and model download scripts after this big refactor. Thinking of this as a user who, for example, wants to write a new agent, we should only force downloading some must-have artifacts (for example: the parser) but not all, or even make downloads completely optional.
I just re-thought this: (1) is not feasible.
Basically, if you run the unit tests or the agent from elsewhere (even worse, someone builds an agent in their own repository), then determining "the root directory" is not obvious / feasible. So, we'll have to do (2).
Here are some guidelines for Phase 2:
```python
class Foo(...):
    def __init__(self, *args, **kwargs):
        # these are okay
        self.state = load_state_from_disk(...)
        self.sql_db = sqlite.connect(...)

        # these are not okay
        self.history_of_conversations = []  # to be updated every time a new conversation is processed by Foo

        # somewhere in between, where you have to make a judgement call;
        # here, you would rather actually pass update_frequency as an
        # argument into the processing function, such as Foo.process_dialog(text, update_frequency)
        self.update_frequency = args[0]
```
Strive to make components purely functional (or static-functional).
If the component doesn't need any static state, make it a function rather than a class.
Let's expand unit test coverage for every part we make purely functional.
The natural thing to happen is that all state now has to be routed via the agent in the `agents/` folder. In this phase, don't worry about making the agents ugly in code. We will spend phase 3 restructuring the agents.
Dependencies: as a natural consequence of (1) and (3), our dependency diagram above will start emerging (except that `tasks.py` can additionally depend on `lowlevel`). Enforce the dependency structure via unit tests and imports, i.e. `memory` and `perception` depend on nothing else, for example, and the imports in the implementation files and unit tests should reflect that.
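One way to enforce this in unit tests (a sketch; the helper name and the example call are made up) is to import a component and then assert that no forbidden sibling component ended up loaded:

```python
import importlib
import sys

def assert_no_cross_imports(module_name, forbidden_prefixes):
    """Import `module_name`, then fail if any forbidden component module
    was (transitively) imported along the way."""
    importlib.import_module(module_name)
    offenders = [m for m in sys.modules
                 if m.startswith(tuple(forbidden_prefixes))]
    assert not offenders, f"{module_name} transitively imported: {offenders}"

# e.g., a droidlet unit test might call:
# assert_no_cross_imports("droidlet.memory", ["droidlet.perception", "droidlet.dialog"])
```

Run in a fresh interpreter per component so previously loaded modules don't mask violations.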
Lastly, I'll take Perception. @kavyasrinet wanted to take Memory. Yuxuan, I'll let you determine what you want to take.
Posting a high-level workflow from interpreter -> tasks here, just as an FYI, based on a discussion with @aszlam this morning. Example chat: "spawn a pig"
DialogManager
:
input: "spawn a pig"
output:
{"dialogue_type": "HUMAN_GIVE_COMMAND",
"action_sequence": [
{"action_type": "SPAWN",
"reference_object": {
"filters": {
"triples": [{"pred_text": "has_name", "obj_text": [0, [2, 2]]}]
}
}
}
]
}
`Interpreter`:
input:

```json
{
  "dialogue_type": "HUMAN_GIVE_COMMAND",
  "action_sequence": [
    {
      "action_type": "SPAWN",
      "reference_object": {
        "filters": {
          "triples": [{"pred_text": "has_name", "obj_text": [0, [2, 2]]}]
        }
      }
    }
  ]
}
```

output: `TaskNode`
The way the interpreter does this: it first reads the logical form and then calls `handle_spawn` (note that this mapping of logical form -> handler call is done here). This handler, in turn, reads through the sub-dict and returns `maybe_task_list_to_control_block`. In this case, since we just want to spawn one pig (in other words, there are no repeats, e.g. "spawn 5 pigs", or composite actions, e.g. "spawn a pig and then come here"), the function returns a single task with `task_list`, added here.
Note that this is a `Spawn` task, defined here, and it returns a MemoryNode associated with this task. `TaskNode`s are maintained in memory, and here is how the agent steps through the tasks.
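The walkthrough above (logical form -> handler -> task pushed to memory) can be sketched roughly as follows; all names here are stand-ins for illustration, not droidlet's actual API:

```python
def handle_spawn(action, memory):
    # In droidlet this would read the "reference_object"/"filters" sub-dict
    # (possibly querying memory) and build a Spawn task; here a plain dict
    # stands in for the TaskNode.
    return {"task": "Spawn", "action": action}

# the logical form -> handler mapping described above
HANDLERS = {"SPAWN": handle_spawn}

class FakeMemory:
    """Stand-in for agent memory holding the task stack."""
    def __init__(self):
        self.task_stack = []

    def task_stack_push(self, task):
        self.task_stack.append(task)

def interpret(logical_form, memory):
    # read the logical form, dispatch each action to its handler,
    # and push the resulting task onto the task stack in memory
    for action in logical_form["action_sequence"]:
        task = HANDLERS[action["action_type"]](action, memory)
        memory.task_stack_push(task)

mem = FakeMemory()
interpret({"action_sequence": [{"action_type": "SPAWN"}]}, mem)
```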
This is a fairly involved rearchitecting of `droidlet`, so I'll provide as much context as possible. I've discussed this with @kavyasrinet and @aszlam, who both approved this direction.

Motivation
There are several components of droidlet that are useful by themselves. Perception, memory, low-level control primitives and dialog models are components that are independently useful in many projects without having to build a monolithic agent.
There are several projects in active learning, unsupervised learning, AR/VR, and dialog assistants that can be formulated with only a subset of the components available in droidlet. Yet, as of today, we need to create monolithic agents to do research on any of these topics.
The current state of the codebase doesn't encode that independence assumption. Almost every component of droidlet is dependent on another, and all of them are dependent on having an opaque access pattern via `self.agent`.
Here is what the dependency graph of droidlet looks like as of today:
This pattern is almost universally present. Here are some examples:
The second issue with this monolithic state is that testing and debugging have become fairly hard because of action-at-a-distance effects. From our unit testing to our end-to-end testing via oncalls, we see the effects of interdependence. @kavyasrinet mentions that it is common during minor refactors as well.
The Proposal
The proposal is simple in nature, though the work involved is a lot of both mechanical refactoring and redesigning several APIs to be nicer in this new world.
`perception`, `memory` and `dialog` are independent components that have no dependencies on other components in the project. Here is the folder structure for them in the root directory.
They will install as `droidlet.perception`, `droidlet.memory`, `droidlet.dialog`, ...
`interpreter` will depend on memory. We will completely clean out the notion of a `controller`, which was often confused with / used interchangeably with `interpreter`, and instead just have `interpreter`.
The `interpreter` tests need `memory` fills, i.e. entries in memory, to test functionality correctly. So, interpreter unit tests will need some small `agents` (not in the old traditional sense with a monolithic class structure, but just a quick function that creates memory entries).
Lastly, there will be an `agents` folder that has one file per agent of a certain characteristic. The agents will be a single file each, and can be "unstructured" or "structured". More on this later in this proposal.
Memory schema and Interpreter Dictionary
There are three monolithic memory schemas that exist right now:
Either the minecraft or locobot Level-1 schemas can be composed on top of the droidlet base schema, which we call the Level-0 schema.
While we could try to over-modularize the memory-schemas into "perception schema", "arm schema", etc., we aren't aiming to do so in the current proposal.
The end state we'd like to see in this refactor is to keep this composable leveling pattern and, if needed, formalize "leveling" or "composability", so that one knows, either from documentation or via a runtime error, which schemas are composable on top of others. For example, it should be encoded that the `minecraft` schema depends on / composes on top of the `droidlet` schema.
Similarly, we have dictionaries in the canonical interpreter representation, where some base words are universal, but some specific words are related to the capabilities of a given environment or agent.
For example, `move` is a common word in all agents (but doesn't have to be), but `grasp` or `pinch` might be related to specific agent capabilities. Some environments can have new capabilities, for example `swim`.
We will have a base interpreter dictionary, and then subsequent composable levels. This is again not too far from the status quo, so I won't detail it here. There is no plan to break the vocabulary down into further hierarchical composable vocabularies beyond the current scheme we have.
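The schema "leveling" described above could be encoded along these lines; this is a sketch only, and the table names and SQL are illustrative, not droidlet's actual schemas:

```python
import sqlite3

# Each level declares which level it composes on top of and the SQL it adds.
SCHEMA_LEVELS = {
    "droidlet": {
        "depends_on": None,
        "sql": "CREATE TABLE Memories (uuid TEXT PRIMARY KEY, node_type TEXT);",
    },
    "minecraft": {
        "depends_on": "droidlet",
        "sql": "CREATE TABLE BlockObjects (uuid TEXT REFERENCES Memories(uuid));",
    },
}

def load_schema(conn, level):
    """Apply a schema level, recursively applying its base level first."""
    spec = SCHEMA_LEVELS[level]
    if spec["depends_on"] is not None:
        load_schema(conn, spec["depends_on"])
    conn.executescript(spec["sql"])

conn = sqlite3.connect(":memory:")
load_schema(conn, "minecraft")  # pulls in the droidlet base schema automatically
```

An unknown or miscomposed level would surface as a `KeyError` or an SQL error at load time, which is exactly the "documentation or runtime error" behavior suggested above.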
Pseudo-code
Individual Components
Once we get rid of `self.agent` and `self.memory` in almost all the other components, this means two things: the components become usable on their own, and the `agent` enables explicit state-passing between them.
For example, using the object detector would simply be:
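A minimal sketch of what this could look like; the class and method names are assumptions, and the model is stubbed:

```python
class ObjectDetector:
    """Stand-in for a purely functional detector: static state (the model) is
    loaded once, and each call takes explicit inputs and returns results."""
    def __init__(self, model=None):
        self.model = model  # loading a model at construction time is fine

    def __call__(self, rgb_image):
        # a real implementation would run self.model on rgb_image;
        # here we return a fixed fake detection
        return [{"label": "chair", "bbox": (0, 0, 10, 10)}]

detector = ObjectDetector()
detections = detector([[0] * 4] * 4)  # any image-like input
```

Nothing is written to an agent or memory; the caller decides what to do with `detections`.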
To use something like the object tracker that is stateful, one does something like this:
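A sketch of such a stateful component (names assumed): the state lives inside the component and changes only through explicit calls, never via a back-reference to an agent:

```python
class ObjectTracker:
    """Stand-in for a stateful tracker: tracks accumulate across update calls."""
    def __init__(self):
        self.tracks = {}

    def update(self, detections):
        # a real tracker would associate detections with existing tracks;
        # here we just append each detection's bbox to a per-index track
        for i, det in enumerate(detections):
            self.tracks.setdefault(i, []).append(det["bbox"])
        return self.tracks

tracker = ObjectTracker()
tracks = tracker.update([{"label": "chair", "bbox": (0, 0, 10, 10)}])
```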
Similarly, the dialogue manager won't store previous conversational state within itself, and will be purely functional:
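For example (a sketch with assumed names and a stubbed parse), the conversation history is an explicit input and the updated history is returned rather than stored on a manager object:

```python
def dialogue_step(chat, conversation_history):
    """Purely functional dialogue step: no hidden state on any object."""
    logical_form = {"dialogue_type": "HUMAN_GIVE_COMMAND", "chat": chat}  # stub parse
    return logical_form, conversation_history + [chat]

lf, history = dialogue_step("spawn a pig", [])
```

The caller (the agent) owns the history and threads it through each call.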
Agent code
Simple functional agent
This results in a full-scale agent doing mostly state-passing and event-loop processing.
A simple agent would be a giant ~200 line function that creates all necessary components of the agent and defines how they interact with each other.
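A heavily condensed toy version of such a function-style agent loop (all names are stand-ins; the parser and interpreter are stubbed):

```python
def run_agent(chats, steps=3):
    """Toy event loop: create state up front, then explicitly pass it between
    the (stubbed) parser, interpreter, and task stepper each iteration."""
    memory = {"tasks": []}
    chat_iter = iter(chats)
    for _ in range(steps):
        chat = next(chat_iter, None)
        if chat is not None:
            logical_form = {"action_type": "SPAWN", "chat": chat}  # stub parser
            memory["tasks"].append(logical_form)                   # stub interpreter
        if memory["tasks"]:
            memory["tasks"].pop(0)  # a real agent would step the task in the environment
    return memory

final_memory = run_agent(["spawn a pig"])
```

The real version would be longer, but the shape is the same: all state is visible in one place and passed explicitly.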
The only state that is being pushed or pulled outside of this code block is when the interpreter directly interacts with memory. All other state is explicitly carried through the `agent` implementation here.
This might be slightly more verbose, but it does benefit from two productivity boosters:
Pub-Sub agent
I haven't fully fleshed out the design for this style of agent, but, as an opt-in, you can use this style when dealing with complex projects. In some future distribution, we could even replace a trivial but not-as-performant pub-sub implementation with a much tighter C++-optimized one.
Final design here still TBD.
This pub-sub model doesn't make a ton of sense in this toy example, but it could start making sense in fairly complex projects where you do need to break many parts of this logic into hundreds of lines or even separate files.
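For illustration only, a trivial in-process pub-sub core might look like this (the final droidlet design is still TBD, as noted above):

```python
class Bus:
    """Trivial pub-sub: components subscribe to topics instead of calling
    each other directly, decoupling producers from consumers."""
    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic, fn):
        self.subscribers.setdefault(topic, []).append(fn)

    def publish(self, topic, payload):
        for fn in self.subscribers.get(topic, []):
            fn(payload)

bus = Bus()
received = []
bus.subscribe("chat", received.append)
bus.publish("chat", "spawn a pig")  # delivered to every "chat" subscriber
```

A C++-optimized implementation could later replace this without changing subscriber code.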
Breakdown of tasks
I haven't fully mapped the order in which PRs will appear, but I will keep them incremental and hopefully easy to review.