- Yes, the current reward function uses ground-truth states from simulation. If you estimate the state from observations instead, the reward function will inherit the estimation noise and become less accurate. Therefore, you should try to minimize estimation error if you go this route.
- I am not sure what you mean by "same blocks can be replaced". Can you give an example?
- In order for 2 parts to be connected, they must meet a distance and angular accuracy constraint (which you can tune). So even though we just count the number of connected parts here, the connected parts have already satisfied the distance / angular accuracy constraints.
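For intuition, here is a minimal sketch of what such a check amounts to. The function name, arguments, and default tolerances below are illustrative only and are not the environment's actual code:

```python
import numpy as np

def is_connectable(pos_a, up_a, pos_b, up_b, dist_tol=0.02, angle_tol_deg=15.0):
    """Illustrative connection test: two connector sites can be welded only if
    they are close enough and their axes are aligned within an angular tolerance.
    pos_* are 3D positions; up_* are unit axis vectors of the connector sites."""
    # Distance constraint: connector origins must be within dist_tol (meters).
    close = np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b)) <= dist_tol
    # Angular constraint: the two site axes must point in (nearly) the same
    # direction, within angle_tol_deg degrees.
    cos_angle = np.clip(np.dot(up_a, up_b), -1.0, 1.0)
    aligned = np.degrees(np.arccos(cos_angle)) <= angle_tol_deg
    return close and aligned
```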
Thanks for your reply.
"same blocks can be replaced": e.g., a table (with 4 sites (s-A, s-B, s-C, s-D)) has 4 legs of the same shape (l-A, l-B, l-C, l-D), then these legs should be interchangeable, i.e., {[l-A, s-A], [l-B, s-B]} should be equal to {[l-A, s-B], [l-B, s-A]}. But the 'recipe' and 'site_recipe' are fixed in recipe files. It seems that the interchangeability of objects of the same shape is not considered in the code. Will calculating the reward cause an error?
I'm sorry, my previous description may not have been clear. I mean that whether a piece of furniture is successfully assembled is determined by checking (a) self._subtask_step == self._success_num_conn (furniture_sawyer_dense.py) or (b) self._num_connected == self._success_num_conn (furniture.py).
In (a), the agent must connect the pairs predefined in the recipe, which means {[l-A, s-A], [l-B, s-B]} is counted as correct but {[l-A, s-B], [l-B, s-A]} is counted as wrong, even though the two assemblies may actually be equivalent.
In (b), even if the agent connects all the parts, the final assembled result can still be wrong. For example, when assembling a Lego toy, if you attach the arms to the head and the head to the legs, all the parts are connected together even though the result is wrong.
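To make the concern concrete, here is a toy sketch. The part/site names follow my table example above; the dictionaries are made up for illustration and are not from the repository:

```python
# Fixed recipe, as in the recipe files: each leg is bound to one specific site.
recipe = {"l-A": "s-A", "l-B": "s-B", "l-C": "s-C", "l-D": "s-D"}

# Suppose the agent attaches two identical legs to swapped (equivalent) sites.
connections = {("l-A", "s-B"), ("l-B", "s-A")}

# (a) Recipe-style check: only connections matching the fixed pairs count.
recipe_matches = sum(1 for leg, site in connections if recipe.get(leg) == site)
print(recipe_matches)    # 0 -> no progress, although the assembly is physically fine

# (b) Count-style check: any successful weld counts, regardless of which site.
print(len(connections))  # 2 -> two parts connected, even if the pairing is "wrong"
```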
Best regards.
But in reality, there is a many-to-many mapping between parts: any of the legs can attach to any of the holes, and all of these are functionally correct. To implement this, the labeling scheme in the XML and the Python code would have to be updated, but we did not do this to keep things simple.
Thanks for your kind reply.
I'm sorry if the question below sounds a bit blunt, but what confuses me is: (1) the dense reward needs recipes (simplified ground truth), and different models need to be trained for different furniture; (2) however, the FurnitureSawyerGenEnv (furniture_sawyer_gen.py) can already assemble a lot of furniture using recipes, with better generalization and better performance. Since this problem can be solved with a hand-designed program, why should we use DL-based methods (IL or RL) to solve it again?
try_connect() in furniture.py checks the connectors of two parts. I want to know whether some furniture pieces have more than one possible final assembled shape. For example, suppose a chair has a backrest and four legs, and the seat has 2 connection points on its upper side (for the backrest) and 4 connection points on its lower side (for the legs). A wrong final shape could have two legs connected to the upper side and the backrest plus the other two legs connected to the lower side. I would like to know whether this situation is possible, in order to judge whether the current method of determining assembly success is reasonable. Since the environment contains more than 60 furniture models, it is very difficult to check them one by one. Does every piece of furniture have only one final shape after assembling all its parts?
Thanks for any reply.
Best regards.
No problem, these are good questions.
Even though a hand-designed optimal controller may exist for the task, there are still practical and scientific benefits to investigating learning-based controllers.
First, the optimal controller, even if it exists, may have several practical constraints. It may require a lot of effort to acquire (e.g. coming up with equations and deriving the optimal control), may be brittle (if the furniture changes, the optimal control needs to be rederived), and may be computationally expensive (a sampling-based motion planner may need a lot of samples to work).
In contrast, learning-based controllers offer some advantages. The effort of acquiring a good controller is shifted from human engineering to automated search (although I concur that RL / IL still requires human engineering, e.g. the reward function). Learning-based methods can generalize and adapt to distribution shift. Finally, learning-based methods can be computationally efficient since all the computation is distilled into a forward pass of a neural network. Note that there is also a rich line of work on integrating hand-designed controllers with RL / IL, so it's not like you have to choose between the two.
Finally, let's look at RL tasks in general. Many of the MuJoCo tasks can be solved with random search or are trivially solved by optimal control (e.g. cartpole). So why do we still benchmark RL on them? The value is in understanding RL in a scientific manner: if we know the properties of a task and how to solve it, we can better diagnose and improve RL when it fails on these toy tasks. With furniture assembly, we are benchmarking RL on long-horizon tasks.
I think there is some confusion. For the furniture pieces with dense reward, we assign 1:1 mappings, e.g. table leg 1 should connect to slot 1, so there is only 1 final shape.
But if you use the sparse reward env, multiple final shapes are allowed for any furniture (e.g. table leg 1 -> slot 1 and table leg 2 -> slot 1 are both allowed). We handle this logic through the XML labeling scheme; please refer to the paper for details.
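For intuition, here is a rough sketch of the difference. This is illustrative only; the actual environment encodes part compatibility in the MuJoCo XML site labels described in the paper, not in Python dictionaries:

```python
# Dense-reward furniture: a fixed 1:1 mapping, so there is exactly one final shape.
dense_recipe = {"leg1": "slot1", "leg2": "slot2", "leg3": "slot3", "leg4": "slot4"}

def dense_ok(leg, slot):
    # Only the pairing written in the recipe counts as correct.
    return dense_recipe.get(leg) == slot

# Sparse-reward furniture: parts are grouped by type, and any leg may attach to
# any slot of the matching type, so several final shapes are acceptable.
legs = {"leg1", "leg2", "leg3", "leg4"}
slots = {"slot1", "slot2", "slot3", "slot4"}

def sparse_ok(leg, slot):
    # Any leg-to-slot pairing is functionally correct.
    return leg in legs and slot in slots

print(dense_ok("leg2", "slot1"))   # False under the 1:1 dense recipe
print(sparse_ok("leg2", "slot1"))  # True under the many-to-many sparse scheme
```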
Thank you very much.
Best regards.
Hi, thanks for sharing this work. I have 3 questions:
1. The dense reward function needs recipes (which seem to be a simplified ground truth) to calculate the reward. Does this mean that, compared with other ways of computing reward (from images, point clouds, poses, etc.), it has more limitations? It seems that only furniture with a recipe can use the dense reward function?
2. The recipe fixes how each part is assembled and does not take into account that identical parts are interchangeable. Will this cause an error when calculating the reward?
3. Success seems to be judged by counting the number of connected parts. How do you make sure the connected parts are actually assembled accurately (in terms of position and orientation)?
Thanks for any reply.
Best regards.