DigiRL-agent / digirl

Official repo for the paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.

Questions about the evaluation strategy #13

Closed: Yahiy closed this issue 1 month ago

Yahiy commented 1 month ago

https://github.com/DigiRL-agent/digirl/blob/d918012ab47c98b2d448b168848c6f6f1936a1e5/digirl/environment/android/evaluate.py#L211

    def __call__(self, last_two_images, intent: str) -> bool:
        """
        last_two_images: a list of two image paths. [last_image_path, second_last_image_path]
        intent: a string representing the user's intent

        Returns:
        - True if the task is completed
        - False otherwise

        If there's an error, it will return False and print the error message
        """
        with Image.open(last_two_images[0]) as img1_src, Image.open(last_two_images[1]) as img2_src:   
            img1 = np.array(img1_src)
            img2 = np.array(img2_src)
        if np.mean((img1.astype(np.float64) - img2.astype(np.float64))**2) < self.threshold:
            print("skipping evaluation due to same images")
            return 0
        # this is an approximation, but it should be fine to add frequently viewed false negatives
        if self.img_matrix is None:
            self.img_matrix = np.expand_dims(img2, axis = 0)
        # will always trigger after the first time
        else:
            distances = np.mean((self.img_matrix.astype(np.float64) - img2.astype(np.float64))**2, axis = (1,2,3))
            if np.min(distances) < self.threshold:
                print("skipping evaluation due to previously seen image, current img_matrix size: ", self.img_matrix.shape[0])
                return 0
            elif self.img_matrix.shape[0] < self.cache_max:
                self.img_matrix = np.concatenate([self.img_matrix, np.expand_dims(img2, axis = 0)], axis = 0)

        print(f"Task: {intent}, image: {last_two_images[1]}")
        eval_res = self._evaluate(intent, last_two_images[1])

        del img1, img2
        return eval_res
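
For reference, the similarity check above boils down to a mean-squared-error comparison between two screenshots. A minimal, self-contained illustration (synthetic arrays stand in for real screenshots, and the threshold value here is arbitrary, not the one used in the repo):

    import numpy as np

    threshold = 100.0  # arbitrary value for illustration; the real threshold is a class attribute

    img1 = np.zeros((4, 4, 3), dtype=np.uint8)   # stands in for the previous screenshot
    img2 = img1.copy()
    img2[0, 0] = [5, 5, 5]                       # tiny change, e.g. only the clock digits differ

    mse = np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2)
    print(mse)               # ~1.56, well below the threshold
    print(mse < threshold)   # True -> evaluation would be skipped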

A few questions about the evaluation strategy in the code:

  1. Why does the code skip evaluation when the two images are similar?
  2. Why does the code skip evaluation if img2 has been seen before?
  3. Why is last_two_images[1], documented as second_last_image_path, the image used for evaluation?
BiEchi commented 1 month ago

Thanks for your question and interest in the evaluation part.

  1. When the two images are similar enough, i.e., they differ only in the wall-clock time shown in the upper-left corner of the screen or in a progress bar, the evaluation result shouldn't differ either. Because we only halt the simulation when we see a Success, the previous step must have been evaluated as a Failure, so this step must also be a Failure.
  2. Similar to the reason in point 1: if img2 has been seen before and the simulation did not halt at that point, that earlier step must have been a Failure, so this one is too (see the sketch after this list).
  3. Should be a typo. [last_image_path, second_last_image_path] -> [last_image_path, this_image_path]. Please refer to this part. I also updated the comments here just now.
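
Here is a compressed sketch of the step-level reasoning in points 1 and 2 (the `evaluate` callable, the loop, and all names are illustrative, not the repo's actual code): the rollout halts on the first Success, so a screenshot that matches the previous one or any cached one can only belong to a step that failed, and the expensive evaluator call can be skipped.

    import numpy as np

    def run_episode(intent, screenshots, evaluate, threshold=100.0):
        """Step-level evaluation with MSE-based skipping (illustrative sketch)."""
        cache = []   # screenshots that have already been evaluated (all Failures)
        prev = None
        for img in screenshots:
            img = np.asarray(img, dtype=np.float64)
            unchanged = prev is not None and np.mean((img - prev) ** 2) < threshold
            seen = any(np.mean((img - c) ** 2) < threshold for c in cache)
            prev = img
            if unchanged or seen:
                continue                   # the matching step was a Failure, so this one is too
            cache.append(img)
            if evaluate(intent, img):      # expensive check, e.g. a VLM query
                return True                # Success halts the rollout
        return False

    # Toy usage: three identical blank screens and an evaluator that never
    # returns Success, so the evaluator is only called once.
    frames = [np.zeros((4, 4, 3)) for _ in range(3)]
    print(run_episode("toy intent", frames, lambda intent, img: False))  # False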
Yahiy commented 1 month ago

Thanks, I get it. It's a step-level evaluator, so identical images mean the evaluation before this step was not a Success, and this step isn't either. I thought it was a trajectory-level evaluator.

BiEchi commented 1 month ago

Closing as the problem is solved.