ADD: These are the OSWorld results when the input space is SoM: [results screenshot not included]
Hi, thanks for reaching out. As mentioned in our paper, all experiments were done with GPT-4o because its requests are cheap and fast. So we only compare Cradle using GPT-4o against the baselines reported in OSWorld that also use GPT-4o, rather than against all models. As you can see in the OSWorld paper, GPT-4o is actually not good at these tasks, and GPT-4V performs much better. If we replaced Cradle's backbone with GPT-4V, there should be a corresponding performance gain. For consistency across all experiments, and taking cost and time into consideration (GPT-4V is much more expensive and slower than GPT-4o), we only report GPT-4o numbers for comparison.
In addition, we want to emphasize that Cradle does not use the ground-truth SoM labels. Strictly following the GCC setting, there is no way to obtain these labels without access to built-in APIs. Instead, we apply the SAM model to segment the screenshots and use the segmentation results to form SoM-ish labels, which we call SAM2SOM labels. The quality of these labels is much worse than that of the real SoM labels used in the OSWorld baselines: many are duplicated, inaccurate, or unnecessary. Moreover, there is no way to recover the semantic meaning of each label, which usually serves as part of the textual input to the model when applying the real SoM method. Even with these noisy labels and without semantic textual input, Cradle still outperforms the baselines that use ground-truth labels. Please refer to our appendix for more details on the SAM2SOM labels.
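For readers unfamiliar with the pipeline, here is a minimal sketch of the SAM2SOM idea, using the `segment_anything` package. This is not Cradle's actual code; the checkpoint path, area thresholds, and drawing details are illustrative assumptions.

```python
# Sketch: segment a screenshot with SAM and overlay numeric Set-of-Mark labels.
# Not Cradle's implementation; thresholds and drawing style are illustrative.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def sam2som(screenshot_path: str, checkpoint: str = "sam_vit_h.pth"):
    # Load SAM and generate candidate masks for the whole screenshot.
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    mask_generator = SamAutomaticMaskGenerator(sam)
    image = cv2.cvtColor(cv2.imread(screenshot_path), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)

    # Crude filtering: drop tiny fragments and near-full-screen masks.
    # The surviving labels are still noisy (duplicated / inaccurate regions),
    # and unlike true SoM there is no semantic name attached to each mark.
    h, w = image.shape[:2]
    keep = [m for m in masks if 100 < m["area"] < 0.9 * h * w]

    # Number each surviving region and draw its mark at the bbox center.
    annotated = image.copy()
    centers = {}
    for idx, m in enumerate(sorted(keep, key=lambda m: -m["area"]), start=1):
        x, y, bw, bh = m["bbox"]  # SAM bboxes are in XYWH format
        cx, cy = x + bw // 2, y + bh // 2
        centers[idx] = (cx, cy)  # label id -> a clickable coordinate
        cv2.putText(annotated, str(idx), (cx, cy),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 0, 0), 2)
    return annotated, centers
```

The annotated image is what the VLM sees; when it answers with a label id, the agent resolves it to the stored center coordinate to issue a click.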
Hey guys, I really appreciate your work!!!! But I'm still confused about why you don't get stronger (i.e., SOTA) results on the OSWorld benchmark. If you check the agent backbone provided in OSWorld, you will find it is very simple and naive: it comprises only a single phase of inference, with no reflection module or memory module like the ones mentioned in your work.
I know that your input space for OSWorld tasks is SoM (mentioned in your appendix), but your result (7.81%) is not SOTA (11.77%) in overall performance on OSWorld.
I understand that your main testbeds are game scenarios. However, given your sophisticated and advanced agent backbone (6 modules) and the meticulous design of the prompts and overall workflow, it is understandable for us to expect more satisfying results on OSWorld, since you aim to design a foundation agent, not one solely for games. What do you think is the bottleneck accounting for this, and why does your elaborate and convincing agent design still get somewhat poor results on the OSWorld benchmark?
I hope what I have proposed makes sense, and I would really appreciate it if you could reply with an explanation. It is my honor to witness the release of such a spectacular and magnificent project! I'm looking forward to your early reply.