MineDojo / Voyager

An Open-Ended Embodied Agent with Large Language Models
https://voyager.minedojo.org/
MIT License

Fails with small models including llama3-8b #157

Closed TimeLordRaps closed 4 weeks ago

TimeLordRaps commented 2 months ago

Before submitting an issue, make sure you read the FAQ.md

Briefly describe your issue

I modified the code so that I could run Voyager with any chat model available through LangChain and test it on several different models. This took a few hours, so I don't recall everything I had to do to make it work, but my diffs indicate:

  1. Replacing self.model_name with the ChatOllama/ChatGroq model object so that the model only needs to be loaded once (this matters more for Ollama).
     a. Then changing each use of self.model_name to the llm, qa_llm, or whichever other LLM Voyager defines throughout the agent folders.
     b. Then adding invoke to the LLM calls.
  2. Swapping the OpenAI embeddings for snowflake-arctic-embed-l, loaded through HuggingFaceEmbeddings from langchain_community.embeddings.
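
The changes listed above can be sketched roughly as follows. This is a minimal illustration and not the actual diff: ActionAgentSketch, FakeChatModel, and FakeReply are hypothetical stand-ins, and the only thing assumed from LangChain is that a chat model object exposes invoke(messages) and returns an object with a content attribute. In a real run, the stand-in would be replaced by an actual model object such as ChatOllama.

```python
from dataclasses import dataclass


@dataclass
class FakeReply:
    """Stand-in for a chat message; only the .content attribute is used."""
    content: str


class FakeChatModel:
    """Hypothetical stand-in for a LangChain chat model (e.g. ChatOllama)."""

    def __init__(self, name: str):
        self.name = name
        self.calls = 0

    def invoke(self, messages):
        self.calls += 1
        return FakeReply(content=f"[{self.name}] reply to {len(messages)} message(s)")


class ActionAgentSketch:
    """Hypothetical agent that receives a model object instead of a model name."""

    def __init__(self, llm):
        # Instead of keeping a model name string (self.model_name), keep the
        # already-constructed model so that it is only loaded once.
        self.llm = llm

    def ask(self, messages):
        # Call the model through invoke() and return the text of the reply.
        return self.llm.invoke(messages).content


shared_llm = FakeChatModel("llama3-8b")       # built once
action_agent = ActionAgentSketch(shared_llm)
critic_agent = ActionAgentSketch(shared_llm)  # reuses the same object

print(action_agent.ask(["How do I craft a wooden pickaxe?"]))
```

Passing the model object in, rather than a name, is what lets several agents share one loaded model.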

The preliminary results on llama3 are shown below. I have also tested the codeqwen and wizardlm2 7b models, and both seem worse than llama3, as expected. Although codeqwen writes good code, it seems to fail in the same way I describe llama3-8b failing below.

Please provide your python, nodejs, Minecraft, and Fabric versions here

python 3.10, node 20.11.0, minecraft 1.20.4, fabric 0.15.9

[If applicable] Please provide the Mineflayer and Minecraft logs, you can find the logs under the logs folder

Too many to list, so I will instead note that llama3-8b seems to fail where llama3-70b succeeds, specifically in spatial, common-sense reasoning about task composition. What I mean is that llama3-8b seems to lack a sense of the order in which things need to be done to accomplish most multi-action steps. Its inability to compose skills seems prohibitive: after more than 30 iterations, llama3-8b had failed to make a pickaxe or a crafting table, at least as far as I could see in its inventory from the conversation logs. A better highlight of this is the comparison of completed tasks to failed tasks:

completed tasks: ["Mine 1 wood log"]

failed tasks: ["Mine 1-2 dirt blocks", "Mine 3-4 gravel blocks", "Mine 1 diorite block", "Mine 2-3 copper ore blocks", "Mine 1-2 diorite blocks", "Mine 1-2 stone blocks", "Mine 1-2 stone blocks", "Mine 1-2 copper ore blocks", "Mine 1-2 diorite blocks", "Mine 1-2 diorite blocks", "Mine 1-2 diorite blocks"]

I think the decision space it tries to progress through is too constrained to high-level objectives, and it fails at compositionality, that is, at using smaller tasks to build up a skill library. Without looking at the specific prompts used, it's hard to tell whether this is something the Voyager creators overlooked because GPT-4 could simply do it, or whether prompt engineering aimed at composition could improve llama3-8b's chances.

For comparison, here are llama3-70b's tasks after 20 iterations:

completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs"]

failed tasks: ["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe"]

So while llama3-70b keeps failing to craft a wooden pickaxe (I'm not sure that will be the case forever, so I'll keep running it), the key difference is that llama3-70b fails when trying to make a pickaxe, whereas llama3-8b fails to even try. It's a subtle difference, but I think it highlights well how the models differ in their capacity for task compositionality.

[If applicable] Please provide the GPT conversations that are printed each round.

Instead, I will note that both llama3-8b and, seemingly, llama3-70b fail in biomes where oak logs are not immediately available. Even when there are, say, spruce or jungle logs in nearby blocks, they still only write code for oak logs.
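
One way to picture the missing generalization is to match any log type rather than hard-coding oak. The snippet below is only a Python illustration of that idea (Voyager's generated skills are JavaScript); find_log_blocks is a hypothetical helper, and the block names assume Minecraft's usual "<wood>_log" naming, as in the jungle_log task that appears in this thread.

```python
def find_log_blocks(nearby_blocks):
    """Return every nearby block whose name ends in '_log', in original order."""
    return [name for name in nearby_blocks if name.endswith("_log")]


# Example surroundings in a biome without oak trees (made-up data).
nearby = ["grass_block", "spruce_log", "dirt", "jungle_log", "spruce_leaves"]

logs = find_log_blocks(nearby)
target = logs[0] if logs else None  # None means no log of any kind is nearby
print(target)
```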

The completed and failed tasks above are from relatively simple starting biomes such as forests.

I'll try to update if anything spectacular happens before or around the 100-iteration point and, if Groq will let me, at the 500- and 1000-iteration points on llama3-70b.

This issue is simply a means of saving time for anyone thinking of doing something similar.

Update at ~40 iterations:

completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel"]

failed tasks: ["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Mine 3 stone blocks"]
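
Because the failed-task list repeats the same task many times, a quick tally makes the pattern easier to read. The sketch below counts the ~40-iteration failed tasks above; the normalize helper is an assumption made here so that "wooden_pickaxe" and "wooden pickaxe" are treated as the same task.

```python
from collections import Counter

# Failed tasks at ~40 iterations, copied from the list above.
failed_tasks = [
    "Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe",
    "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door",
    "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel",
    "Craft a wooden sword", "Mine 3 stone blocks",
]


def normalize(task):
    """Lowercase and replace underscores so spelling variants are merged."""
    return task.lower().replace("_", " ")


tally = Counter(normalize(task) for task in failed_tasks)
for task, count in tally.most_common():
    print(f"{count:2d}  {task}")
```

The wooden pickaxe accounts for 5 of the 11 failures, which matches the observation that the run was stuck on that one recipe.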

TimeLordRaps commented 2 months ago

It seems to fail at crafting a wooden pickaxe because of this context, whose recipe is wrong (a crafting table is a 3x3 grid, so there is no fourth column; the actual recipe is three planks across the top row and two sticks below them in the middle column):

Context: Question: How to craft a wooden pickaxe in Minecraft? Answer: To craft a wooden pickaxe in Minecraft, you need to arrange the following items in a crafting table: 3 wooden planks in a diagonal line from top-left to bottom-right, and 2 sticks in the middle row, one in the second column and one in the fourth column. This will give you a wooden pickaxe.

TimeLordRaps commented 2 months ago

Eventually figured out the wooden pickaxe. I'm going to keep running for as many iterations as I can, until I keep getting hit with rate limits.

Iteration 108: Completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel", "Obtain 3 birch logs", "Mine 3 dirt blocks", "Mine 1 coal ore", "Mine 1 copper ore", "Mine 3 cobblestone blocks", "Obtain 2 oak logs", "Mine 3 stone blocks", "Kill 1 salmon"]

Failed Tasks: ["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Obtain 3 cobblestone blocks", "Craft a wooden pickaxe", "Mine 1 stone block", "Smelt 4 copper ore", "Smelt 4 copper ore", "Smelt 1 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Obtain a jungle_log", "Smelt 4 copper ore", "Craft a stone pickaxe"]

TimeLordRaps commented 2 months ago

iteration 272: Completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel", "Obtain 3 birch logs", "Mine 3 dirt blocks", "Mine 1 coal ore", "Mine 1 copper ore", "Mine 3 cobblestone blocks", "Obtain 2 oak logs", "Mine 3 stone blocks", "Kill 1 salmon", "Equip a stone pickaxe", "Mine 3 coal ore", "Obtain 2 birch logs", "Mine 3 birch logs", "Place a chest", "Craft 4 birch planks", "Mine 5 dirt blocks"]

Failed tasks: ["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Obtain 3 cobblestone blocks", "Craft a wooden pickaxe", "Mine 1 stone block", "Smelt 4 copper ore", "Smelt 4 copper ore", "Smelt 1 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Obtain a jungle_log", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Craft a stone pickaxe", "Kill 1 squid", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone sword", "Craft a stone pickaxe", "Smelt 1 copper ore", "Smelt 4 copper ore", "Smelt 3 copper ore", "Cook 1 salmon", "Mine 1 iron ore", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a stone axe", "Craft a wooden pickaxe", "Smelt 4 copper ore", "Craft a wooden pickaxe", "Obtain a jungle log", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 1 birch log", "Kill 1 skeleton", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack"]

TimeLordRaps commented 2 months ago

At around 320 iterations, it continuously runs into rate limits:

Completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel", "Obtain 3 birch logs", "Mine 3 dirt blocks", "Mine 1 coal ore", "Mine 1 copper ore", "Mine 3 cobblestone blocks", "Obtain 2 oak logs", "Mine 3 stone blocks", "Kill 1 salmon", "Equip a stone pickaxe", "Mine 3 coal ore", "Obtain 2 birch logs", "Mine 3 birch logs", "Place a chest", "Craft 4 birch planks", "Mine 5 dirt blocks", "Mine 1 cobblestone block", "Craft 1 Birch Fence"]

Failed tasks:

["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Obtain 3 cobblestone blocks", "Craft a wooden pickaxe", "Mine 1 stone block", "Smelt 4 copper ore", "Smelt 4 copper ore", "Smelt 1 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Obtain a jungle_log", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Craft a stone pickaxe", "Kill 1 squid", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone sword", "Craft a stone pickaxe", "Smelt 1 copper ore", "Smelt 4 copper ore", "Smelt 3 copper ore", "Cook 1 salmon", "Mine 1 iron ore", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a stone axe", "Craft a wooden pickaxe", "Smelt 4 copper ore", "Craft a wooden pickaxe", "Obtain a jungle log", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 1 birch log", "Kill 1 skeleton", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 1 Netherrack block", "Mine 5 Netherrack blocks", "Mine 1 Netherrack block", "Craft a Birch Fence", "Craft a stone pickaxe", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Craft a wooden axe", "Craft a Birch Door", "Craft a stone pickaxe", "Craft a Birch Door", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Mine 5 gravel blocks", "Mine 1 Netherrack block", "Craft a stone pickaxe", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Craft a stone axe", "Open the chest at (24, 72, -5)", "Mine 5 cobblestone blocks", "Mine 5 cobblestone blocks", "Open the chest at (24, 72, -5)", "Mine 5 Netherrack blocks", "Mine 5 Netherrack blocks", "Mine 1 iron ore", "Open the chest at (24, 72, -5)", "Mine 5 Netherrack blocks", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Craft a stone pickaxe", "Mine 5 Netherrack blocks", "Open the chest at (24, 72, -5)", "Craft a stone pickaxe"]
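
Rate limits like the ones that stopped this run are often softened by retrying with exponential backoff. The sketch below is a generic illustration and not something taken from the run: RateLimitError is a hypothetical stand-in for whatever exception the provider's client raises, and call_with_backoff, flaky_call, and the delay values are invented for the example.

```python
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for a provider's rate-limit error."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo with a fake call that is rate limited twice before succeeding.
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimitError("too many requests")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

The sleep function is injectable so the demo records the delays rather than waiting; a real loop would leave the default time.sleep in place.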

I'm currently fine-tuning llama-3, and then phi-3, on all the code and descriptions produced by gpt-4 in the longest Voyager checkpoint (255 iterations) and in the three trials, just to see whether either can perform better with fine-tuning.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 4 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

SilkePilon commented 1 day ago

@TimeLordRaps I'm also working on an Ollama version. Is there any way you can share your code? I can't seem to get mine past the text parsing.

TimeLordRaps commented 1 day ago

https://github.com/TimeLordRaps/Voyager-diff-files I had to go through and adapt it to use OllamaEmbeddings, but otherwise these allow me to run Ollama models, including embedding models. These are more or less what I used for all my experiments. Also, for future viewers: my prompts have been edited quite a bit, so I suggest trying the default prompts first and using mine as a last resort.

SilkePilon commented 13 hours ago

Thanks! It worked for some time (it's writing non-working code, though), but now I'm getting only "list index out of range". Do you have any idea what this issue is?

TimeLordRaps commented 10 hours ago

I do remember getting that with llama3. I had a run going all night, but my cat unplugged it, so that's fun. I'll look through the logs to see if it ever happened and whether I can figure out why. I believe it has something to do with the model returning multiple code blocks, or no code blocks, because the code tries to grab only the first one.
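
A defensive version of that parsing step might look like the sketch below. It is not Voyager's actual parser: extract_first_code_block is a hypothetical helper, and the regex assumes the model wraps its code in Markdown-style triple-backtick fences. The point is that indexing the first match without checking raises IndexError ("list index out of range") when the reply contains no fenced block.

```python
import re

FENCE = "`" * 3  # a triple backtick, built this way so the example stays readable
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_first_code_block(reply):
    """Return the first fenced code block in reply, or None if there is none."""
    blocks = CODE_BLOCK.findall(reply)
    # Taking blocks[0] from an empty list is what raises "list index out of range".
    if not blocks:
        return None
    return blocks[0].strip()


no_code = "I would mine some wood first."
two_blocks = (
    f"Here is a helper:\n{FENCE}javascript\nasync function a() {{}}\n{FENCE}\n"
    f"And the main function:\n{FENCE}javascript\nasync function b() {{}}\n{FENCE}"
)

print(extract_first_code_block(no_code))
print(extract_first_code_block(two_blocks))
```

Returning None lets the caller retry the prompt or log the raw reply instead of crashing. When a reply holds several blocks, only the first is returned here, which may not be the one that contains the main function.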