Some questions of interest regarding the details of Prior training.

Shalev-Lifshitz / STEVE-1

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

https://sites.google.com/view/steve-1

Some questions of interest regarding the details of Prior training. #12

Closed: Zhoues closed this issue 4 months ago

Zhoues commented 10 months ago

In Appendix D.2 of the paper (the Prior Training section), I understood how STEVE-1 collected text-video pairs for training the prior. I am particularly interested in two points 😄 :

1. How can I obtain the 2,000 hand-labeled text examples and the 10,000 augmented text examples? I want to have the STEVE-1 agent try some tasks that were covered in training but are not among those 11 tasks.
2. How can I use MineCLIP to retrieve videos? Is there a script for this? I am also curious how the offset operation mentioned in the paper is implemented in practice.

Looking forward to your reply ❤️ @Shalev-Lifshitz

Shalev-Lifshitz commented 4 months ago

Hi there, apologies for the late reply! Unfortunately, for the prior dataset, the only data we have right now is what is downloaded here: https://github.com/Shalev-Lifshitz/STEVE-1/blob/main/download_weights.sh#L22C1-L22C102. It contains only the embeddings. If you'd like to test STEVE-1 on further tasks, you can see the paper's appendix, which contains some other tasks, or you could ask GPT-4 to generate some simple Minecraft text prompts given a few examples.

Regarding retrieval with MineCLIP, we did not release a script for this, but you can make use of the FAISS library. You would need to compute the MineCLIP embeddings for each frame in the dataset you are searching over, and then use FAISS to retrieve the most similar embeddings (see the FAISS documentation for more details).
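For readers who want a starting point, the retrieval step described above might look roughly like the following. This is only a sketch under stated assumptions, not the script used for the paper: it assumes the MineCLIP embeddings for the frames and for the text query have already been computed and are available as NumPy arrays, and the names `l2_normalize` and `retrieve_top_k` are hypothetical helpers introduced here. It uses cosine similarity (inner product over unit-normalized vectors) with an exact FAISS index when FAISS is installed, and falls back to a brute-force NumPy search otherwise. The offset operation from the paper is not covered.

```python
import numpy as np

try:
    import faiss  # optional dependency, commonly installed as "faiss-cpu"
except ImportError:
    faiss = None


def l2_normalize(x):
    """Return a float32, C-contiguous copy of x with unit-length rows."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def retrieve_top_k(query_embeds, frame_embeds, k=5):
    """Return (scores, indices) of the k most cosine-similar frames per query.

    query_embeds: array of shape (num_queries, dim), e.g. text embeddings.
    frame_embeds: array of shape (num_frames, dim), e.g. per-frame embeddings.
    Both are assumed to be precomputed; this function does not call MineCLIP.
    """
    queries = l2_normalize(query_embeds)
    frames = l2_normalize(frame_embeds)
    k = min(k, frames.shape[0])

    if faiss is not None:
        # Exact inner-product search; on unit vectors this is cosine similarity.
        index = faiss.IndexFlatIP(frames.shape[1])
        index.add(frames)
        return index.search(queries, k)

    # Fallback: brute-force similarity matrix, sorted in descending order.
    sims = queries @ frames.T
    idx = np.argsort(-sims, axis=1)[:, :k]
    return np.take_along_axis(sims, idx, axis=1), idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for real embeddings; the dimension is a placeholder.
    frame_embeds = rng.normal(size=(1000, 512))
    query = frame_embeds[42:43] + 0.01 * rng.normal(size=(1, 512))
    scores, indices = retrieve_top_k(query, frame_embeds, k=3)
    print(indices[0], scores[0])
```

The returned indices refer to rows of `frame_embeds`, so you would need to keep your own mapping from row index to video file and frame number. For very large datasets, an approximate FAISS index may be preferable to the exact one used here.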

artbelyaev0 commented 2 months ago

Hello! Did you get a dataset for training a custom VAE? If so, could you please share it?