astroinfo-hacks / 2023-galaxy-zoo-llm

Train a large language model on galaxy zoo public conversations
MIT License
0 stars 0 forks source link

Transform the conversations into a human-gpt conversation format #4

Open errai34 opened 1 year ago

errai34 commented 1 year ago

Goal is to create_dataset.py, where we take the dataframe and the images from scrap_notes.py and create a Dataset consisting of:

[ { "id": "000000033471", "image": "000000033471.jpg", "conversations": [ { "from": "human", "value": "What are the colors of the bus in the image?\n" }, { "from": "gpt", "value": "The bus in the image is white and red." }, { "from": "human", "value": "What feature can be seen on the back of the bus?" }, { "from": "gpt", "value": "The back of the bus features an advertisement." }, { "from": "human", "value": "Is the bus driving down the street or pulled off to the side?" }, { "from": "gpt", "value": "The bus is driving down the street, which is crowded with people and other vehicles." } ] } ]

errai34 commented 1 year ago

We created two scripts, get_descriptions.py and get_conversations.py which use two models (text-davinci-003) and (gpt-3.5-turbo) to a) answer the question "what can you see in this image?" and b) generate 2 Qs and A from the Galaxy Zoo conversations.

1) We also need to "upgrade" the prompts to the right format.

2) Merge these two files into a master file that takes as input the different prompts and the model name.