Generate 100 stories from a set of prompts

jilltxt commented 1 year ago

We need to generate a lot of short stories. Here are prompts to use. Generate 100 stories for each nationality, cultural group or language.

Basic structure

Write a 50 word plot summary for a potential [nationality or cultural group] children's novel.

Include one sample of the prompt with NO nationality or cultural group ("Write a 50 word plot summary for a potential children's novel.") so we can compare to this as a default.

Finally, compile all the sets of 100 stories into a combined file titled GPTstories.csv and upload it to the /data folder.

Codebook (variables and explanations for `GPTstories.csv`):

prompt: The prompt used, taking the format "Write a 50 word plot summary for a potential [nationality or cultural group] children's novel."
reply: The model's generated response to the prompt.
date: The date that the response was generated. As the models are regularly updated, this can be important information.
model name
temperature: The temperature setting. We used 1 throughout, which is the default setting, as we wanted to test the "default" mode of the LLM. Other researchers may wish to use other temperature settings.
language: The language the prompt was written in, using the ISO 639-2 code. The more common 639-1 code was not used because it does not include Southern Sami and Lule Sami. (Note: ask a librarian whether the EU standard is better - identical codes?) The language codes used in the dataset are (if we find translators for the Sami languages):
- eng = English
- nob = Norwegian bokmål
- non = Norwegian nynorsk
- sma = Southern Sami
- sme = Northern Sami
- smj = Lule Sami
- fra = French
- deu = German
- aka = Akan
- isl = Icelandic
country: If the prompt refers to a nationality, the ISO 3166 code for the name of the country referred to is stated here. For England, Northern Ireland, Scotland, Wales, extended codes are used (GB-ENG, GB-NIR, GB-SCT, GB-CYM following UK guidelines). If the prompt refers to a cultural group (e.g. Norwegian-American) this field will be NA.] See #5 for which countries to include.
- India - 1,393,409,038
- United States - 332,915,073
- Pakistan - 225,199,937
- Nigeria - 211,400,708
- Philippines - 111,046,913
- United Kingdom - 68,207,116
- Tanzania - 61,498,437
- South Africa - 60,041,994
- Kenya - 54,985,698
- Canada - 38,067,903
- Australia - 25,788,215
- Liberia - 5,180,203
- Ireland - 4,982,907
- New Zealand - 4,860,643
- Jamaica - 2,973,463
- Trinidad and Tobago - 1,403,375
- Guyana - 790,326
- Scotland
- Wales
- England
- Northern-Ireland
- culture: [e.g. American Indian - NA if a country rather than a culture]. Use the following cultures (see #6 for the rationale for this sampling strategy)
- Native American
- Asian-American
- African-American
- Native Hawaiian
- White American
- Hispanic
- Roma
- Afro-European
- European Muslim
- White European
- Akan
- Sámi
- Indigenous Australian

Different language versions:

prompt	language	country	culture
Skriv et sammendrag på 50 ord av en tenkt barnebok.	nob	NA	NA
Skriv et sammendrag på 50 ord av en tenkt norsk barnebok.	nob	NO	NA
Skriv eit samandrag på 50 ord av ei tenkt barnebok.	non	NA	NA
Skriv eit samandrag på 50 ord av ei tenkt norsk barnebok.	non	NO	NA
Écrivez une proposition de synopsis de 50 mots pour un livre pour enfants.	fra	NA	NA
Écrivez une proposition de synopsis de 50 mots pour un livre français pour enfants.	fra	FR	NA
Tjála tjoahkkájgæsos 50 báhko usjudit mánájromádna	smj	NA	NA
Tjála tjoahkkájgæsos 50 báhko usjudit sáme mánájromádna	smj	NA	Sami
Skrifaðu 50 orða samantekt af ímyndaðri skáldsögu fyrir börn.	isl	NA	NA
Skrifaðu 50 orða samantekt af ímyndaðri íslenskri skáldsögu fyrir börn.	isl	IS	NA
			.

(Note: the country code for Namibia is NA which means missing data.... We don't have Namibia in our dataset so it's OK (?) but yikes.)

(Edit 11.06.23: add "potential" to the prompts since the API generates summaries of existing novels if you don't, even though the chat interface generates new plots. See discussion below. Also set temperate to 1]

We know GPT is trained mostly on English language, so try English language cultures first:

Write a 50 word plot summary for an American children's novel. Write a 50 word plot summary for a British children's novel. Write a 50 word plot summary for a English children's novel. Write a 50 word plot summary for a Scottish children's novel. Write a 50 word plot summary for a Welsh children's novel. Write a 50 word plot summary for a Northern Irish children's novel. Write a 50 word plot summary for a Irish children's novel. Write a 50 word plot summary for a Canadian children's novel. Write a 50 word plot summary for an Australian children's novel. Write a 50 word plot summary for an New Zealand children's novel. (Note: there are actually 88 countries where English is an official, administrative or cultural language, so we'll need to think about sampling here - but let's try some prompts first.)

Try the prompt in Norwegian, French, German (and Ghanaian?)

Écrivez une proposition de synopsis de 50 mots pour un livre pour enfants. Skriv et sammendrag på 50 ord av en tenkt barnebok. Skriv eit samandrag på 50 ord av ei tenkt barnebok.

[ ] Jill tries to find people to translate to Sámi languages

hermannwi commented 1 year ago

How many stories do we want for each prompt? Edit: I also need push access in order to upload the code.

jilltxt commented 1 year ago

100 stories for each prompt, please. I'll add that info to the first post in this issue, thanks for asking, @hermannwi I think I have changed the team's access to Write - I thought it already was but I guess it was set to Read only? Let me know if it didn't work!

hermannwi commented 1 year ago

It seems to work now! I can start with generating the stories for the english language prompts. How should I think about structuring it? Do you want one csv file for just the english languages, or do you want everything in the same file? Let me know if you have preferences for how to structure it.

jilltxt commented 1 year ago

Good question. One big CSV file with the following column names (variable names) would be good!

Prompt - Story

I think it's best NOT to separate the different languages. Although if it's easier, you can make separate CSV files and we can merge them later, that's easy.

Then we could add a variable for Country (e.g. Norway, USA, Australia) and maybe Culture (African-American, etc) later, the information is actually in the prompt so that's easy to do in R or Python.

jilltxt commented 1 year ago

Actually it would be great to add two variables to help with documentation: the date the story was generated and the version of GPT that was used. (e.g. 3.5). I guess these could be added later but we'd have to remember to do it pretty soon after creating the data file or we'll forget. So:

Prompt - Story - Date - GPTversion

hermannwi commented 1 year ago

I seemed to have run into some problems with the API key. Could you generate a new one and send it via mail?

jilltxt commented 1 year ago

Yes, I just did. The old one was disabled because it was in code uploaded to GitHub - it’s great that they do that really and now we know :)

hermannwi commented 1 year ago

My bad! I'll have to have it in a different file and import it to the program.

hermannwi commented 1 year ago

American_stories.csv Here are 100 American stories. Does the file look alright?

edit: I had to rewrite some of the code so it didn't waste money if it ran into an error, so I also changed the formatting a bit cause I wasn't sure if using the standard "," as a delimiter would be annoying when analyzing.

jilltxt commented 1 year ago

Thank you! This is great! I notice that many (all?) of these describe actual books, mostly American books (Anne of Green Gables is Canadian, The Secret Garden is British. I actually want plots for POTENTIAL books, so I might have to play around with prompts a bit to see.

But this means that you've found a basic method for getting these! Hooray! Thank you!!!

hermannwi commented 1 year ago

A lot of them are real books. Actually at some point during the testing I had written the prompt slightly wrong, and then all of them were for famous novels. I think just a small tweak is needed to avoid it. Maybe adding the word potential is enough.

hermannwi commented 1 year ago

Here are 25 stories where I included the word "potential" in the prompt American_stories.csv

jilltxt commented 1 year ago

That seems to work, and the results are closer to what I was getting with the original plot in the chat interface.

I asked ChatGPT whether any of those plots were published books, and it says not to its knowledge (link to that chat - scroll down for my reformulated question).

I wonder if changing the temperature would also help? I saw an article saying that the temperature is 0.7 on the chat interface, but the default is 0.3 in the API. Lower temperature means it is less "creative" so that might also make it lean towards summarising existing books. Your script doesn't include any mention of temperature so I assume it is using the default. From a site about temperature:

For transformation tasks (extraction, standardization, format conversion, grammar fixes) prefer a temperature of 0 or up to 0.3. For writing tasks, you should juice the temperature higher, closer to 0.5. If you want GPT to be highly creative (for marketing or advertising copy for instance), consider values between 0.7 and 1.

So probably we do want a higher temperature for it to suggest new plot summaries?

Could you please try changing the temperate settings? Maybe set it explicitly to 0.3 (to check if that seems to be what the first batch had), then to 0.5, 0.7, 0.8, 0.9 and 1.0?

hermannwi commented 1 year ago

Definitely! I was already going to ask you about the temperature, and if the API has different temperature it makes sense to play with it. Do you want me to use the original prompt?

hermannwi commented 1 year ago

So I tried the different temperatures with both the original prompt and the prompt that contains "potential". The results are bit interesting.

potenital_american_temp_05.csv potential_american_temp_1.csv potential_american_temp_03.csv potential_american_temp_07.csv potential_american_temp_09.csv

jilltxt commented 1 year ago

Thank you! This is kind of weird: the temperature doesn't seem to change the prompt much, and the higher temperature even seems to normalise things even more - american_temp_9.csv looks like 50% of the generated stories are A Secret Garden. I wonder why? Is it building on our previous requests and normalising based on that? Maybe it has to be reset or something?

jilltxt commented 1 year ago

Maybe we should be defining how it is supposed to act. Like this (but not setting it to be a pirate):

[
  {
    "role": "system",
    "content": "You are a 1700s pirate with an exagerated UK westcountry accent"
  },
  {
    "role": "user",
    "content": "Introduce yourself"
  }
]

Ted Underwood does really interesting work in digital humanities, and describes using the OpenAI API for literary analysis of short bits of text. (Here is his GitHub repo for that project, and here is the exact code he uses with the initial prompts.) It doesn't look like he sets an initial prompt ("you are a pirate") but he instead puts in examples of how you want the model to respond to particular user input.

We are trying to figure out what ChatGPT/GPT does "natively" so we don't really want to tell it to act like a pirate or give it model examples so I'm not sure whether to use this.

I guess we could try telling it "You are a writer for a publisher of children's books." I don't know if that would make a difference or even be very useful methodologically since we want to test out its default.

[
  {
    "role": "system",
    "content": "You are a writer for a publisher of children's books."
  }
]

hermannwi commented 1 year ago

I've made the code such that it doesn't save the previous messages. So it shouldn't have any context. Maybe there is some underlying memory? Maybe I'm basically training it by asking the same question over and over? It's definitely getting more random from 0.3 to 1 though. Edit: did you look at the ones with the prompt with "potential"?

hermannwi commented 1 year ago

We could define how it's supposed to act, but we might start to get too specific. Also, we wouldn't know the consequences in how the model acts and why, which might make it difficult to say anything consice about the results?

jilltxt commented 1 year ago

We could define how it's supposed to act, but we might start to get too specific. Also, we wouldn't know the consequences in how the model acts and why, which might make it difficult to say anything consice about the results?

Yes, it's probably best to just keep going with the current prompts. It looks as though including the word "potential" helps.

I'd like to discuss this with a couple of colleagues who might have ideas, but my feeling right now is to insert "potential" into the prompts, use temperature 0.7 since that's apparently close to the chat interface - although I can't find any authoritative-looking statement about that.

hermannwi commented 1 year ago

According to the openai website the temperature variable defaults to 1: https://platform.openai.com/docs/api-reference/chat/create#chat/create-temperature

hermannwi commented 1 year ago

Here is also a thread discussing the differences between ChatGPT and GPT API: https://community.openai.com/t/openai-api-vs-chatgpt/49943

jilltxt commented 1 year ago

According to the openai website the temperature variable defaults to 1: https://platform.openai.com/docs/api-reference/chat/create#chat/create-temperature

I saw another website linking to that and saying the information is on that page, but I can't actually see the information on the page? Am I just not looking in the right place? If the default is actually 1 on the chat interface, let's use the same temperature.

hermannwi commented 1 year ago

jilltxt commented 1 year ago

Thanks! Another thing, are we using the actual ChatGPT API or just gpt? I found this paper where they generated 1008 jokes and found there were basically only 25 jokes among them. Their code looks like this: https://github.com/joke_prompt_1.py

hermannwi commented 1 year ago

In my understanding, ChatGPT refers to the web app that uses the GPT models. The GPT API refers to the API that gives us access to the GPT models. You could say that ChatGPT is OpenAi's own implementation of the models.

jilltxt commented 1 year ago

@hermannwi and @Tm-ui I updated the top section of this discussion now that we know pretty much what we want the final dataset to look at. Please use the variables as described above. I added country and culture as separate variables because I think this will make the data analysis easier. We can also add these after the initial files are generated - that may be easier.

hermannwi commented 1 year ago

I'm guessing all these will be in English?

jilltxt commented 1 year ago

Yes - all prompts should be in English except the ones specifically listed in another language.

Also, we are doing “Native American”, not American Indian.

hermannwi commented 1 year ago

Different letters within the French and Norwegian alphabet seems to be assigned weird symbols within the csv file. Is this an issue? EDIT: Also a lot of the replies to the nynorsk prompt is in bokmål.

hermannwi commented 1 year ago

I just uploaded the (almost finished) dataset to the data folder. A couple of notes:

We are still missing the Akan language
In excel, the language specific letters gets converted to different symbols. Don't know if this needs to be corrected somehow
A lot of the Sami replies and some of the Icelandic replies contain a lot of newlines. Don't know if this needs to be cleaned
Lastly, I wasn't sure if I were to include the US country code for the different American cultures. They country variable for these are currently set to NA

Tm-ui commented 1 year ago

Prompt for Akan-Twerɛ nsɛmfua 50 asɛmti mu nsɛm tiawa ma Ghana mmofra abasɛm. Language Code: ak\aka culture: NA

Prompt for Chinese-为中国儿童小说写一篇50字的情节提要 Language Code-zh Culture:NA

Tm-ui commented 1 year ago

I just uploaded the (almost finished) dataset to the data folder. A couple of notes:

We are still missing the Akan language

In excel, the language specific letters gets converted to different symbols. Don't know if this needs to be corrected somehow

A lot of the Sami replies and some of the Icelandic replies contain a lot of newlines. Don't know if this needs to be cleaned

Lastly, I wasn't sure if I were to include the US country code for the different American cultures. They country variable for these are currently set to NA

I think we should verify the encoding used by the API, or escape non English and French prompts as strings

hermannwi commented 1 year ago

I just uploaded the updated dataset containing Akan and Chinese.

I wasn't sure if the country was included in the prompts so I set both to NA.
The non-english characters displays incorrectly in Excel, but not in vscode, or if it is opened in Google Sheets. There seem to be some settings in Excel that are causing this.

jilltxt commented 1 year ago

Thanks! The encoding seems to be correct UTF-8, so that's fine. But you have to specify this when importing into Excel or Google Sheets. Since it's semi-colon separated it imports directly into my (Norwegian) Excel, but with the wrong encoding. When I import as CSV and specify semicolon separated and UTF-8, it's correct.

I've just skimmed through it but it looks great! There were a few small issues:

The "default" prompt doesn't have a language (language is NA) - should be set to "eng"
Akan language is set to "ak/aka" and should be just "aka"

If you have time to fix this today, @hermannwi, great - otherwise I can do it later.

I also renamed the filename to just GPT_stories :)

hermannwi commented 1 year ago

Okay that makes sense! I will fix the issues and upload again.

EDIT: Just uploaded the updated file.

MachineVisionUiB / GPT_stories