Open andreifoldes opened 2 months ago
The code I'm using is below, but I have observed this in another survey script as well:
import argparse
parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('--subject_num', type=int, help='number of subjects')
args = parser.parse_args()
subject_num = args.subject_num
#%%
import os
os.environ['ANTHROPIC_API_KEY'] = 'your_key_here'
os.environ['DEEP_INFRA_API_KEY'] = 'your_key_here'
os.environ['GOOGLE_API_KEY'] = 'your_key_here'
os.environ['OPENAI_API_KEY'] = 'your_key_here'
#%%
import numpy as np
from edsl import Agent, QuestionMultipleChoice, QuestionFreeText, QuestionLinearScale, Survey
#%% main fun
import pandas as pd
def analyze_games(file_path, top_n):
    # Load the CSV file into a DataFrame
    df = pd.read_csv(file_path)
    # Calculate the number of games for each unique value in the 'Genres' column
    genres_counts = df['Genres'].value_counts()
    # Calculate the number of games for each unique value in the 'Categories' column
    categories_counts = df['Categories'].value_counts()
    # Sample the top N values for 'Genres'
    top_genres = genres_counts.nlargest(top_n)
    # Convert the top genres to a list
    top_genres_list = top_genres.index.tolist()
    return top_genres_list
# Example usage
file_path = r"...\synthetic_data\source_data\unzip\games.csv"
top_n = 50
top_genres_list = analyze_games(file_path, top_n)
# print("Top Genres:")
# print(top_genres_list)
#%%
def sample_age():
    age_distribution = {
        18: 0.035,
        19: 0.036,
        20: 0.037,
        21: 0.038,
        22: 0.039,
        23: 0.040,
        24: 0.041,
        25: 0.042,
        26: 0.043,
        27: 0.044,
        28: 0.045,
        29: 0.046,
        30: 0.047,
        31: 0.048,
        32: 0.049,
        33: 0.050,
        34: 0.051,
        35: 0.052,
        36: 0.053
    }
    # Convert the dictionary to lists for numpy's random.choice
    ages = list(age_distribution.keys())
    probabilities = list(age_distribution.values()) # Define probabilities here
    # Normalize probabilities
    total_probability = sum(probabilities)
    probabilities = [p / total_probability for p in probabilities]
    # Sample an age based on the probabilities
    return np.random.choice(ages, p=probabilities)
# function that uniformly samples from US states
def sample_state():
    states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"]
    return np.random.choice(states)
# function that samples Male / Female / Non-binary / Prefer not to say using fixed weights
import random
def sample_gender():
    options = ["Male", "Female", "Non-binary", "Prefer not to say"]
    weights = [0.4, 0.4, 0.1, 0.1] # Adjust weights as needed
    return random.choices(options, weights=weights, k=1)[0]
def sample_employment():
    options = [
        "Working full-time",
        "Working part-time",
        "Not currently employed",
        "A homemaker or stay-at-home parent",
        "Student",
        "Retired",
        "Other"
    ]
    # Adjust weights as needed to skew the sampling
    weights = [0.3, 0.2, 0.15, 0.1, 0.1, 0.1, 0.05]
    return random.choices(options, weights=weights, k=1)[0]
def sample_edu_level():
    options = [
        "Some Primary",
        "Completed Primary School",
        "Some Secondary",
        "Completed Secondary School",
        "Vocational or Similar",
        "Some University but no degree",
        "University Bachelors Degree",
        "Graduate or professional degree (MA, MS, MBA, PhD, etc)",
        "Prefer not to say"
    ]
    # Adjust weights as needed to skew the sampling
    weights = [0.05, 0.1, 0.15, 0.2, 0.1, 0.1, 0.15, 0.1, 0.05]
    return random.choices(options, weights=weights, k=1)[0]
import random
gamer_personas = {
"The Hardcore Competitor": "Lives for the challenge, strives for victory, and invests heavily in gear and practice. (e.g., Esports players, ranked players)",
"The Casual Gamer": "Plays for relaxation and fun, prefers less demanding games, and plays in short bursts. (e.g., Mobile game players, puzzle game enthusiasts)",
"The Completionist": "Must achieve 100% completion in every game, meticulously explores every corner and collects every item.",
"The Story-Driven Gamer": "Primarily interested in narrative and character development, enjoys immersive and emotional experiences.",
"The Innovator": "Loves experimenting with new mechanics and genres, enjoys early access games and testing new features.",
"The Social Butterfly": "Primarily plays for the social interaction, enjoys multiplayer games and team-based activities.",
"The Achievement Hunter": "Driven by unlocking achievements and trophies, enjoys the sense of accomplishment and bragging rights.",
"The Explorer": "Enjoys open world games and discovering hidden secrets, prefers exploration over combat or objectives.",
"The Strategist": "Loves complex strategy games and meticulous planning, enjoys the challenge of outsmarting opponents.",
"The Role-Player": "Immerses themselves in their chosen character, enjoys making choices and influencing the story.",
"The Tech Enthusiast": "Fascinated by the technical aspects of gaming, enjoys pushing hardware to its limits and modding games.",
"The Retro Gamer": "Prefers classic games and consoles, enjoys the nostalgia and simplicity of older titles.",
"The Collector": "Collects physical copies of games, consoles, and merchandise, enjoys the tangible aspect of gaming history.",
"The Speedrunner": "Focuses on completing games as quickly as possible, enjoys the challenge and optimization of gameplay.",
"The Content Creator": "Creates videos, streams, or other content related to gaming, enjoys sharing their passion with others.",
"The Problem Solver": "Enjoys puzzle games and challenging brain teasers, enjoys the satisfaction of overcoming difficult obstacles.",
"The Artist": "Enjoys games with strong visual aesthetics or creative elements, appreciates the artistry and design of games.",
"The Escapist": "Uses gaming as a way to escape from reality and stress, enjoys immersive worlds and engaging gameplay.",
"The Novice Gamer": "New to gaming and exploring different genres, enjoys learning new mechanics and discovering their preferences.",
"The Mobile-Only Gamer": "Exclusively plays games on mobile devices, enjoys the convenience and accessibility of mobile gaming."
}
def random_gamer_persona():
    """
    Randomly samples a gamer persona from the dictionary and returns its description.
    """
    persona = random.choice(list(gamer_personas.keys()))
    description = gamer_personas[persona]
    return description
study_intro = """This document outlines a research study on gaming behavior conducted by the Oxford Internet Institute in collaboration with Nintendo and Microsoft. The study involves collecting anonymized gameplay data from consenting Nintendo, Xbox, and Steam users to understand their gaming habits. Participation is voluntary, and users can withdraw at any time. The study emphasizes the protection of user privacy, outlining the methods used to de-identify and secure data."""
def sample_type_of_day():
    options = [
        "Regular work day",
        "Regular day off",
        "Weekend",
        "Holiday",
        "Vacation day"
    ]
    weights = [0.4, 0.3, 0.1, 0.1, 0.1] # Adjust weights as needed
    return random.choices(options, weights=weights, k=1)[0]
# %%
agent = Agent(
traits = {
"persona":f"You are US citizen who plays video games on Nintendo, Xbox, and Steam. You are about to participate in a study that seeks to understand the gaming behavior of players on these platforms. Study info {study_intro}",
"age":f"{sample_age()}",
"gender":f"{sample_gender()}",
"location":f"{sample_state()}",
"employment":f"{sample_employment()}",
"education":f"{sample_edu_level()}",
"gamer_persona": f"{random_gamer_persona()}"
},
instruction = "Answer each question honestly with respect to your own personal views.",
)
#%% create a list of agent
def create_agents(N):
    agents = []
    for _ in range(N):
        agent = Agent(
            traits={
                "persona": f"You are US citizen who plays video games on Nintendo, Xbox, and Steam. You are about to participate in a study that seeks to understand the gaming behavior of players on these platforms. Study info {study_intro}",
                "age": f"{sample_age()}",
                "gender": f"{sample_gender()}",
                "location": f"{sample_state()}",
                "employment": f"{sample_employment()}",
                "education": f"{sample_edu_level()}",
                "gamer_persona": f"{random_gamer_persona()}",
                "type_of_day": f"{sample_type_of_day()}"
            },
            instruction="Answer each question honestly with respect to your own personal views."
        )
        agents.append(agent)
    return agents
# Example usage
agents_list = create_agents(subject_num)
# for agent in agents_list:
# print(agent.traits)
# %% Questions
# from edsl.questions import QuestionMultipleChoice, QuestionLinearScale, QuestionFreeText, QuestionNumerical
# # playedGames Block
# q_played0 = QuestionMultipleChoice(
# question_name="played24hr",
# question_text="In the last 24 hours, between 0 AM and 1 AM have you spent any time playing video games?",
# question_options=["Yes", "No"]
# )
# q_playedgenre0 = QuestionMultipleChoice(
# question_name="genre",
# question_text="What genre best describes the type of game played between between 0 AM and 1 AM?",
# question_options=top_genres_list
# )
# q_playedminutes0 = QuestionNumerical(
# question_name="sd_2_minute",
# question_text="How many minutes have you played between 0 AM and 1 AM?",
# min_value=0,
# max_value=600
# )
#%%
from edsl.questions import QuestionMultipleChoice, QuestionLinearScale, QuestionFreeText, QuestionNumerical
from edsl import Survey # Assuming Survey is imported from edsl.survey
# Assuming top_genres_list is defined elsewhere in your code
# List to store all questions
all_questions = []
# Generate questions for each hour
for hour in range(24):
    start_time = f"{hour:02d}:00"
    end_time = f"{(hour + 1) % 24:02d}:00"
    # Build each question directly; exec() is not needed to create one question per hour
    # Played games question
    all_questions.append(QuestionMultipleChoice(
        question_name=f"played{hour:02d}hr",
        question_text=f"In the last 24 hours, between {start_time} and {end_time} have you spent any time playing video games?",
        question_options=["Yes", "No"]
    ))
    # Genre question
    all_questions.append(QuestionMultipleChoice(
        question_name=f"genre{hour:02d}",
        question_text=f"What genre best describes the type of game played between {start_time} and {end_time}?",
        question_options=top_genres_list
    ))
    # Minutes played question
    all_questions.append(QuestionNumerical(
        question_name=f"minutes{hour:02d}",
        question_text=f"How many minutes have you played between {start_time} and {end_time}?",
        min_value=0,
        max_value=600
    ))
# Create the survey with all questions
survey = Survey(questions=all_questions)
# Add skip logic for each hour
for hour in range(24):
    # Skip genre question if "No" was selected for played question
    survey = survey.add_skip_rule(f"genre{hour:02d}", f"played{hour:02d}hr == 'No'")
    # Skip minutes question if "No" was selected for played question
    survey = survey.add_skip_rule(f"minutes{hour:02d}", f"played{hour:02d}hr == 'No'")
# # Example of how to access the survey
# print(f"Total number of questions in the survey: {len(survey.questions)}")
# print(f"First question in the survey: {survey.questions[0].question_text}")
# print(f"Last question in the survey: {survey.questions[-1].question_text}")
# print("Skip logic has been added to all genre and minutes questions.")
#%%
from edsl import Model
Model.available()
model = Model("gpt-4o-mini")
model.parameters['temperature'] = 0.7
#%%
survey_full_1= survey.set_lagged_memory(6)
results = survey_full_1.by(agents_list).by(model).run(progress_bar=True)
# get current date and time in str
from datetime import datetime
import argparse
now = datetime.now()
# make it filename friendly
now = now.strftime("%Y-%m-%d_%H-%M-%S")
results.to_pandas().to_csv(f"synthetic_data/data/steam_{now}.csv")
# %%
agent = Agent(
traits = {
"persona":f"You are US citizen who plays video games on Nintendo, Xbox, and Steam. You are about to participate in a study that seeks to understand the gaming behavior of players on these platforms. Study info {study_intro}",
"age":f"{sample_age()}",
"gender":f"{sample_gender()}",
"location":f"{sample_state()}",
"employment":f"{sample_employment()}",
"education":f"{sample_edu_level()}",
"gamer_persona": f"{random_gamer_persona()}"
},
instruction = "Answer each question honestly with respect to your own personal views.",
)
#%% create a list of agent
def create_agents(N):
    agents = []
    for _ in range(N):
        agent = Agent(
            traits={
                "persona": f"You are US citizen who plays video games on Nintendo, Xbox, and Steam. You are about to participate in a study that seeks to understand the gaming behavior of players on these platforms. Study info {study_intro}",
                "age": f"{sample_age()}",
                "gender": f"{sample_gender()}",
                "location": f"{sample_state()}",
                "employment": f"{sample_employment()}",
                "education": f"{sample_edu_level()}",
                "gamer_persona": f"{random_gamer_persona()}"
            },
            instruction="Answer each question honestly with respect to your own personal views."
        )
        agents.append(agent)
    return agents
# Example usage
agents_list = create_agents(subject_num)
# for agent in agents_list:
# print(agent.traits)
# %%
# sleep diary Block
q_sd_0 = QuestionMultipleChoice(
question_name="sd_0",
question_text="Which best describes today for you?",
question_options=[
"Regular work day",
"Regular day off",
"Weekend",
"Holiday",
"Vacation day",
"Other (please specify):"
]
)
# %% combine
#%%
from edsl import Model
Model.available()
model = Model("gpt-4o-mini")
model.parameters['temperature'] = 0.7
#%%
# survey_full_1= survey.set_lagged_memory(2)
survey_full_1= survey
results = survey_full_1.by(agents_list).by(model).run(progress_bar=True)
# get current date and time in str
from datetime import datetime
import argparse
now = datetime.now()
# make it filename friendly
now = now.strftime("%Y-%m-%d_%H-%M-%S")
results.to_pandas().to_csv(f"synthetic_data/data/steam_{now}.csv")
Thank you - this is super helpful. Without digging in yet, my guess is that there's some data w/ an unexpected encoding being read in; it's probably less likely that the model is generating something. We could add some kind of sanitizing / conversion check. We're on it! cc: @rbyh
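A minimal sketch of what such a sanitizing / conversion check could look like, assuming the goal is to make text safe for a legacy Windows code page. The helper name `sanitize_for_encoding` and the choice of `cp1252` as the target are assumptions for illustration, not part of edsl:

```python
def sanitize_for_encoding(text: str, encoding: str = "cp1252") -> str:
    """Replace every character that `encoding` cannot represent with '?'."""
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


# The emoji is not representable in cp1252, so it is replaced.
cleaned = sanitize_for_encoding("Great game \U0001F3AE!")
print(cleaned)
```

The round trip is lossy by design, so it suits a last-resort fallback better than a default code path.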
Thanks for sending! Could you please share a few mock rows of games.csv and I will run it on my end?
Because you are getting this error inconsistently, I agree that it is likely an issue with characters in the data being read in. I have seen a similar issue where someone was creating Scenario objects (inputs for question texts) that included { and } in some text entries. The data was being read in randomly, so the error was also appearing intermittently. In debugging, it was helpful to read in the data in small batches to pinpoint the offending text.
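The batch-wise debugging approach described above might look roughly like this for the encoding case. The function name `find_unencodable`, the `cp1252` target, and the sample rows are all hypothetical:

```python
def find_unencodable(texts, encoding="cp1252", batch_size=100):
    """Return (index, text) pairs for entries that `encoding` cannot encode."""
    offenders = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            # Fast path: the whole batch encodes cleanly.
            "".join(batch).encode(encoding)
        except UnicodeEncodeError:
            # Slow path: check each entry of the failing batch individually.
            for offset, text in enumerate(batch):
                try:
                    text.encode(encoding)
                except UnicodeEncodeError:
                    offenders.append((start + offset, text))
    return offenders


rows = ["Action", "RPG", "Indie \u2728", "Strategy"]  # made-up sample data
offenders = find_unencodable(rows, batch_size=2)
print([index for index, _ in offenders])
```

The same structure would work for other checks, such as looking for stray { or } characters, by swapping the encode call for a different test.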
Thanks again!
games.csv is data from this public repo: https://www.kaggle.com/datasets/fronkongames/steam-games-dataset
Given how that code is run (via def analyze_games(file_path, top_n)), selecting the top K entries from games.csv, that should be static. Also, in another script this games.csv dependency is completely absent and I still receive this exact same error intermittently. Happy to include that code as well if helpful.
Did you have this error with other models as well?
Yes, it seems so - I just tried with model = Model("gemini-pro") and model.parameters['temperature'] = 0.7:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 22768-22799: character maps to <undefined>
Another temporary solution: when multiple agents are run, could the results from the agents that succeed still go through even if one agent fails?
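One way to approximate this on the user side is to run each agent separately and collect failures instead of aborting. The sketch below is generic: `run_each` and `fake_run` are hypothetical helpers, and `run_one` stands in for whatever runs the survey for a single agent (whether edsl accepts one agent per call in this way is an assumption).

```python
def run_each(items, run_one):
    """Run `run_one` on every item, keeping successes and recording failures."""
    successes = []
    failures = []
    for index, item in enumerate(items):
        try:
            successes.append(run_one(item))
        except UnicodeEncodeError as exc:
            # Record which item failed and why, then keep going.
            failures.append((index, exc))
    return successes, failures


def fake_run(agent_name):
    # Hypothetical stand-in: fails for text that cp1252 cannot encode.
    return agent_name.encode("cp1252")


successes, failures = run_each(["agent_a", "agent_b \u2728", "agent_c"], fake_run)
print(len(successes), "succeeded;", len(failures), "failed")
```

Running agents one at a time gives up any batching the library does internally, so this is a stopgap rather than a fix.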
We think it's a Windows issue -- we're working on it!
Interesting! Will move the code to Linux and report back by tomorrow!
Yes, indeed, no error on Linux ^^
Super helpful, thanks for confirming.
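For anyone wanting to see the Windows-specific behavior without a Windows machine, the sketch below reproduces the same kind of 'charmap' error by encoding with `cp1252` explicitly, then shows that writing with an explicit UTF-8 encoding succeeds. The sample text and file name are made up, and this only illustrates the general mechanism; where exactly the error arises inside edsl is not established in this thread.

```python
import os
import tempfile

text = "Score: 10/10 \u2728"  # the sparkles character is not in cp1252

# Encoding with a Windows-style legacy code page fails the same way on any OS.
try:
    text.encode("cp1252")
    message = ""
except UnicodeEncodeError as exc:
    message = str(exc)
print(message)

# Writing with an explicit UTF-8 encoding avoids the platform default.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="utf-8", newline="") as handle:
    handle.write(text + "\n")

with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

Setting the PYTHONUTF8=1 environment variable (Python's UTF-8 mode) is another option that may help on Windows; whether it resolves this particular error is untested here.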
Description:
While running surveys with edsl, I intermittently encounter a UnicodeEncodeError raised by the 'charmap' codec. This error does not occur consistently, making it difficult to pinpoint the exact cause. It seems to be related to encoding issues when processing the survey results.
Steps to Reproduce:
1. Create Agent objects with various traits.
2. Use a model (gpt-4o-mini) to run the survey with the agents.
3. Save the results with results.to_pandas().to_csv().
Expected Behavior:
The survey should run without any encoding errors and successfully save the results to the CSV file.
Actual Behavior:
Occasionally, a UnicodeEncodeError is raised during the execution, preventing the results from being saved.
Environment:
edsl version: 0.1.31
Python version: 3.12.2
Additional Information:
Possible Cause:
It's possible that the issue stems from the language model generating responses with characters that cannot be encoded using the default 'charmap' codec. This could be due to special characters or emojis present in the generated text.
Suggested Solution:
Handle the UnicodeEncodeError and potentially sanitize the responses before saving.
Please investigate this issue and provide a solution or workaround to ensure reliable survey execution and result saving.