expectedparrot / edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
https://docs.expectedparrot.com
MIT License

[ENH] Tutorial illustrating EDSL working with Qualtrics survey #870

Open andreifoldes opened 1 month ago

andreifoldes commented 1 month ago

Thank you for this awesome package.

I think it would benefit users a lot if there were an example of how survey responses could be generated for a given Qualtrics QSF/Word export file.

rbyh commented 1 month ago

We completely agree! We have a module, Conjure, that allows you to import survey data and recreate it in EDSL: https://docs.expectedparrot.com/en/latest/conjure.html. If you can export as a CSV, Conjure can already support that.

Conjure is very much a work in progress though--we'd love your suggestions/requests. Do you have any sample Qualtrics data files we can work with?
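For reference, the basic CSV flow looks roughly like this (a rough sketch based on the Conjure docs linked above; "survey_responses.csv" is a placeholder for your export, and method names may differ slightly by version):

from edsl.conjure import Conjure

# Sketch: recreate a survey, agents, and results from a CSV export.
# The file name here is a placeholder.
c = Conjure(datafile_name="survey_responses.csv")

survey = c.to_survey()       # the questions, as an EDSL Survey
agents = c.to_agent_list()   # one agent per respondent row
results = c.to_results()     # the original responses as EDSL Results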

andreifoldes commented 1 month ago

Ah, I see. So maybe one pipeline is simply to use Qualtrics's built-in feature to generate some synthetic data and feed that to the Conjure object?

There is one Qualtrics dataset that I'm working on, but it's longitudinal with multiple waves and complicated logic. I do believe one can access a sea of .qsf files at OSF. I was browsing it and this one looked interesting and easy, for example: https://osf.io/z9tnc

This resource is kinda cool because sometimes one can find the human responses in the repository as well, and then it's easy to compare the LLMs to humans :)

rbyh commented 1 month ago

You can use EDSL to simulate responses to your Qualtrics survey questions. If you have a list of the questions it is relatively straightforward to format them in EDSL either manually or by prompting ChatGPT to do it for you (include a link to EDSL docs in the prompt). I did this with Cooperative Election Study Common Content public survey data: https://docs.expectedparrot.com/en/latest/notebooks/ces_data_edsl.html
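For example, once a question is formatted, you can run it with an agent in just a few lines (the question and persona below are made up for illustration):

from edsl import Agent
from edsl.questions import QuestionMultipleChoice

# A made-up question and persona, just to show the mechanics.
q = QuestionMultipleChoice(
    question_name="trust_science",
    question_text="How much do you trust scientific research?",
    question_options=["A great deal", "Some", "Not much", "Not at all"],
)

agent = Agent(traits={"persona": "You are a skeptical mid-career journalist."})

results = q.by(agent).run()
results.select("trust_science").print()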

rbyh commented 1 month ago

Thanks for sharing that OSF link -- I'll create a new notebook for it :)

rbyh commented 1 month ago

Here's a notebook with the questions reformatted in EDSL. I included quick example code for adding agents and running the survey -- let me know if it's at all useful!

https://www.expectedparrot.com/content/f865ec42-4294-417c-8f75-40d528abf1fe

andreifoldes commented 4 weeks ago

Amazing work, thank you so much. So when using an LLM to "transform" the .qsf file to EDSL format, how much hand-holding is required? Can you do it in one prompt and insert it all, or do you do it in chunks? It would be amazing if large-context LLMs could do it in one go..

Maybe including a quick section on the prompt used could be helpful for people as part of the tutorial?

rbyh commented 4 weeks ago

Thanks! Yes, there is definitely back-and-forth depending on the amount of instruction and amount of extraneous text/code. Here's an example prompt (and I'll add it to the notebook): https://chatgpt.com/share/52166f72-0752-4ff8-ae34-c7e4d1043265 It goes faster if I also include an example straight from the content to be reformatted.

Something we'll add to Conjure and the Coop is the ability to use a model to do cognitive testing on a survey, to flag potentially problematic questions or content you may want to change. For example, you can see at a glance there are several questions in this survey that aren't really questions or that you do not want to bother sending to the model as-is:

q_QID101 = QuestionFreeText(
    question_name = "QID101",
    question_text = "Before we get started, please answer some questions about yourself."
)

If we send it to a (good) model it will return "None" or a response about why it can't answer the question, but it's a waste of tokens.
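As a rough sketch of what that check could look like today (hypothetical -- this is not a Conjure feature yet, just one question asking a model about another):

from edsl.questions import QuestionYesNo

# Hypothetical pre-flight check: ask a model whether an item is actually
# answerable before sending it as a survey question.
item = "Before we get started, please answer some questions about yourself."

check = QuestionYesNo(
    question_name="is_answerable",
    question_text=f"Could a survey respondent give a meaningful answer to the following item? Item: {item}",
)

result = check.run()
result.select("is_answerable").print()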

rbyh commented 4 weeks ago

Hmm, the ChatGPT link is not working for me -- here is the prompt that worked very well for me. I was able to run it with half of the original code:

Please extract the questions from the following survey code and reformat them in edsl (a python library: https://docs.expectedparrot.com). Here are some examples of questions formatted in edsl:

q_QID44 = QuestionMultipleChoice(
    question_name = "QID44",
    question_text = "Acknowledging another's technical assistance in publication without that person's permission.",
    question_options = [
        "Completely Indefensible",
        "Moderately Indefensible",
        "Somewhat Indefensible",
        "Neither Defensible nor Indefensible",
        "Somewhat Defensible",
        "Moderately Defensible",
        "Completely Defensible"
    ]
)

q_QID21 = QuestionLinearScale(
    question_name = "QID21",
    question_text = "For a researcher, how important is choosing a sample size before running a study?",
    question_options = [1, 2, 3, 4, 5, 6, 7],
    option_labels = {1: "Not At All Important", 7: "Very Important"}
)

Here is the code to use:

andreifoldes commented 4 weeks ago

Thanks - I'll also share the Word-formatted version of the same file. I just imported the .qsf and exported it as Word using Qualtrics's own converter. Ethics_Training_Time_1_Version_1.docx

Maybe it's useful to actually use the Word version for Qualtrics users, since it's less token-hungry?

rbyh commented 4 weeks ago

Thanks! Yes, that could be the case re: docs and tokens. Designing for the Qualtrics Word version is a great suggestion.

The CES survey data (the notebooks I mentioned above) only had PDFs of the surveys online, but similar prompts worked fine when I copied and pasted that text. I also had to batch it for ChatGPT, but I think having a few back-and-forth prompts on a small section to start can be fastest.

johnjosephhorton commented 4 weeks ago

It's not perfect, but I wrote some code to parse the QSF file directly:

https://www.expectedparrot.com/content/2dac60b9-ad85-4c73-854e-7d775531f8d5

It still has some HTML cruft (e.g., <br> tags), it's not turning linear-scale questions into linear scales, and there are some questions that don't work with our validators, e.g., the single-option consent checkbox fails because we require at least two options. Qualtrics also has the notion of a 'question' that is just text with instructions, which we don't (currently) support but could.


import json
import html
import re

from edsl import Question
from edsl import Survey

qualtrics_codes = {
    "TE": "free_text",
    "MC": "multiple_choice",
}
# TE (Text Entry): Allows respondents to input a text response.
# MC (Multiple Choice): Provides respondents with a list of options to choose from.
# DB (Descriptive Text or Information): Displays text or information without requiring a response.
# Matrix: A grid-style question where respondents can evaluate multiple items using the same set of response options.

def clean_html(raw_html):
    # Unescape HTML entities
    clean_text = html.unescape(raw_html)
    # Remove HTML tags
    clean_text = re.sub(r"<.*?>", "", clean_text)
    # Replace non-breaking spaces with regular spaces
    clean_text = clean_text.replace("\xa0", " ")
    # Optionally, strip leading/trailing spaces
    clean_text = clean_text.strip()
    return clean_text

class SurveyQualtricsImport:

    def __init__(self, qsf_file_name: str):
        self.qsf_file_name = qsf_file_name
        self.question_data = self.extract_questions_from_json()

    def create_survey(self):
        survey = Survey()
        for question in self.question_data:
            if question["question_type"] == "free_text":
                try:
                    q = Question(
                        question_type="free_text",
                        question_text=question["question_text"],
                        question_name=question["question_name"],
                    )
                except Exception as e:
                    print(f"Error creating free text question: {e}")
                    continue
            elif question["question_type"] == "multiple_choice":
                try:
                    q = Question(
                        question_type="multiple_choice",
                        question_text=question["question_text"],
                        question_name=question["question_name"],
                        question_options=question["question_options"],
                    )
                except Exception as e:
                    print(f"Error creating multiple choice question: {e}")
                    continue
            else:
                # raise ValueError(f"Unknown question type: {question['question_type']}")
                print(f"Unknown question type: {question['question_type']}")
                continue

            survey.add_question(q)

        return survey

    def extract_questions_from_json(self):
        with open(self.qsf_file_name, "r") as f:
            survey_data = json.load(f)

        questions = survey_data["SurveyElements"]

        extracted_questions = []

        for question in questions:
            if question["Element"] == "SQ":
                q_id = question["PrimaryAttribute"]
                q_text = clean_html(question["Payload"]["QuestionText"])
                q_type = qualtrics_codes.get(question["Payload"]["QuestionType"])

                options = None
                if "Choices" in question["Payload"]:
                    options = [
                        choice["Display"]
                        for choice in question["Payload"]["Choices"].values()
                    ]

                extracted_questions.append(
                    {
                        "question_name": q_id,
                        "question_text": q_text,
                        "question_type": q_type,
                        "question_options": options,
                    }
                )

        return extracted_questions

if __name__ == "__main__":
    survey_creator = SurveyQualtricsImport("example.qsf")
    survey = survey_creator.create_survey()
    info = survey.push()
    print(info)
    # questions = survey_creator.extract_questions_from_json()
    # for question in questions:
    #     print(question)

andreifoldes commented 4 weeks ago

Awesome! Relatedly, in the current versions of EDSL, would the kind of Qualtrics introductory text descriptions ("study description/consent form") be used as prompts for the agent personas?

Maybe I could expand on your code so the user is flagged when submitting questions that aren't implemented in EDSL?
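Something like this, maybe (an untested sketch on top of your SurveyQualtricsImport class above):

def flag_unsupported(importer):
    # Untested sketch: list the question IDs whose Qualtrics type has no EDSL
    # mapping yet (the parser sets question_type to None for those), so the
    # user sees what would be dropped instead of a silent skip.
    supported = set(qualtrics_codes.values())
    return [
        q["question_name"]
        for q in importer.question_data
        if q["question_type"] not in supported
    ]

# for name in flag_unsupported(SurveyQualtricsImport("example.qsf")):
#     print(f"Not yet supported in EDSL: {name}")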

rbyh commented 4 weeks ago

That would be fantastic if you want to iterate on the code!

Re: intro texts -- there are multiple options and considerations.

  1. By default, questions are administered asynchronously. You can add skip/stop/other rules and "memories" of other questions and answers within the same survey (there's a docs page on this). If you add memories of all prior questions in a section, then you could put a section intro in the first question's text and it will be present as the context builds.
  2. But if you want to keep questions asynchronous, you can either put intros in the agent instruction field or create a separate survey for each section and combine the results afterwards (see the sketch below).
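Here's a minimal sketch of option 2 with the intro text in the agent instruction (the intro text and persona are placeholders, and survey is assumed to be the Survey object built from your reformatted questions):

from edsl import Agent

# Placeholder consent/intro text extracted from the Qualtrics file.
intro_text = "This study asks about research ethics training."

agent = Agent(
    traits={"persona": "a mid-career academic researcher"},
    instruction=intro_text + " Please answer the following survey questions in this persona.",
)

# 'survey' is the EDSL Survey built from the reformatted questions.
results = survey.by(agent).run()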