Statement Normalization Pipeline using OpenAI

dankim444 commented 3 months ago

Description

This PR introduces a new statement normalization pipeline, cleans the remaining original statements in the raw_statements directory, and introduces minor changes to different files to streamline text extraction (specifically extracting the language code) from filenames. The criteria for normalization is as follows:

The first letter of the statement must be capitalized (if applicable to the language).
Leading and trailing punctuation is removed.
The statement ends in the appropriate full-stop punctuation native to the language.

The normalization pipeline leverages OpenAI, and the news_statements and observable statements were cleaned using gpt-4o while email_statements (due to the size of the files) files were cleaned with gpt-3.5-turbo. During this process, I noticed several differences in performance between the two models. Specifically, gpt-4o was more consistent in not changing the original capitalization of proper nouns, altering the original vocabulary, and not introducing any additional punctuation; whereas gpt-3.5-turbo would make changes despite being explicitly instructed not to in the system prompt. When merged, this PR will close https://github.com/Watts-Lab/commonsense-platform/issues/150, ensuring consistent rendering of statements on the commonsense platform's UI.

New files

normalize_statements_openai.py: script that cleans statements files that have yet to be cleaned in the raw_statements directory.
remove_duplicates_after_normalization.py: script that handles duplicates caused by running the normalize_statements_openai.py script.

Changes

email_statements, news_statements, observable statements
Translate Statements and Remove Any Duplicates workflow: Added a third job 'normalize-statements' that cleans the statement files after they have been translated and removes potential duplicates from translations.
calculate_translation_cost.py: updated the way the language code is extracted from the filename and how filenames are processed.
remove_duplicates.py: minor change to documentation.
show_groups_of_duplicates: removed 'lng' as a column to avoid redundancy.
translate_statements_aws.py: changed how filenames are processed and how language code is extracted.
README.md: included instructions on naming convention of files and translation of files.

Testing

I acted as a "human-in-the-loop" to verify OpenAI's outputs. I used an online Diffchecker tool (https://www.diffchecker.com/) to compare changes made from the original file to the new file. I also used OpenAI playground to verify the system prompt.

Important note

To ensure more consistent output from OpenAI, I recommend using gpt-4o or possibly gpt-4o-mini to normalize the statements. In particular, gpt-3.5-turbo would sometimes remove the capitalization of proper nouns, alter some vocabulary and thereby change the nuanced meaning of some statements, and introduce unintended punctuation. I directly address all these in the system prompt; however, it is open to improvement.

markwhiting commented 3 months ago

Great. Can we switch to 4o for everything? (or have you already)

github-actions[bot] commented 1 month ago

Translation Cost Calculation

cleaned_statements_en.csv still needs to be translated into 9 new languages. This would require translating 12141 characters. It will cost approximately $0.18 to complete these translations.

Watts-Lab / commonsense-statements