Watts-Lab / commonsense-statements


Automated statement labeling #10

Open markwhiting opened 11 months ago

markwhiting commented 11 months ago

Check how GPT labels statements on our labeling task. Use $\text{Global } R^2 = 1 - \frac{\mathrm{MSE}(\text{prediction},\,\text{actual})}{\mathrm{MSE}(\text{baseline},\,\text{actual})}$ to score, and we can visualize the results in Observable.
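
A minimal sketch of that score, assuming `prediction`, `actual`, and `baseline` are plain numeric arrays (the names are illustrative):

```python
import numpy as np

def global_r2(prediction, actual, baseline):
    """Global R^2 = 1 - MSE(prediction, actual) / MSE(baseline, actual)."""
    prediction, actual, baseline = (np.asarray(x, dtype=float) for x in (prediction, actual, baseline))
    mse_model = np.mean((prediction - actual) ** 2)
    mse_baseline = np.mean((baseline - actual) ** 2)
    return 1.0 - mse_model / mse_baseline

# Example: the baseline predicts the training-set mean for every test statement.
# global_r2(gpt_labels, human_labels, np.full(len(human_labels), train_mean))
```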

Would be nice to see how we do on each question.

markwhiting commented 11 months ago

Originally posted by @amirrr in https://github.com/Watts-Lab/Commonsense-Platform/issues/86#issuecomment-1757692680

Dimensions of a statement and their definitions:

behavior

everyday

figure_of_speech

judgment

opinion

reasoning

Also:

category

Which knowledge category or categories describe this claim? (choose all that apply)

markwhiting commented 11 months ago

Note, we should check labels against the original version of the statement, because cleaned statements might need different labels.

Once we have a good labeling strategy, we should freshly label all the new clean statements (Watts-Lab/commonsense-statements#9).

markwhiting commented 11 months ago

Non-GPT-based approach

Feature-based model (sketched in code below)

leave one category out 
for each feature in [behavior ... ]: 
  training_data = data[category != LOO_category]
  test_data = data[category == LOO_category]
  model: feature ~ embedding on training_data
  predict: feature ~ embedding on test_data
  baseline: mode(feature) on training_data

Category-based model

leave one design point out (a particular combo of features) 
multinomial regression: category ~ embedding
(same style testing regime)

Could try mean or mode for baseline.

The model type can be random forest or XGBoost.
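
In code, the feature-based loop might look roughly like this (a sketch, assuming a DataFrame with an `embeddings` column, a `category` column, and one 0/1 column per feature):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]

def mse(a, b):
    return np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2)

def loo_by_category(data: pd.DataFrame, loo_category: str = "Society and social sciences") -> pd.DataFrame:
    train = data[data["category"] != loo_category]
    test = data[data["category"] == loo_category]
    X_train, X_test = np.vstack(train["embeddings"]), np.vstack(test["embeddings"])

    rows = []
    for feature in FEATURES:
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(X_train, train[feature])
        pred = model.predict(X_test)
        baseline = np.full(len(test), train[feature].mode()[0])  # mode of the training labels
        rows.append({
            "feature": feature,
            "f1": f1_score(test[feature], pred),
            "global_r2": 1 - mse(pred, test[feature]) / mse(baseline, test[feature]),
        })
    return pd.DataFrame(rows)
```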

GPT approach

Just ask GPT the questions?
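
e.g., roughly like this (the model name and question wording below are placeholders, not a prompt we've settled on):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTIONS = {
    "behavior": "Is this statement about behavior?",
    "everyday": "Is this statement about everyday life?",
    "figure_of_speech": "Is this statement a figure of speech?",
    "judgment": "Does this statement express a judgment?",
    "opinion": "Is this statement an opinion?",
    "reasoning": "Does this statement involve reasoning?",
}

def label_statement(statement: str, model: str = "gpt-4") -> dict:
    """Ask one yes/no question per feature and return 0/1 labels."""
    labels = {}
    for feature, question in QUESTIONS.items():
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": "Answer with 1 for yes or 0 for no."},
                {"role": "user", "content": f"{question}\n\nStatement: {statement}"},
            ],
        )
        answer = response.choices[0].message.content.strip()
        labels[feature] = 1 if answer.startswith("1") else 0
    return labels
```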

amirrr commented 11 months ago

Here's the data for the non-GPT approach (leaving out the category "Society and social sciences" since it resulted in the most accurate model).

| Feature | Random Forest | XGBoost | GPT |
| --- | --- | --- | --- |
| behavior | -0.333 | -1.475 | 0.097 |
| everyday | 0.092 | 0.476 | 0.056 |
| figure_of_speech | 0.000 | 0.156 | 0.105 |
| judgment | 0.034 | -0.487 | -0.097 |
| opinion | -0.022 | -0.126 | 0.165 |
| reasoning | 0.028 | 0.502 | -0.128 |

markwhiting commented 11 months ago

Great, so we seem to need to do better, hahaha. Also, I think it's fine to trim these to 3 decimal places (e.g., -0.333), and we probably only need $R^2$.

Perhaps we can lay the columns out like this (where each column shows the $R^2$ for that model on each feature): Random Forest | XGBoost | GPT

markwhiting commented 11 months ago

I edited your comment a bit more to indicate what I was thinking. (For some reason GitHub doesn't send notifications for edits.)

amirrr commented 10 months ago

The table is complete now. I ran the GPT labeling against the first 2,000 statements. Refer to this issue for more details about the prompt.

markwhiting commented 10 months ago

Thanks. Interesting. We're not doing very well.

Just so I understand, how are you doing the score calculation for GPT?

Would you mind making a second table that shows F1 scores for each of these as well?

amirrr commented 10 months ago

| Feature | GPT Mean | GPT Mode |
| --- | --- | --- |
| behavior | 0.498 | 0.097 |
| everyday | 0.509 | 0.056 |
| figure_of_speech | 0.105 | 0.000 |
| judgment | 0.529 | 0.035 |
| opinion | 0.497 | -0.022 |
| reasoning | 0.585 | 0.028 |

markwhiting commented 10 months ago

Interesting! Would you mind doing that for the others too? Just to see if our scores there get a lot better?

amirrr commented 10 months ago

These are the Jaccard accuracy, F1, and global $R^2$ scores (with the baseline being the average of scores) for the Random Forest (RF) and XGBoost methods on labeling statements.

| Feature | RF Jaccard | RF F1 | RF Global $R^2$ | XGBoost Jaccard | XGBoost F1 | XGBoost Global $R^2$ | GPT F1 | GPT Global $R^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| behavior | 0.934 | 0.966 | -0.497 | 0.950 | 0.950 | -1.594 | 0.794 | 0.498 |
| everyday | 0.511 | 0.677 | -0.733 | 0.674 | 0.674 | 0.083 | 0.791 | 0.509 |
| figure_of_speech | 0.028 | 0.054 | -0.055 | 0.182 | 0.182 | 0.084 | 0.402 | 0.105 |
| judgment | 0.939 | 0.968 | -0.031 | 0.963 | 0.963 | -0.588 | 0.772 | 0.529 |
| opinion | 0.891 | 0.943 | -0.184 | 0.940 | 0.940 | -0.252 | 0.769 | 0.497 |
| reasoning | 0.438 | 0.609 | -0.826 | 0.623 | 0.623 | 0.056 | 0.776 | 0.585 |

markwhiting commented 10 months ago

How interesting. So none of these is really good enough for everything, though most are OK on some of the features.

One more way we could look at this: for each of these samples, can you balance the data so that the training data have an equal number of each class for each feature?

After taking out a test split, downsample the larger group to the number of items in the smaller group.
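
Something like this (a sketch, assuming binary 0/1 feature labels; the function name is illustrative):

```python
import pandas as pd

def balance_training_data(train: pd.DataFrame, feature: str, seed: int = 0) -> pd.DataFrame:
    """Downsample the larger class so both classes of `feature` are equally represented."""
    n = train[feature].value_counts().min()          # size of the smaller class
    parts = [grp.sample(n=n, random_state=seed)      # sample n rows from each class
             for _, grp in train.groupby(feature)]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle the rows
```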

markwhiting commented 10 months ago

@joshnguyen99 — can you also put updates here when you have them. Also, we have a project for this https://github.com/orgs/Watts-Lab/projects/27/views/5 and we can start tracking some of our efforts there.

joshnguyen99 commented 10 months ago

Sorry for the late message—there were some bugs in my training scripts but I managed to fix them.

Below is the performance of the best model for each feature. I held out 10% of the dataset, stratified by the predicted feature.

| Feature | Precision | Recall | F1 | AUROC |
| --- | --- | --- | --- | --- |
| behavior | 0.768 | 0.833 | 0.799 | 0.725 |
| everyday | 0.630 | 1.000 | 0.773 | 0.474 |
| figure_of_speech | 0.641 | 0.294 | 0.403 | 0.741 |
| judgment | 0.790 | 0.897 | 0.840 | 0.749 |
| opinion | 0.635 | 1.000 | 0.777 | 0.564 |
| reasoning | 0.619 | 1.000 | 0.765 | 0.608 |

We have a minor improvement in the F1 score for figure_of_speech and reasoning compared to RF (what model was this again, @amirrr?).

Compared to GPT it's pretty much the same. But I don't think we used the same test set, so it might be worth it to sync up.

I have also added you to the repo for my finetuning scripts (https://github.com/joshnguyen99/commonsense-playground). Also added you to my wandb project to keep track of finetuning if you're interested.

markwhiting commented 10 months ago

Great, thanks! Can you move that repo into the Watts-Lab org — we like to keep stuff centralized where possible.

Interesting that the simple models seem to be doing best overall still.

I think RF is random forest with embedding as features.

Why don't we set up common train-test split code, so we can do repeatable splits? I think @amirrr was working on running RF and XGBoost with a balanced training set, which we think will dramatically help the figure_of_speech $F_1$. Can you share those results too, @amirrr?

markwhiting commented 10 months ago

One more thing: we are doing a lot of LM-related stuff in commonsense-lm. Not sure if we want to share that space between our GPT explorations and other models, but I think ultimately we probably want a single place for it all. We can talk through the logistics of that next week.

joshnguyen99 commented 10 months ago

@markwhiting — Sure, I can move the code to Watts-Lab! For now, I will commit it to a folder within commonsense-statements, next to Amir's training scripts. Let's talk about how LLM-related code can be organized under one big repo when we meet.

joshnguyen99 commented 10 months ago

OK, I might have found something in @amirrr's code that led to very different results from mine.

In dimension-prediction.ipynb, you used this to perform train-test-split:

for outcome in outcomes:
    X_train = merged_df[merged_df['category'] != 'Society and social sciences'].embeddings
    y_train = merged_df[merged_df['category'] != 'Society and social sciences'][outcome]

    X_test = merged_df[merged_df['category'] == 'Society and social sciences'].embeddings
    y_test = merged_df[merged_df['category'] == 'Society and social sciences'][outcome]

For example, if `outcome == "behavior"`, the training set is every statement outside the "Society and social sciences" category and the test set is only that category.

This is not entirely random, and we have very different percentages of positive examples in the training and test sets:

| Feature | Train positive rate | Test positive rate |
| --- | --- | --- |
| behavior | 63.4% (2503/3950) | 95.4% (436/457) |
| everyday | 63.7% (2518/3950) | 57.1% (261/457) |
| figure_of_speech | 20.7% (818/3950) | 7.9% (36/457) |
| judgment | 69.9% (2763/3950) | 93.7% (428/457) |
| opinion | 60.6% (2394/3950) | 89.9% (411/457) |
| reasoning | 63.1% (2494/3950) | 52.7% (241/457) |

In other words, the training and test subsets aren't stratified.
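
For comparison, a stratified split (reusing `merged_df` and `outcome` from the snippet above) would look something like:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hold out 10%, keeping the positive rate (roughly) equal in train and test.
train_df, test_df = train_test_split(
    merged_df,
    test_size=0.1,
    stratify=merged_df[outcome],  # stratify on the binary label for this feature
    random_state=42,              # fixed seed so the split is repeatable
)
X_train, y_train = np.vstack(train_df["embeddings"]), train_df[outcome]
X_test, y_test = np.vstack(test_df["embeddings"]), test_df[outcome]
```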

amirrr commented 10 months ago

Got it. We are going to fix this by splitting statements into two groups based on their embeddings and cosine similarity, and then matching them on whatever we are trying to model. This should make the training and testing groups more balanced, which should help accuracy. I will share the train and test splits in the same repository.
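
Roughly, the matching step could look like this (just a sketch of one way to do greedy nearest-neighbour pairing; the actual code will be in the repo):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pair_by_similarity(pos_emb, neg_emb):
    """Pair each positive statement with its most similar unused negative statement.

    Assumes there are at least as many negatives as positives.
    """
    sims = cosine_similarity(pos_emb, neg_emb)  # (n_pos, n_neg) similarity matrix
    used, pairs = set(), []
    for i in range(sims.shape[0]):
        order = np.argsort(-sims[i])            # negatives, most similar first
        j = next(int(k) for k in order if int(k) not in used)
        used.add(j)
        pairs.append((i, j))
    return pairs
```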

amirrr commented 9 months ago

Results with the balanced dataset test:


| Feature | XGBoost Accuracy | XGBoost F1 | XGBoost $R^2$ | RF Accuracy | RF F1 | RF $R^2$ |
| --- | --- | --- | --- | --- | --- | --- |
| behavior | 0.588 | 0.740 | 0.691 | 0.571 | 0.727 | 0.514 |
| everyday | 0.522 | 0.686 | 0.656 | 0.499 | 0.665 | 0.426 |
| figure_of_speech | 0.295 | 0.455 | -0.053 | 0.304 | 0.466 | -0.456 |
| judgment | 0.563 | 0.720 | 0.695 | 0.541 | 0.702 | 0.491 |
| opinion | 0.570 | 0.726 | 0.671 | 0.553 | 0.712 | 0.476 |
| reasoning | 0.501 | 0.667 | 0.650 | 0.462 | 0.631 | 0.393 |

joshnguyen99 commented 8 months ago

@amirrr and @markwhiting, I have uploaded the 6 roberta-large models (for predicting dimensions) to HuggingFace.

They can be found on our lab's HF page: https://huggingface.co/CSSLab

You can try the Inference API on the right-hand side.
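
You can also query them locally with transformers (the model id below is a placeholder; check the org page for the exact names):

```python
from transformers import pipeline

# Placeholder model id; see https://huggingface.co/CSSLab for the actual names.
clf = pipeline("text-classification", model="CSSLab/roberta-large-behavior")

print(clf("People usually greet each other when they meet."))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}] -- exact labels depend on the model config
```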

joshnguyen99 commented 7 months ago

Here's the performance of a multi-label classifier. I used the non-chat version of TinyLlama (1.1B params) and fine-tuned it once using the multi-label version of our dataset.

| Feature | Precision | Recall | F1 | AUROC |
| --- | --- | --- | --- | --- |
| behavior | 0.843 | 0.548 | 0.664 | 0.744 |
| everyday | 0.711 | 0.507 | 0.592 | 0.669 |
| figure_of_speech | 0.667 | 0.291 | 0.405 | 0.769 |
| judgment | 0.774 | 0.533 | 0.631 | 0.594 |
| opinion | 0.762 | 0.552 | 0.641 | 0.675 |
| reasoning | 0.691 | 0.786 | 0.735 | 0.645 |
| Micro | 0.747 | 0.564 | 0.643 | 0.724 |
| Macro | 0.741 | 0.536 | 0.611 | 0.683 |

Overall this looks better than the RoBERTa models used before, but not significantly. The perk here is that this is only one multi-label model instead of six binary classifiers.

(I also fine-tuned LLaMA-2 7B for this task, but it actually performs worse than TinyLlama. I suspect it's mostly because the dataset is small relative to the model size, evidenced by the relatively high variance during fine-tuning.)
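
For reference, the bare bones of the multi-label setup look roughly like this (the model id and example labels are illustrative, not the exact fine-tuning configuration):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # a non-chat base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers ship without a pad token

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(FEATURES),
    problem_type="multi_label_classification",  # sigmoid + BCE loss over the six features
)
model.config.pad_token_id = tokenizer.pad_token_id

# One forward pass: labels are a float vector with one entry per feature.
batch = tokenizer(["Most people sleep at night."], return_tensors="pt", padding=True)
labels = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0]])
outputs = model(**batch, labels=labels)
print(outputs.loss, torch.sigmoid(outputs.logits))
```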

markwhiting commented 7 months ago

Thanks. Any proposals for getting a better result? At this point, I feel like using something like that for some properties and @amirrr's latest models for others gets us the best overall rate, though it would be great to put it all into a single model like the design you have while achieving winning quality on every dimension.

markwhiting commented 3 months ago

Moving this to the statements repo.