Watts-Lab / commonsense-statements


Automated statement labeling #10

Open markwhiting opened 11 months ago

markwhiting commented 11 months ago

Check how GPT labels statements on our labeling task. Use $\text{Global } R^2 = 1 - \frac{\mathrm{MSE}(\text{prediction},\,\text{actual})}{\mathrm{MSE}(\text{baseline},\,\text{actual})}$ to score, and we can visualize the results in Observable.
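
A minimal sketch of that score, assuming `prediction`, `actual`, and `baseline` are plain numeric arrays (the names are illustrative):

```python
import numpy as np

def global_r2(prediction, actual, baseline):
    """Global R^2 = 1 - MSE(prediction, actual) / MSE(baseline, actual)."""
    prediction, actual, baseline = (np.asarray(x, dtype=float) for x in (prediction, actual, baseline))
    mse_model = np.mean((prediction - actual) ** 2)
    mse_baseline = np.mean((baseline - actual) ** 2)
    return 1.0 - mse_model / mse_baseline

# Example: the baseline predicts the training-set mean for every test statement.
# global_r2(gpt_labels, human_labels, np.full(len(human_labels), train_mean))
```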

Would be nice to see how we do on each question.

markwhiting commented 11 months ago

Originally posted by @amirrr in https://github.com/Watts-Lab/Commonsense-Platform/issues/86#issuecomment-1757692680

Dimensions of a statement and their definitions:

behavior

everyday

figure_of_speech

judgment

opinion

reasoning

Also:

category

Which knowledge category or categories describe this claim? (choose all that apply)

markwhiting commented 11 months ago

Note, we should check labels against the original version of the statement, because cleaned statements might need different labels.

Once we have a good labeling strategy, we should freshly label all the new clean statements (Watts-Lab/commonsense-statements#9).

markwhiting commented 11 months ago

Non-GPT-based approach

Feature-based model (sketched in code below)

leave one category out 
for each feature in [behavior ... ]: 
  training_data = data[category != LOO_category]
  test_data = data[category == LOO_category]
  model: feature ~ embedding on training_data
  predict: feature ~ embedding on test_data
  baseline: mode(feature) on training_data

Category-based model

leave one design point out (a particular combo of features) 
multinomial regression: category ~ embedding
(same style testing regime)

Could try mean or mode for baseline.

The model type can be random forest or XGBoost.
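
In code, the feature-based loop might look roughly like this (a sketch, assuming a DataFrame with an `embeddings` column, a `category` column, and one 0/1 column per feature):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]

def mse(a, b):
    return np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2)

def loo_by_category(data: pd.DataFrame, loo_category: str = "Society and social sciences") -> pd.DataFrame:
    train = data[data["category"] != loo_category]
    test = data[data["category"] == loo_category]
    X_train, X_test = np.vstack(train["embeddings"]), np.vstack(test["embeddings"])

    rows = []
    for feature in FEATURES:
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(X_train, train[feature])
        pred = model.predict(X_test)
        baseline = np.full(len(test), train[feature].mode()[0])  # mode of the training labels
        rows.append({
            "feature": feature,
            "f1": f1_score(test[feature], pred),
            "global_r2": 1 - mse(pred, test[feature]) / mse(baseline, test[feature]),
        })
    return pd.DataFrame(rows)
```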

GPT approach

Just ask GPT the questions?
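
e.g., roughly like this (the model name and question wording below are placeholders, not a prompt we've settled on):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTIONS = {
    "behavior": "Is this statement about behavior?",
    "everyday": "Is this statement about everyday life?",
    "figure_of_speech": "Is this statement a figure of speech?",
    "judgment": "Does this statement express a judgment?",
    "opinion": "Is this statement an opinion?",
    "reasoning": "Does this statement involve reasoning?",
}

def label_statement(statement: str, model: str = "gpt-4") -> dict:
    """Ask one yes/no question per feature and return 0/1 labels."""
    labels = {}
    for feature, question in QUESTIONS.items():
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": "Answer with 1 for yes or 0 for no."},
                {"role": "user", "content": f"{question}\n\nStatement: {statement}"},
            ],
        )
        answer = response.choices[0].message.content.strip()
        labels[feature] = 1 if answer.startswith("1") else 0
    return labels
```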

amirrr commented 11 months ago

Here's the data for the non-GPT approach (leaving out the category "Society and social sciences" since it resulted in the most accurate model).

| Feature | Random Forest | XGBoost | GPT |
| --- | --- | --- | --- |
| behavior | -0.333 | -1.475 | 0.097 |
| everyday | 0.092 | 0.476 | 0.056 |
| figure_of_speech | 0.000 | 0.156 | 0.105 |
| judgment | 0.034 | -0.487 | -0.097 |
| opinion | -0.022 | -0.126 | 0.165 |
| reasoning | 0.028 | 0.502 | -0.128 |

markwhiting commented 11 months ago

Great, so we seem to need to do better, hahaha. Also, I think it's fine to trim these to 3 decimal places (e.g., -0.333), and we probably only need $R^2$.

Perhaps we can lay the columns out like this (where each column shows the $R^2$ for that model on each feature): Random Forest | XGBoost | GPT

markwhiting commented 11 months ago

I edited your comment a bit more to indicate what I was thinking. (For some reason GitHub doesn't send notifications for edits.)

amirrr commented 10 months ago

The table is complete now. I ran the GPT labeling against the first 2,000 statements. Refer to this issue for more details about the prompt.

markwhiting commented 10 months ago

Thanks. Interesting. We're not doing very well.

Just so I understand, how are you doing the score calculation for GPT?

Would you mind making a second table that shows F1 scores for each of these as well?

amirrr commented 10 months ago

| Feature | GPT Mean | GPT Mode |
| --- | --- | --- |
| behavior | 0.498 | 0.097 |
| everyday | 0.509 | 0.056 |
| figure_of_speech | 0.105 | 0.000 |
| judgment | 0.529 | 0.035 |
| opinion | 0.497 | -0.022 |
| reasoning | 0.585 | 0.028 |

markwhiting commented 10 months ago

Interesting! Would you mind doing that for the others too? Just to see if our scores there get a lot better?

amirrr commented 10 months ago

These are the Jaccard accuracy, F1, and global $R^2$ scores (with the baseline being the average of scores) for the Random Forest (RF) and XGBoost methods on labeling statements.

| Feature | RF Jaccard | RF F1 | RF Global $R^2$ | XGBoost Jaccard | XGBoost F1 | XGBoost Global $R^2$ | GPT F1 | GPT Global $R^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| behavior | 0.934 | 0.966 | -0.497 | 0.950 | 0.950 | -1.594 | 0.794 | 0.498 |
| everyday | 0.511 | 0.677 | -0.733 | 0.674 | 0.674 | 0.083 | 0.791 | 0.509 |
| figure_of_speech | 0.028 | 0.054 | -0.055 | 0.182 | 0.182 | 0.084 | 0.402 | 0.105 |
| judgment | 0.939 | 0.968 | -0.031 | 0.963 | 0.963 | -0.588 | 0.772 | 0.529 |
| opinion | 0.891 | 0.943 | -0.184 | 0.940 | 0.940 | -0.252 | 0.769 | 0.497 |
| reasoning | 0.438 | 0.609 | -0.826 | 0.623 | 0.623 | 0.056 | 0.776 | 0.585 |

markwhiting commented 10 months ago

How interesting. So none of these is really good enough for everything, though most are OK on some of the features.

One more way we could look at this: for each of these samples, can you balance the data so that the training data have an equal number of each class for each feature?

After taking out a test split, downsample the larger group to the number of items in the smaller group.
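
Something like this (a sketch, assuming binary 0/1 feature labels; the function name is illustrative):

```python
import pandas as pd

def balance_training_data(train: pd.DataFrame, feature: str, seed: int = 0) -> pd.DataFrame:
    """Downsample the larger class so both classes of `feature` are equally represented."""
    n = train[feature].value_counts().min()          # size of the smaller class
    parts = [grp.sample(n=n, random_state=seed)      # sample n rows from each class
             for _, grp in train.groupby(feature)]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle the rows
```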

markwhiting commented 10 months ago

@joshnguyen99 — can you also put updates here when you have them. Also, we have a project for this https://github.com/orgs/Watts-Lab/projects/27/views/5 and we can start tracking some of our efforts there.

joshnguyen99 commented 10 months ago

Sorry for the late message—there were some bugs in my training scripts but I managed to fix them.

Below is the performance of the best model for each feature. I held out 10% of the dataset, stratified by the predicted feature.

| Feature | Precision | Recall | F1 | AUROC |
| --- | --- | --- | --- | --- |
| behavior | 0.768 | 0.833 | 0.799 | 0.725 |
| everyday | 0.630 | 1.000 | 0.773 | 0.474 |
| figure_of_speech | 0.641 | 0.294 | 0.403 | 0.741 |
| judgment | 0.790 | 0.897 | 0.840 | 0.749 |
| opinion | 0.635 | 1.000 | 0.777 | 0.564 |
| reasoning | 0.619 | 1.000 | 0.765 | 0.608 |

We have a minor improvement in the F1 score for figure_of_speech and reasoning compared to RF (what model was this again, @amirrr?).

Compared to GPT it's pretty much the same. But I don't think we used the same test set, so it might be worth it to sync up.

I have also added you to the repo for my finetuning scripts (https://github.com/joshnguyen99/commonsense-playground). Also added you to my wandb project to keep track of finetuning if you're interested.

markwhiting commented 10 months ago

Great, thanks! Can you move that repo into the Watts-Lab org — we like to keep stuff centralized where possible.

Interesting that the simple models seem to be doing best overall still.

I think RF is random forest with embedding as features.

Why don't we set up common train-test split code, so we can do repeatable splits? I think @amirrr was working on running RF and XGBoost with a balanced training set, which we think will dramatically help the figure_of_speech $F_1$. Can you share those results too, @amirrr?

markwhiting commented 10 months ago

One more thing: we are doing a lot of LM-related stuff in commonsense-lm. Not sure if we want to share that space between our GPT explorations and other models, but I think ultimately we probably want a single place for it all. We can talk through the logistics of that next week.

joshnguyen99 commented 10 months ago

@markwhiting — Sure, I can move the code to Watts-Lab! For now, I will commit it to a folder within commonsense-statements, next to Amir's training scripts. Let's talk about how LLM-related code can be organized under one big repo when we meet.

joshnguyen99 commented 10 months ago

OK, I might have found something in @amirrr's code that led to very different results from mine.

In dimension-prediction.ipynb, you used this to perform train-test-split:

for outcome in outcomes:
    X_train = merged_df[merged_df['category'] != 'Society and social sciences'].embeddings
    y_train = merged_df[merged_df['category'] != 'Society and social sciences'][outcome]

    X_test = merged_df[merged_df['category'] == 'Society and social sciences'].embeddings
    y_test = merged_df[merged_df['category'] == 'Society and social sciences'][outcome]

For example, if `outcome == "behavior"`, the training set is every statement outside the "Society and social sciences" category and the test set is only that category.

This is not entirely random, and we have very different percentages of positive examples in the training and test sets:

| Feature | Train positive rate | Test positive rate |
| --- | --- | --- |
| behavior | 63.4% (2503/3950) | 95.4% (436/457) |
| everyday | 63.7% (2518/3950) | 57.1% (261/457) |
| figure_of_speech | 20.7% (818/3950) | 7.9% (36/457) |
| judgment | 69.9% (2763/3950) | 93.7% (428/457) |
| opinion | 60.6% (2394/3950) | 89.9% (411/457) |
| reasoning | 63.1% (2494/3950) | 52.7% (241/457) |

In other words, the training and test subsets aren't stratified.
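
For comparison, a stratified split (reusing `merged_df` and `outcome` from the snippet above) would look something like:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hold out 10%, keeping the positive rate (roughly) equal in train and test.
train_df, test_df = train_test_split(
    merged_df,
    test_size=0.1,
    stratify=merged_df[outcome],  # stratify on the binary label for this feature
    random_state=42,              # fixed seed so the split is repeatable
)
X_train, y_train = np.vstack(train_df["embeddings"]), train_df[outcome]
X_test, y_test = np.vstack(test_df["embeddings"]), test_df[outcome]
```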

amirrr commented 10 months ago

Got it. We are going to fix this by splitting statements into two groups based on their embeddings and cosine similarity, and then matching them on whatever we are trying to model. This should make the training and testing groups more balanced, which should help accuracy. I will share the train and test splits in the same repository.
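
Roughly, the matching step could look like this (just a sketch of one way to do greedy nearest-neighbour pairing; the actual code will be in the repo):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pair_by_similarity(pos_emb, neg_emb):
    """Pair each positive statement with its most similar unused negative statement.

    Assumes there are at least as many negatives as positives.
    """
    sims = cosine_similarity(pos_emb, neg_emb)  # (n_pos, n_neg) similarity matrix
    used, pairs = set(), []
    for i in range(sims.shape[0]):
        order = np.argsort(-sims[i])            # negatives, most similar first
        j = next(int(k) for k in order if int(k) not in used)
        used.add(j)
        pairs.append((i, j))
    return pairs
```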

amirrr commented 9 months ago

Results with the balanced dataset test:


| Feature | XGBoost Accuracy | XGBoost F1 | XGBoost $R^2$ | RF Accuracy | RF F1 | RF $R^2$ |
| --- | --- | --- | --- | --- | --- | --- |
| behavior | 0.588 | 0.740 | 0.691 | 0.571 | 0.727 | 0.514 |
| everyday | 0.522 | 0.686 | 0.656 | 0.499 | 0.665 | 0.426 |
| figure_of_speech | 0.295 | 0.455 | -0.053 | 0.304 | 0.466 | -0.456 |
| judgment | 0.563 | 0.720 | 0.695 | 0.541 | 0.702 | 0.491 |
| opinion | 0.570 | 0.726 | 0.671 | 0.553 | 0.712 | 0.476 |
| reasoning | 0.501 | 0.667 | 0.650 | 0.462 | 0.631 | 0.393 |

joshnguyen99 commented 8 months ago

@amirrr and @markwhiting, I have uploaded the 6 roberta-large models (for predicting dimensions) to HuggingFace.

They can be found on our lab's HF page: https://huggingface.co/CSSLab

You can try the Inference API on the right-hand side.
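
You can also query them locally with transformers (the model id below is a placeholder; check the org page for the exact names):

```python
from transformers import pipeline

# Placeholder model id; see https://huggingface.co/CSSLab for the actual names.
clf = pipeline("text-classification", model="CSSLab/roberta-large-behavior")

print(clf("People usually greet each other when they meet."))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}] -- exact labels depend on the model config
```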

joshnguyen99 commented 7 months ago

Here's the performance of a multi-label classifier. I used the non-chat version of TinyLlama (1.1B params) and fine-tuned it once using the multi-label version of our dataset.

| Feature | Precision | Recall | F1 | AUROC |
| --- | --- | --- | --- | --- |
| behavior | 0.843 | 0.548 | 0.664 | 0.744 |
| everyday | 0.711 | 0.507 | 0.592 | 0.669 |
| figure_of_speech | 0.667 | 0.291 | 0.405 | 0.769 |
| judgment | 0.774 | 0.533 | 0.631 | 0.594 |
| opinion | 0.762 | 0.552 | 0.641 | 0.675 |
| reasoning | 0.691 | 0.786 | 0.735 | 0.645 |
| Micro | 0.747 | 0.564 | 0.643 | 0.724 |
| Macro | 0.741 | 0.536 | 0.611 | 0.683 |

Overall this looks better than the RoBERTa models used before, but not significantly. The perk here is that this is only one multi-label model instead of six binary classifiers.

(I also fine-tuned LLaMA-2 7B for this task, but it actually performs worse than TinyLlama. I suspect it's mostly because the dataset is small relative to the model size, evidenced by the relatively high variance during fine-tuning.)
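
For reference, the bare bones of the multi-label setup look roughly like this (the model id and example labels are illustrative, not the exact fine-tuning configuration):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # a non-chat base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers ship without a pad token

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(FEATURES),
    problem_type="multi_label_classification",  # sigmoid + BCE loss over the six features
)
model.config.pad_token_id = tokenizer.pad_token_id

# One forward pass: labels are a float vector with one entry per feature.
batch = tokenizer(["Most people sleep at night."], return_tensors="pt", padding=True)
labels = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0]])
outputs = model(**batch, labels=labels)
print(outputs.loss, torch.sigmoid(outputs.logits))
```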

markwhiting commented 7 months ago

Thanks. Any proposals for getting a better result? At this point, I feel like using something like that for some properties and @amirrr's latest models for others gets us the best overall rate, though it would be great to put it all into a single model like the design you have while achieving winning quality on every dimension.

markwhiting commented 3 months ago

Moving this to the statements repo.