markwhiting opened this issue 1 year ago
Originally posted by @amirrr in https://github.com/Watts-Lab/Commonsense-Platform/issues/86#issuecomment-1757692680
Also:
Which knowledge category or categories describe this claim? (choose all that apply)
Note, we should check labels against the original version of the statement, because cleaned statements might need different labels.
Once we have a good labeling strategy, we should freshly label all the new clean statements: Watts-Lab/commonsense-statements#9
**Feature-based model** (leave one category out):

    for each feature in [behavior ...]:
        training_data = data[category != LOO_category]
        test_data = data[category == LOO_category]
        model: feature ~ embedding on training_data
        predict: feature ~ embedding on test_data
        baseline: mode(feature) on training_data

**Category-based model** (leave one design point out, i.e., a particular combo of features):

- Multinomial regression: category ~ embedding
- Same style of testing regime
- Could try mean or mode for baseline

Model type can be random forest or XGBoost.
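A minimal sketch of the feature-based, leave-one-category-out regime above, assuming a pandas DataFrame `data` with a `category` column, an `embedding` column of vectors, and binary 0/1 feature columns, and using a random forest as the stand-in model (all names and settings here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]
LOO_CATEGORY = "Society and social sciences"  # the held-out knowledge category

def leave_one_category_out(data: pd.DataFrame) -> pd.DataFrame:
    """Train on all categories except LOO_CATEGORY, test on LOO_CATEGORY, per feature."""
    train = data[data["category"] != LOO_CATEGORY]
    test = data[data["category"] == LOO_CATEGORY]
    X_train = np.vstack(train["embedding"].to_numpy())
    X_test = np.vstack(test["embedding"].to_numpy())

    rows = []
    for feature in FEATURES:
        y_train, y_test = train[feature].to_numpy(), test[feature].to_numpy()
        model = RandomForestClassifier(n_estimators=300, random_state=0)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)

        baseline = train[feature].mode()[0]  # predict the training mode everywhere
        mse_model = np.mean((pred - y_test) ** 2)
        mse_baseline = np.mean((baseline - y_test) ** 2)
        rows.append({
            "feature": feature,
            "f1": f1_score(y_test, pred),
            "global_r2": 1 - mse_model / mse_baseline,  # as defined at the end of this thread
        })
    return pd.DataFrame(rows)
```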
Just ask GPT the questions?
Here's the data for the non-GPT approach (leaving out the category "Society and social sciences" since it resulted in the most accurate model).
Feature | Random Forest | XGBoost | GPT |
---|---|---|---|
behavior | -0.333 | -1.475 | 0.097 |
everyday | 0.092 | 0.476 | 0.056 |
figure_of_speech | 0.0 | 0.156 | 0.105 |
judgment | 0.034 | -0.487 | -0.097 |
opinion | -0.022 | -0.126 | 0.165 |
reasoning | 0.028 | 0.502 | -0.128 |
Great, so we seem to need to do better, hahaha. Also, I think it's fine to trim these to 3 decimal places (e.g., -0.333), and we probably only need $R^2$.
Perhaps we can look at columns like the following, where each one shows the $R^2$ for that model for each feature: Random Forest, XGBoost, GPT.
I edited your comment a bit more to indicate what I was thinking. (for some reason Github doesn't send notifications for edits)
The table is complete now. I ran the GPT labeling against the first 2,000 statements. Refer to this issue for more details about the prompt.
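For context, here is a rough sketch of what asking GPT the per-feature questions could look like. This is not the actual prompt or script (the real prompt is in the issue referenced above), and the model name and question wording below are assumptions:

```python
# Illustrative only: the real prompt lives in the issue linked above, and the
# model name below is an assumption. Requires the OpenAI Python client (>= 1.0)
# and an OPENAI_API_KEY in the environment.
from openai import OpenAI

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]
client = OpenAI()

def gpt_label(statement: str) -> dict:
    """Ask one yes/no question per feature and return 0/1 labels."""
    labels = {}
    for feature in FEATURES:
        question = (
            f'Statement: "{statement}"\n'
            f'Does this statement involve {feature.replace("_", " ")}? Answer "yes" or "no".'
        )
        response = client.chat.completions.create(
            model="gpt-4",  # assumed; substitute whichever model was actually used
            messages=[{"role": "user", "content": question}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip().lower()
        labels[feature] = int(answer.startswith("yes"))
    return labels
```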
Thanks. Interesting. We're not doing very well.
Just so I understand, how are you doing the score calculation for GPT?
Would you mind making a second table that shows F1 scores for each of these as well?
Feature | GPT Global R-squared (mean baseline) | GPT Global R-squared (mode baseline) |
---|---|---|
behavior | 0.498 | 0.097 |
everyday | 0.509 | 0.056 |
figure_of_speech | 0.105 | 0.000 |
judgment | 0.529 | 0.035 |
opinion | 0.497 | -0.022 |
reasoning | 0.585 | 0.028 |
Interesting! Would you mind doing that for the others too? Just to see if our scores there get a lot better?
These are the Jaccard, F1, and global $R^2$ scores (with the average of scores as the baseline) for the Random Forest (RF) and XGBoost methods on labeling statements.
Feature | RF Jaccard Score | RF F1 Score | RF Global R-squared | XGBoost Jaccard Score | XGBoost F1 Score | XGBoost Global R-squared | GPT F1 Score | GPT Global R-squared |
---|---|---|---|---|---|---|---|---|
behavior | 0.934 | 0.966 | -0.497 | 0.950 | 0.950 | -1.594 | 0.794 | 0.498 |
everyday | 0.511 | 0.677 | -0.733 | 0.674 | 0.674 | 0.083 | 0.791 | 0.509 |
figure_of_speech | 0.028 | 0.054 | -0.055 | 0.182 | 0.182 | 0.084 | 0.402 | 0.105 |
judgment | 0.939 | 0.968 | -0.031 | 0.963 | 0.963 | -0.588 | 0.772 | 0.529 |
opinion | 0.891 | 0.943 | -0.184 | 0.940 | 0.940 | -0.252 | 0.769 | 0.497 |
reasoning | 0.438 | 0.609 | -0.826 | 0.623 | 0.623 | 0.056 | 0.776 | 0.585 |
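A sketch of how these per-feature metrics could be computed with scikit-learn, assuming binary 0/1 label arrays and taking the mean of the training labels as the global $R^2$ baseline (one reading of "average of scores"):

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

def feature_metrics(y_train: np.ndarray, y_test: np.ndarray, y_pred: np.ndarray) -> dict:
    """Jaccard, F1, and global R^2 for one binary feature."""
    baseline = y_train.mean()  # "average of scores" baseline (one reading of it)
    mse_model = np.mean((y_pred - y_test) ** 2)
    mse_baseline = np.mean((baseline - y_test) ** 2)
    return {
        "jaccard": jaccard_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "global_r2": 1 - mse_model / mse_baseline,
    }
```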
How interesting. So none of these is really good enough for everything, though most are OK on some of the features.
One more way we could look at this: for each of these samples, can you balance the data so that the training data have an equal number of each class for each feature?
After taking out a test split, take the smaller group and downsample the larger group to the number of items the smaller group has.
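One way that downsampling could look, as a sketch only (assuming a pandas DataFrame with 0/1 feature columns; this is just one reading of the suggestion):

```python
import pandas as pd

def balance_training_data(train_df: pd.DataFrame, feature: str, seed: int = 42) -> pd.DataFrame:
    """Downsample the majority class so positives and negatives are equal for this feature."""
    pos = train_df[train_df[feature] == 1]
    neg = train_df[train_df[feature] == 0]
    smaller, larger = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    larger = larger.sample(n=len(smaller), random_state=seed)
    # Recombine and shuffle so the classes are interleaved.
    return pd.concat([smaller, larger]).sample(frac=1, random_state=seed)
```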
@joshnguyen99 — can you also put updates here when you have them. Also, we have a project for this https://github.com/orgs/Watts-Lab/projects/27/views/5 and we can start tracking some of our efforts there.
Sorry for the late message—there were some bugs in my training scripts but I managed to fix them.
Below is the performance of the best model for each feature. I held out 10% of the dataset, stratified by the predicted feature.
Feature | Precision | Recall | F1 | AUROC |
---|---|---|---|---|
behavior | 0.768 | 0.833 | 0.799 | 0.725 |
everyday | 0.630 | 1.000 | 0.773 | 0.474 |
figure_of_speech | 0.641 | 0.294 | 0.403 | 0.741 |
judgment | 0.790 | 0.897 | 0.840 | 0.749 |
opinion | 0.635 | 1.000 | 0.777 | 0.564 |
reasoning | 0.619 | 1.000 | 0.765 | 0.608 |
We have a minor improvement in the F1 score for `figure_of_speech` and `reasoning` compared to RF (what model was this again, @amirrr?).
Compared to GPT it's pretty much the same. But I don't think we used the same test set, so it might be worth it to sync up.
I have also added you to the repo for my finetuning scripts (https://github.com/joshnguyen99/commonsense-playground). Also added you to my wandb project to keep track of finetuning if you're interested.
Great, thanks! Can you move that repo into the Watts-Lab org — we like to keep stuff centralized where possible.
Interesting that the simple models seem to be doing best overall still.
I think RF is random forest with embeddings as features.
Why don't we set up common train-test split code, so we can do repeatable splits? I think @amirrr was working on running RF and XGBoost with a balanced training set, which we think will dramatically help on the `figure_of_speech` $F_1$. Can you share those results too, @amirrr?
One more thing: we are doing a lot of LM-related stuff in `commonsense-lm`. Not sure if we want to share that space between our GPT explorations and other models, but I think we ultimately want a single place for it all. We can talk through the logistics of that next week.
@markwhiting — Sure, I can move the code to Watts-Lab! For now, I will commit it to a folder within `commonsense-statements`, next to Amir's training scripts. Let's talk about how LLM-related code can be organized under one big repo when we meet.
OK, I might have found something in @amirrr's code that led to very different results from mine.
In `dimension-prediction.ipynb`, you used this to perform the train-test split:

    for outcome in outcomes:
        X_train = merged_df[merged_df['category'] != 'Society and social sciences'].embeddings
        y_train = merged_df[merged_df['category'] != 'Society and social sciences'][outcome]
        X_test = merged_df[merged_df['category'] == 'Society and social sciences'].embeddings
        y_test = merged_df[merged_df['category'] == 'Society and social sciences'][outcome]
For example, if `outcome == "behavior"`, then the test set is simply every statement in the "Society and social sciences" category. This is not entirely random, and we have very different percentages of positive examples in the training and test sets:
Feature | Train positives | Test positives |
---|---|---|
behavior | 63.4% (2503/3950) | 95.4% (436/457) |
everyday | 63.7% (2518/3950) | 57.1% (261/457) |
figure_of_speech | 20.7% (818/3950) | 7.9% (36/457) |
judgment | 69.9% (2763/3950) | 93.7% (428/457) |
opinion | 60.6% (2394/3950) | 89.9% (411/457) |
reasoning | 63.1% (2494/3950) | 52.7% (241/457) |
In other words, the training and test subsets aren't stratified.
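A sketch of the kind of common, repeatable split this points at: stratify on the feature being predicted and fix the random seed so every run uses the same split (the column names follow the notebook; the 90/10 split and seed are arbitrary choices):

```python
from sklearn.model_selection import train_test_split

def split_for_feature(merged_df, outcome, test_size=0.1, seed=42):
    """Repeatable split for one feature column (e.g. "behavior")."""
    train_df, test_df = train_test_split(
        merged_df,
        test_size=test_size,
        random_state=seed,            # fixed seed, so everyone gets the same split
        stratify=merged_df[outcome],  # keeps positive rates comparable in train and test
    )
    return train_df, test_df
```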
Got it. We are going to fix this by sorting statements into two groups based on the cosine similarity of their embeddings, then matching them on whatever we are trying to model. This should make the training and test groups more balanced, which could improve accuracy. I will share the train and test splits in the same repository.
Results from the balanced-dataset test:
Feature | XGBoost Accuracy | XGBoost F1 | XGBoost R-squared | RF Accuracy | RF F1 | RF R-squared |
---|---|---|---|---|---|---|
behavior | 0.588 | 0.740 | 0.691 | 0.571 | 0.727 | 0.514 |
everyday | 0.522 | 0.686 | 0.656 | 0.499 | 0.665 | 0.426 |
figure_of_speech | 0.295 | 0.455 | -0.053 | 0.304 | 0.466 | -0.456 |
judgment | 0.563 | 0.720 | 0.695 | 0.541 | 0.702 | 0.491 |
opinion | 0.570 | 0.726 | 0.671 | 0.553 | 0.712 | 0.476 |
reasoning | 0.501 | 0.667 | 0.650 | 0.462 | 0.631 | 0.393 |
@amirrr and @markwhiting, I have uploaded the six roberta-large models (for predicting dimensions) to Hugging Face.
They can be found on our lab's HF page: https://huggingface.co/CSSLab
You can try the Inference API on the right-hand side.
Here's the performance of a multi-label classifier. I used the non-chat version of TinyLlama (1.1B params) and fine-tuned it once using the multi-label version of our dataset.
Feature | Precision | Recall | F1 | AUROC |
---|---|---|---|---|
behavior | 0.843 | 0.548 | 0.664 | 0.744 |
everyday | 0.711 | 0.507 | 0.592 | 0.669 |
figure_of_speech | 0.667 | 0.291 | 0.405 | 0.769 |
judgment | 0.774 | 0.533 | 0.631 | 0.594 |
opinion | 0.762 | 0.552 | 0.641 | 0.675 |
reasoning | 0.691 | 0.786 | 0.735 | 0.645 |
Micro | 0.747 | 0.564 | 0.643 | 0.724 |
Macro | 0.741 | 0.536 | 0.611 | 0.683 |
Overall this looks better than the RoBERTa models used before, but not significantly. The perk here is that this is only one multi-label model instead of six binary classifiers.
(I also fine-tuned LLaMA-2 7B for this task, but it actually performs worse than TinyLlama. I suspect it's mostly because the dataset is small relative to the model size, evidenced by the relatively high variance during fine-tuning.)
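For reference, here is a minimal sketch of how a multi-label head on TinyLlama can be set up with Hugging Face `transformers`. The checkpoint name, example statement, and settings are assumptions, not the actual fine-tuning script in the repo:

```python
# Sketch only: checkpoint, statement, and settings are assumptions, not the actual script.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]
CHECKPOINT = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # assumed non-chat base

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers ship without a pad token

model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(FEATURES),
    problem_type="multi_label_classification",  # sigmoid + BCE loss over all six features
)
model.config.pad_token_id = tokenizer.pad_token_id

# One statement in, six independent probabilities out.
batch = tokenizer(["Most people stop at red lights."], return_tensors="pt", padding=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits)
print(dict(zip(FEATURES, probs[0].tolist())))
```

Note that the classification head starts from random weights, so the outputs only become meaningful after fine-tuning on the multi-label version of the dataset.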
Thanks, any proposals to get a better result? I feel like at this point, using something like that for some properties and @amirrr's latest models for others gets us the best overall rate, though it would be great if we could put it all into a single model like the design you have while achieving winning quality on every dimension.
Moving this to the statements repo.
Check how GPT labels statements on our labeling task. Use $\text{Global } R^2 = 1 - \frac{\mathrm{MSE}(\text{prediction},\,\text{actual})}{\mathrm{MSE}(\text{baseline},\,\text{actual})}$ to score, and we can visualize in Observable.
Would be nice to see how we do on each question.