eubinecto / idiomify

Exploring the Efficacy of Idiomify: How Effective is GPT-3 for Teaching Idioms to EFL Writers?

Chronicles #4

eubinecto opened this issue 2 years ago

eubinecto commented 2 years ago

m-1-x models 🔰 (Seq2Seq with BART)

m-1-x versions are primarily meant as a demonstration, or a pilot, of the tools I'll be building. The 1 means that the architecture does not change from that of a vanilla BART. These models do not regard idioms as a single entity.
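For reference, a minimal sketch of what the vanilla seq2seq setup looks like with Hugging Face transformers. The checkpoint shown is the off-the-shelf facebook/bart-base (not the fine-tuned Idiomify weights), and the example sentence is made up:

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# literal sentence in, (after fine-tuning on PIE) idiomatic sentence out
inputs = tokenizer(["You were too indirect when I interviewed you."], return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```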

m-1-1

The very first baseline of Idiomify. This model is trained on only the first 146 entries of the PIE dataset.

m-1-2

The scaled-up version of the previous model. No significant change has been made; it is just that m-1-2 is now trained on all entries of the PIE dataset (train=0.8). This is also the first version that is deployed to the web via streamlit & huggingface.

m-1-3

(screenshot: you have to search every single word to see where the change is!)

This is rather inconvenient. We need some way of telling the user: "here is the part that has been changed". m-1-3 is a new version for doing exactly that. It is trained on the same dataset as the previous version, but two special tokens are now added before and after idioms: <idiom> & </idiom>

(screenshot: now it looks much better!)
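For reference, a minimal sketch of how the two markers could be registered as special tokens with Hugging Face transformers; the base checkpoint shown is the off-the-shelf facebook/bart-base, not the actual Idiomify weights:

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# register <idiom> and </idiom> so they are never split into subwords
tokenizer.add_special_tokens({"additional_special_tokens": ["<idiom>", "</idiom>"]})
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table to match
```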

m-1-4 - just don't split the sentences

Why am I going back to using BART? This may not be absolutely terrible yet.

m-1-5 - just don't include the special tokens and treat this as a simple seq2seq problem

Why is only one idiom suggested? Could it be because of the special tokens?

m-2-x models 🏷 (NER with BERT)

m-1-x models demonstrated some potential, but not without problems. Some of these problems stem from the nature of a seq2seq approach to Idiomify.

  1. The model occasionally distorts the input sentence. We hope that the model will learn to "copy", and it indeed does, but we can never be entirely sure of this with a seq2seq approach. With an NER approach, you can be entirely sure that the source sentence is preserved, because we label the sentence rather than transform it.
  2. Normalising variations of idioms into their lemma is like fitting a square peg into a round hole. Yes, you could output something like You were <idiom> beat around the bush </idiom> when I first interviewed you last time, where beating around the bush would be the correct form of the idiom in context. But if recommending the normalised form is what you want to do at the end of the day, then the task is more of an NER task than a seq2seq task, where each idiom is a named entity.

So, what could be better is an NER system rather than a translation system. Granted, it does not explicitly "idiomify" sentences, but it can recommend which idioms to use for which parts of the sentence. I'm not sure if this will turn out to perform better than seq2seq, but one thing we can guarantee for sure is that NER won't distort the source sentence.

m-2-1

This is the first version of m-2-x models. As for the labels, we just follow the IOB convention.
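Roughly, the labelling would look like the sketch below; the sentence, tokenisation, and tag names are made up for illustration:

```python
# B- marks the first token of an idiom span, I- marks tokens inside it, O marks everything else
tokens = ["You", "were", "beating", "around", "the", "bush", "when", "I", "interviewed", "you"]
labels = ["O", "O", "B-IDIOM", "I-IDIOM", "I-IDIOM", "I-IDIOM", "O", "O", "O", "O"]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```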

v3.0 Idiomify with GPT-3

TL;DR - use GPT-3 rather than BERT.

But why a sudden switch from NER with BERT to seq2seq with GPT-3? There are two reasons:

First, the few-shot performance of GPT-3 is surprisingly better than I thought. Just have a look at the example below.

(screenshot: an example of few-shot Idiomify. The proof is in the pudding!)

Woah, and that is a result I got with only a handful of carefully curated examples, which is perfectly doable within a few hours. Yes, GPT-3 is expensive, and I would never use it if I were in industry. The RoI of a GPT-3 based application would be stupidly low unless you charged customers 100 dollars a month. But hey, I'm an academic; all I need is for it to work on a few dozen personal statements. It's okay to stop being an NLP engineer for a few months and just embrace the world of prompt engineering, especially if the performance gain is huge enough to justify the price.
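To make it concrete, here is a rough sketch of what such a few-shot prompt could look like; the in-context examples, model choice, and decoding parameters are my own placeholders (the legacy openai Completion API is assumed), not the exact ones from the screenshot above:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_PROMPT = """Rewrite each sentence so that it uses an idiom.

Literal: You were too indirect when I interviewed you.
Idiomatic: You were beating around the bush when I interviewed you.

Literal: He finally revealed the secret.
Idiomatic: He finally let the cat out of the bag.

Literal: {sentence}
Idiomatic:"""

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=FEW_SHOT_PROMPT.format(sentence="She started the project without any preparation."),
    max_tokens=60,
    temperature=0.7,
    stop="\n",
)
print(response["choices"][0]["text"].strip())
```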

But then what would be the point of your research, you may ask. Surely, merely presenting a use case of GPT-3 is by no means research in the field of NLP; that is just another interesting NLP project, because it does not improve the inductive bias of anything. Then what justifies my switch to GPT-3? Technically, I am not an NLP researcher; I'm an SLA researcher. That is, the aim of my research should be (and frankly, should have been) coming up with and justifying better ways of teaching a second language to EFL learners.

And that is the second reason for the switch to GPT-3. The top priority of my research should not be designing a better inductive bias. Rather, I should just use the best tools out there to build the feedback system as soon as possible, and focus on asking the right questions and answering them with scientific methods.

So, here are the two reasons, re-iterated:

  1. The Idiomify performance of GPT-3 is better than I expected
  2. Suggesting a better inductive bias is not my top priority

And so it begins, the world of prompt engineering.

to-do's

v3.0.1 - prompt design with a password check

The fine-tuning approach does not seem to work very well, for reasons I don't yet understand. But I must come up with a complete version by this Friday, so I should have a back-up plan.

This version is a minor upgrade from v3.0: I keep the prompt design of v3.0 but add the password check from v3.1. I'm doing this just in case I end up going back to this prompt design for my research.
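A minimal sketch of the password check, assuming a Streamlit front-end with the master key stored in st.secrets (the key name "PASSWORD" is hypothetical):

```python
import streamlit as st

password = st.text_input("Enter the password", type="password")
if password != st.secrets["PASSWORD"]:
    st.error("Wrong password.")
    st.stop()  # block the rest of the app until the correct password is entered

st.success("Welcome! You can now use Idiomify.")
```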

v3.0.2 - pay-your-own-request version

Rather than allowing access to only those who know the master key, it is better to open the web app to anyone but ask them to register their own API key. Since OpenAI gives away 30 dollars' worth of API credit, that should cover enough requests as far as my research participants are concerned.
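A minimal sketch of the pay-your-own-request idea, again assuming Streamlit and the legacy openai Completion API; the prompt and model here are placeholders:

```python
import openai
import streamlit as st

api_key = st.text_input("Paste your OpenAI API key", type="password")
sentence = st.text_area("Paste a sentence to idiomify")

if st.button("Idiomify") and api_key and sentence:
    openai.api_key = api_key  # every request is billed to the participant's own account
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=f"Rewrite the sentence so that it uses an idiom:\n{sentence}\nRewritten:",
        max_tokens=100,
        temperature=0.7,
    )
    st.write(response["choices"][0]["text"].strip())
```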

v3.0.3 - preparing for automating the research

v3.1 fine-tune Davinci with more quality examples

Approaching this with few-shot learning is not sustainable, as the API calls are just too expensive. I must fine-tune a model to build this successfully.

They say to aim for up to 500 examples.
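A minimal sketch of preparing the fine-tuning data in the prompt/completion JSONL format that the legacy fine-tuning endpoint expects; the example pair, file name, and separators are made up:

```python
import json

examples = [
    {
        "prompt": "Idiomify: You were too indirect when I interviewed you. ->",
        "completion": " You were beating around the bush when I interviewed you. END",
    },
    # ... aim for up to 500 curated pairs like this
]

with open("idiomify_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# then kick off the job with the (legacy) OpenAI CLI:
# openai api fine_tunes.create -t idiomify_train.jsonl -m davinci
```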

sometime in the next version

you might want to evaluate your fine-tuned model with an extrinsic measure

eubinecto commented 2 years ago

Changing the name of this issue from Progress to Chronicles. Mostly inspired by Meta's logbook for OPT - https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles