eubinecto / idiomify

Exploring the Efficacy of Idiomify: How Effective is GPT-3 for Teaching Idioms to EFL Writers?

Chronicles #4

eubinecto opened this issue 2 years ago

eubinecto commented 2 years ago

m-1-x models 🔰 (Seq2Seq with BART)

m-1-x versions are primarily meant as a demonstration, or a pilot, of the tools I'll be building. The 1 means that the architecture does not change from that of a vanilla BART. These models do not regard idioms as a single entity.
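For reference, a minimal sketch of what the vanilla seq2seq setup looks like with Hugging Face transformers. The checkpoint shown is the off-the-shelf facebook/bart-base (not the fine-tuned Idiomify weights), and the example sentence is made up:

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# literal sentence in, (after fine-tuning on PIE) idiomatic sentence out
inputs = tokenizer(["You were too indirect when I interviewed you."], return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```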

m-1-1

The very first baseline of Idiomify. This model is trained on only the first 146 entries of the PIE dataset.

m-1-2

The scaled-up version of the previous model. No significant change has been made; it is just that m-1-2 is now trained on all entries of the PIE dataset (train=0.8). This is also the first version that is deployed to the web via streamlit & huggingface.

m-1-3

(screenshot: you have to search every single word to see where the change is!)

This is rather inconvenient. We need some way of telling the user: "here is the part that has been changed". m-1-3 is a new version for doing exactly that. It is trained on the same dataset as the previous version, but two special tokens are now added before and after idioms: <idiom> & </idiom>

(screenshot: now it looks much better!)
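For reference, a minimal sketch of how the two markers could be registered as special tokens with Hugging Face transformers; the base checkpoint shown is the off-the-shelf facebook/bart-base, not the actual Idiomify weights:

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# register <idiom> and </idiom> so they are never split into subwords
tokenizer.add_special_tokens({"additional_special_tokens": ["<idiom>", "</idiom>"]})
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table to match
```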

m-1-4 - just don't split the sentences

Why am I going back to using BART? This may not be absolutely terrible yet.

m-1-5 - just don't include the special tokens and treat this as a simple seq2seq problem

Why is only one idiom suggested? Could it be because of the special tokens?

m-2-x models 🏷 (NER with BERT)

m-1-x models demonstrated some potential, but not without problems. Some of these problems stem from the nature of a seq2seq approach to Idiomify.

  1. The model occasionally distorts the input sentence. We hope that the model will learn to "copy", and it indeed does, but we can never be entirely sure of this with a seq2seq approach. With an NER approach, you can be entirely sure that the source sentence is preserved, because we label the sentence rather than transform it.
  2. Normalising variations of idioms into their lemma is like fitting a square peg into a round hole. Yes, you could output something like You were <idiom> beat around the bush </idiom> when I first interviewed you last time, where beating around the bush would be the correct form of the idiom in context. But if recommending the normalised form is what you want to do at the end of the day, then the task is more of an NER task than a seq2seq task, where each idiom is a named entity.

So, what could be better is an NER system rather than a translation system. Granted, it does not explicitly "idiomify" sentences, but it can recommend which idioms to use for which parts of the sentence. I'm not sure if this will turn out to perform better than seq2seq, but one thing we can guarantee for sure is that NER won't distort the source sentence.

m-2-1

This is the first version of m-2-x models. As for the labels, we just follow the IOB convention.
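Roughly, the labelling would look like the sketch below; the sentence, tokenisation, and tag names are made up for illustration:

```python
# B- marks the first token of an idiom span, I- marks tokens inside it, O marks everything else
tokens = ["You", "were", "beating", "around", "the", "bush", "when", "I", "interviewed", "you"]
labels = ["O", "O", "B-IDIOM", "I-IDIOM", "I-IDIOM", "I-IDIOM", "O", "O", "O", "O"]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```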

v3.0 Idiomify with GPT-3

TL;DR - use GPT-3 rather than BERT.

But why a sudden switch from NER with BERT to seq2seq with GPT-3? There are two reasons:

First, the few-shot performance of GPT-3 is surprisingly better than I thought. Just have a look at the example below.

(screenshot: an example of few-shot Idiomify. The proof is in the pudding!)

Woah, and that is a result I got with only a handful of carefully curated examples, which is perfectly doable within a few hours. Yes, GPT-3 is expensive, and I would never use it if I were in industry. The RoI of a GPT-3 based application would be stupidly low unless you charged customers 100 dollars a month. But hey, I'm an academic; all I need is for it to work on a few dozen personal statements. It's okay to stop being an NLP engineer for a few months and just embrace the world of prompt engineering, especially if the performance gain is huge enough to justify the price.
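To make it concrete, here is a rough sketch of what such a few-shot prompt could look like; the in-context examples, model choice, and decoding parameters are my own placeholders (the legacy openai Completion API is assumed), not the exact ones from the screenshot above:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_PROMPT = """Rewrite each sentence so that it uses an idiom.

Literal: You were too indirect when I interviewed you.
Idiomatic: You were beating around the bush when I interviewed you.

Literal: He finally revealed the secret.
Idiomatic: He finally let the cat out of the bag.

Literal: {sentence}
Idiomatic:"""

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=FEW_SHOT_PROMPT.format(sentence="She started the project without any preparation."),
    max_tokens=60,
    temperature=0.7,
    stop="\n",
)
print(response["choices"][0]["text"].strip())
```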

But then what would be the point of your research, you may ask. Surely, merely presenting a use case of GPT-3 is by no means research in the field of NLP; that is just another interesting NLP project, because it does not improve the inductive bias of anything. Then what justifies my switch to GPT-3? Technically, I am not an NLP researcher; I'm an SLA researcher. That is, the aim of my research should be (and frankly, should have been) coming up with and justifying better ways of teaching a second language to EFL learners.

And that is the second reason for the switch to GPT-3. The top priority of my research should not be designing a better inductive bias. Rather, I should just use the best tools out there to build the feedback system as soon as possible, and focus on asking the right questions and answering them with scientific methods.

So, here are the two reasons, re-iterated:

  1. The Idiomify performance of GPT-3 is better than I expected
  2. Suggesting a better inductive bias is not my top priority

And so it begins, the world of prompt engineering.

to-do's

v3.0.1 - prompt design with a password check

The fine-tuning approach does not seem to work very well, for reasons I don't yet understand. But I must come up with a complete version by this Friday, so I should have a back-up plan.

This version is a minor upgrade from v3.0: I keep the prompt design of v3.0 but add the password check from v3.1. I'm doing this just in case I end up going back to this prompt design for my research.
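A minimal sketch of the password check, assuming a Streamlit front-end with the master key stored in st.secrets (the key name "PASSWORD" is hypothetical):

```python
import streamlit as st

password = st.text_input("Enter the password", type="password")
if password != st.secrets["PASSWORD"]:
    st.error("Wrong password.")
    st.stop()  # block the rest of the app until the correct password is entered

st.success("Welcome! You can now use Idiomify.")
```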

v3.0.2 - pay-your-own-request version

Rather than allowing access to only those who know the master key, it is better to open the web app to anyone but ask them to register their own API key. Since OpenAI gives away 30 dollars' worth of API credit, that should cover enough requests as far as my research participants are concerned.
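A minimal sketch of the pay-your-own-request idea, again assuming Streamlit and the legacy openai Completion API; the prompt and model here are placeholders:

```python
import openai
import streamlit as st

api_key = st.text_input("Paste your OpenAI API key", type="password")
sentence = st.text_area("Paste a sentence to idiomify")

if st.button("Idiomify") and api_key and sentence:
    openai.api_key = api_key  # every request is billed to the participant's own account
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=f"Rewrite the sentence so that it uses an idiom:\n{sentence}\nRewritten:",
        max_tokens=100,
        temperature=0.7,
    )
    st.write(response["choices"][0]["text"].strip())
```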

v3.0.3 - preparing for automating the research

v3.1 fine-tune Davinci with more quality examples

Approaching this with few-shot learning is not sustainable, as the API calls are just too expensive. I must fine-tune a model to build this successfully.

They say to aim for up to 500 examples.
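A minimal sketch of preparing the fine-tuning data in the prompt/completion JSONL format that the legacy fine-tuning endpoint expects; the example pair, file name, and separators are made up:

```python
import json

examples = [
    {
        "prompt": "Idiomify: You were too indirect when I interviewed you. ->",
        "completion": " You were beating around the bush when I interviewed you. END",
    },
    # ... aim for up to 500 curated pairs like this
]

with open("idiomify_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# then kick off the job with the (legacy) OpenAI CLI:
# openai api fine_tunes.create -t idiomify_train.jsonl -m davinci
```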

sometime in the next version

you might want to evaluate your fine-tuned model with an extrinsic measure

eubinecto commented 2 years ago

Changing the name of this issue from Progress to Chronicles. Mostly inspired by Meta's logbook for OPT - https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles