eubinecto / idiom2vec

Idiom2vec: learning vector representation of English idioms with gensim (Word2Vec)

experiment 1: filtering out stop words #1

Closed eubinecto closed 3 years ago

eubinecto commented 4 years ago

The task

Classify context into matching def for the following (very small) dataset:

| context | idiom | def |
| --- | --- | --- |
| see it right there dude and doom you know beat him fair and square raising powerful Anthony but I don't | fair and square | with absolute accuracy |
| yeah you know i mean i mean george beat me fair and square that night but uh there was a lot going on | fair and square | with absolute accuracy |
| just took their guy and just said we've voted him in fair and square and you guys just removed him | fair and square | honestly and straightforwardly |
| dude I'm sure the stuff they're doing up on Capitol Hill is all fair and square hey you know you can just believe it | fair and square | honestly and straightforwardly |

The question I'd like to answer

No clean-up vs. cleaning up relatively meaningless words (e.g. but, just, and, of, the, etc.)

Hypothesis

If you filter out but, just, and, of, the, etc., the model will perform better.
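
A minimal sketch of the "clean up" condition, assuming NLTK's English stop-word list stands in for the "relatively meaningless words" (the choice of library and list is an assumption, not settled yet):

```python
# a sketch of stop-word filtering, assuming NLTK's English stop-word list
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))  # includes and, of, the, but, ...

context = "yeah you know i mean george beat me fair and square that night"
filtered = [t for t in context.split() if t.lower() not in stop_words]
print(filtered)  # ['yeah', 'know', 'mean', 'george', 'beat', 'fair', 'square', 'night']
```

Note that this also strips the and inside fair and square itself, which might matter for idioms.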

Aim of this experiment

To gain familiarity with the pipeline (how to use Word2Vec, tokenisers, etc.).

Variables

Evaluation metric

As the dataset is tiny, I'll just use accuracy for evaluating the model.
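
For concreteness, a minimal sketch of the metric; predict() is hypothetical and stands for whatever classifier a given experimental condition produces:

```python
# a minimal sketch of accuracy over (context, gold definition) pairs, as in the table above;
# predict() is a hypothetical classifier mapping a context to one of the candidate defs
def accuracy(dataset, predict):
    correct = sum(predict(context) == gold for context, gold in dataset)
    return correct / len(dataset)
```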

eubinecto commented 4 years ago

Side notes

Catch whatever thoughts fly over your head and write them down here.

Questions

Which POSs carry relatively little meaning? (e.g. determiners, prepositions?)

...

Am I doing topic modeling?

You are classifying a sentence (context) into a topic (definition).

pragmatics -> how does this relate to the motivation for youtora?

It's what Prof. Goran mentioned in one of the meetings.

WSD -> Do you really want to do this?

Although we know a word can have different senses, that's not really how we understand a language. What we memorise is not the different senses, but an abstract invariant across those senses.

Different senses are different because

What next?

eubinecto commented 4 years ago

Using word2vec

What library should I use?

Somebody asked the same question on Stack Overflow: How to use pre-trained word2vec model?

and one of the answers suggested gensim (as in "generate similarity"; github | landing page):

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

Yup, it seems I've found the right one to use; similarity retrieval is what I want to do.

You can use gensim to build word embeddings with just two lines of code, like so:

```python
# credit: https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
# Word2Vec embedding (gensim < 4.0 API; in gensim >= 4.0, size -> vector_size and iter -> epochs)
from gensim.models import Word2Vec

# tokenized_contents: a list of tokenised sentences, e.g. [["hello", "world"], ...]
embedding_model = Word2Vec(tokenized_contents, size=100, window=2, min_count=50, workers=4, iter=100, sg=1)
```

If you want to use a pre-trained word2vec model, you can instantiate it from a vector file, like so:

```python
# credit: https://stats.stackexchange.com/a/267173
# note: load_word2vec_format lives on KeyedVectors, not on Word2Vec
from gensim.models import KeyedVectors

# if your vector file is in binary format, change to binary=True
model = KeyedVectors.load_word2vec_format('path-to-vectors.txt', binary=False)
sentence = ["London", "is", "the", "capital", "of", "Great", "Britain"]
vectors = [model[w] for w in sentence]
```

What I want to do right now is the latter: just take a pretrained, state-of-the-art word2vec model off the shelf and run tests with it.

As for exp1, I'll try doing this with gensim, but there are obviously other libraries I could use for this as well. The following names just a few of them (in case I need them later).

What is the state-of-the-art pretrained word2vec atm?

There are some options here.

Wikipedia2Vec

"We provide pretrained embeddings for 12 languages in binary and text format."

  • you can use the wikipedia2vec library, but the word vector files are also compatible with gensim
  • performance? couldn't find any reported numbers.

LexVec

"This is an implementation of the LexVec word embedding model (similar to word2vec and GloVe) that achieves state of the art results in multiple NLP tasks, as described in these papers."

fasttext

pre-trained word vectors
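
As a side note, gensim also bundles a downloader that can fetch some well-known pretrained vectors directly; a sketch (the model name comes from gensim's own catalogue, not from the pages above):

```python
# a sketch using gensim's bundled downloader (gensim.downloader)
import gensim.downloader as api

# the classic Google News word2vec model; downloads on first use (~1.6 GB)
wv = api.load('word2vec-google-news-300')
print(wv.most_similar('square', topn=3))
```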

what is a "vector file"? what are its contents? Compatibility?

What "vector file" refers to is KeyedVectors. It is a universal structure/format for containing pre-trained word2vec model.

from gensim's landing page:

"Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module. The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}."

what is "subwords"? when would I need a subwords2vec model?

You'll need a subword-level vector model if you are using a subword tokeniser to tokenise words, e.g. fastText, which builds word vectors from character n-grams.

From the landing page for Wikipedia2Vec:

...The text files are compatible with the text format of Word2vec.
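
For instance, since fastText composes a word's vector from its character n-grams (subwords), it can embed words it never saw in training. A sketch using gensim's loader for Facebook's .bin format (the file path is an assumption):

```python
# a sketch: fastText builds word vectors from character n-grams (subwords),
# so even an out-of-vocabulary word gets a vector
from gensim.models.fasttext import load_facebook_vectors

ft = load_facebook_vectors('cc.en.300.bin')  # path to a pretrained fastText binary (assumed)
print(ft['squareish'])  # works even for an unseen word, via its subword n-grams
```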

eubinecto commented 4 years ago

Measuring similarity between two sentences

How did others do this?

There is a field in NLP called "Word Sense Disambiguation" (WSD), which is similar to what I want to do here. The papers below pertain to this topic.
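
Before digging into the papers, one simple baseline I could try (my own assumption, not taken from the papers): represent each sentence as the average of its word vectors and compare the two averages with cosine similarity:

```python
# a baseline sketch: mean-of-word-vectors + cosine similarity
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('path-to-vectors.txt', binary=False)  # placeholder path

def sentence_vector(tokens):
    # average the vectors of the in-vocabulary tokens
    return np.mean([kv[t] for t in tokens if t in kv], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(sentence_vector(["beat", "him", "fair", "and", "square"]),
             sentence_vector(["with", "absolute", "accuracy"])))
```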

eubinecto commented 4 years ago

Hmm...

You might want to focus on one thing here. If the only reason you are doing this is to do some AI-related stuff, then you'd be better off trying to use AI for search. That seems more effective and fun! What you are doing will then be much more focused. Just focus on one thing. That's what matters.