eubinecto / idiom2vec

Idiom2vec: learning vector representation of English idioms with gensim (Word2Vec)

experiment 1: filtering out stop words #1

Closed eubinecto closed 3 years ago

eubinecto commented 4 years ago

The task

Classify context into matching def for the following (very small) dataset:

| context | idiom | def |
| --- | --- | --- |
| see it right there dude and doom you know beat him fair and square raising powerful Anthony but I don't | fair and square | with absolute accuracy |
| yeah you know i mean i mean george beat me fair and square that night but uh there was a lot going on | fair and square | with absolute accuracy |
| just took their guy and just said we've voted him in fair and square and you guys just removed him | fair and square | honestly and straightforwardly |
| dude I'm sure the stuff they're doing up on Capitol Hill is all fair and square hey you know you can just believe it | fair and square | honestly and straightforwardly |

The question I'd like to answer

No clean-up vs. cleaning up relatively meaningless words (e.g. but, just, and, of, the, etc.)

Hypothesis

If you filter out but, just, and, of, the, etc., the model will perform better.
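
A minimal sketch of the "clean up" condition, assuming NLTK's English stop-word list stands in for the "relatively meaningless words" (the choice of library and list is an assumption, not settled yet):

```python
# a sketch of stop-word filtering, assuming NLTK's English stop-word list
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))  # includes and, of, the, but, ...

context = "yeah you know i mean george beat me fair and square that night"
filtered = [t for t in context.split() if t.lower() not in stop_words]
print(filtered)  # ['yeah', 'know', 'mean', 'george', 'beat', 'fair', 'square', 'night']
```

Note that this also strips the and inside fair and square itself, which might matter for idioms.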

Aim of this experiment

To gain familiarity with the pipeline (how to use Word2Vec, tokenisers, etc.).

Variables

Evaluation metric

As the dataset is tiny, I'll just use accuracy for evaluating the model.
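
For concreteness, a minimal sketch of the metric; predict() is hypothetical and stands for whatever classifier a given experimental condition produces:

```python
# a minimal sketch of accuracy over (context, gold definition) pairs, as in the table above;
# predict() is a hypothetical classifier mapping a context to one of the candidate defs
def accuracy(dataset, predict):
    correct = sum(predict(context) == gold for context, gold in dataset)
    return correct / len(dataset)
```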

eubinecto commented 4 years ago

Side notes

Catch whatever thoughts fly over your head and write them down here.

Questions

Which POSs carry relatively little meaning? (e.g. determiners, prepositions?)

...

Am I doing topic modeling?

You are classifying a sentence (context) into a topic (definition).

pragmatics -> how does this relate to the motivation for youtora?

It's what Prof. Goran mentioned in one of the meetings.

WSD -> Do you really want to do this?

Although we know a word can have different senses, that's not really how we understand a language. What we memorise is not the different senses, but an abstract invariant across those senses.

Different senses are different because

What next?

eubinecto commented 4 years ago

Using word2vec

What library should I use?

Somebody asked the same question on Stack Overflow: How to use pre-trained word2vec model?

and one of the answers suggested gensim (as in "generate similarity"; github | landing page):

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

Yup, it seems I've found the right one to use; similarity retrieval is what I want to do.

You can use gensim to build word embeddings with just two lines of code, like so:

```python
# credit: https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
# Word2Vec embedding (gensim < 4.0 API; in gensim >= 4.0, size -> vector_size and iter -> epochs)
from gensim.models import Word2Vec

# tokenized_contents: a list of tokenised sentences, e.g. [["hello", "world"], ...]
embedding_model = Word2Vec(tokenized_contents, size=100, window=2, min_count=50, workers=4, iter=100, sg=1)
```

If you want to use a pre-trained word2vec model, you can instantiate it from a vector file, like so:

```python
# credit: https://stats.stackexchange.com/a/267173
# note: load_word2vec_format lives on KeyedVectors, not on Word2Vec
from gensim.models import KeyedVectors

# if your vector file is in binary format, change to binary=True
model = KeyedVectors.load_word2vec_format('path-to-vectors.txt', binary=False)
sentence = ["London", "is", "the", "capital", "of", "Great", "Britain"]
vectors = [model[w] for w in sentence]
```

What I want to do right now is the latter: just take a pretrained, state-of-the-art word2vec model off the shelf and run tests with it.

As for exp1, I'll try doing this with gensim, but there are obviously other libraries I could use for this as well. The following names just a few of them (in case I need them later).

What is the state-of-the-art pretrained word2vec atm?

There are some options here.

Wikipedia2Vec

"We provide pretrained embeddings for 12 languages in binary and text format."

  • you can use the wikipedia2vec library, but the word vector files are also compatible with gensim
  • performance? couldn't find any reported numbers.

LexVec

"This is an implementation of the LexVec word embedding model (similar to word2vec and GloVe) that achieves state of the art results in multiple NLP tasks, as described in these papers."

fasttext

pre-trained word vectors
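
As a side note, gensim also bundles a downloader that can fetch some well-known pretrained vectors directly; a sketch (the model name comes from gensim's own catalogue, not from the pages above):

```python
# a sketch using gensim's bundled downloader (gensim.downloader)
import gensim.downloader as api

# the classic Google News word2vec model; downloads on first use (~1.6 GB)
wv = api.load('word2vec-google-news-300')
print(wv.most_similar('square', topn=3))
```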

what is a "vector file"? what are its contents? Compatibility?

What "vector file" refers to is KeyedVectors. It is a universal structure/format for containing pre-trained word2vec model.

from gensim's landing page:

"Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module. The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}."

what is "subwords"? when would I need a subwords2vec model?

You'll need a subword-level vector model if you are using a subword tokeniser to tokenise words, e.g. fastText, which builds word vectors from character n-grams.

From the landing page for Wikipedia2Vec:

...The text files are compatible with the text format of Word2vec.
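
For instance, since fastText composes a word's vector from its character n-grams (subwords), it can embed words it never saw in training. A sketch using gensim's loader for Facebook's .bin format (the file path is an assumption):

```python
# a sketch: fastText builds word vectors from character n-grams (subwords),
# so even an out-of-vocabulary word gets a vector
from gensim.models.fasttext import load_facebook_vectors

ft = load_facebook_vectors('cc.en.300.bin')  # path to a pretrained fastText binary (assumed)
print(ft['squareish'])  # works even for an unseen word, via its subword n-grams
```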

eubinecto commented 4 years ago

Measuring similarity between two sentences

How did others do this?

There is a field in NLP called "Word Sense Disambiguation" (WSD), which is similar to what I want to do here. The papers below pertain to this topic.
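
Before digging into the papers, one simple baseline I could try (my own assumption, not taken from the papers): represent each sentence as the average of its word vectors and compare the two averages with cosine similarity:

```python
# a baseline sketch: mean-of-word-vectors + cosine similarity
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('path-to-vectors.txt', binary=False)  # placeholder path

def sentence_vector(tokens):
    # average the vectors of the in-vocabulary tokens
    return np.mean([kv[t] for t in tokens if t in kv], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(sentence_vector(["beat", "him", "fair", "and", "square"]),
             sentence_vector(["with", "absolute", "accuracy"])))
```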

eubinecto commented 4 years ago

Hmm...

You might want to focus on one thing here. If the only reason you are doing this is to do some AI-related stuff, then you'd be better off trying to use AI for search. That seems more effective and fun! What you are doing will then be much more focused. Just focus on one thing. That's what matters.