Computational-Content-Analysis-2020 / Readings-Responses-Spring

Repository for organizing orienting, exemplary, and fundamental readings, and posting responses.

Exploring Semantic Spaces - Fundamentals #31

Open HyunkuKwon opened 4 years ago

HyunkuKwon commented 4 years ago

Post questions here for one or more of our fundamentals readings:

Jurafsky, Daniel and James H. Martin. 2015. Speech and Language Processing. Chapters 15-16 (“Vector Semantics”, “Semantics with Dense Vectors”)

wanitchayap commented 4 years ago

Chapter 15 1) I am glad the reading brings up second-order co-occurrence, since it is a problem I was concerned about while reading about word embeddings. For example, words that are perfect synonyms would rarely occur together in texts, but we still want them to be very close in the embedding space. However, the reading doesn't really give much detail about second-order co-occurrence, and I would like to clarify how these models actually capture it.

2) The reading mentions that we can either throw the context matrix from word2vec away or average/concatenate it with the target matrix. However, the reading doesn't elaborate further. What are the advantages and disadvantages of each choice?
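On point 2, here is a minimal sketch of the three options, assuming gensim >= 4.0 with negative sampling (gensim keeps the target matrix in `model.wv.vectors` and the context matrix in `model.syn1neg`; the corpus below is a made-up toy example):

```python
# Hedged sketch: keep only W, average W and C, or concatenate W and C.
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50, seed=1)

def embedding(word, how="target"):
    idx = model.wv.key_to_index[word]
    w = model.wv.vectors[idx]      # target ("input") vector
    c = model.syn1neg[idx]         # context ("output") vector (negative sampling only)
    if how == "target":
        return w                   # most common choice: throw C away
    if how == "average":
        return (w + c) / 2         # same dimensionality, sometimes slightly better
    return np.concatenate([w, c])  # "concatenate": doubles the dimensionality

print(embedding("cat", "average").shape)  # (50,)
print(embedding("cat", "concat").shape)   # (100,)
```

Averaging keeps the dimensionality fixed, while concatenation keeps the two roles (word as target vs. word as context) separate at the cost of doubling the vector size.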

ihsiehchi commented 4 years ago

Dependency parsing and word embeddings: If I understand correctly, "local context" is defined using the collection of 5-grams, that is, sequences of five words that appear frequently enough. I wonder whether dependency parsing can improve word embeddings, since we may have many long sentences in which the subject and the (in)direct objects are far apart.

Can we have trans-sentence dependency parsing? This may be useful when we are interested in transitivity in text: if A does something to B in one sentence and B does something to C in another, it would be useful to retain the information that A indirectly affects C. I imagine this could be useful in a history project when one wants to study the factors that indirectly caused, say, the French Revolution.
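On the first point, dependency-based contexts have in fact been used in place of linear windows (Levy and Goldberg 2014). A rough sketch of extracting such contexts, assuming spaCy with the en_core_web_sm model installed (the example sentence is invented):

```python
# Hedged sketch: syntactic (dependency) contexts instead of a linear word window.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee that met yesterday finally approved the controversial budget.")

pairs = []
for token in doc:
    if token.dep_ == "ROOT":
        continue
    # Each word's context is its syntactic head plus the relation label,
    # so the subject and the verb are paired even when they are far apart linearly.
    pairs.append((token.text, f"{token.head.text}/{token.dep_}"))

for word, context in pairs:
    print(word, "->", context)
```

Training word2vec on these (word, syntactic-context) pairs instead of window co-occurrences is exactly the kind of "dependency-improved" embedding the question points toward.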

timqzhang commented 4 years ago

For "Dependency Parsing"

This chapter introduces an interesting parsing algorithm called "Graph-Based Dependency Parsing" (p. 17), which encodes the search space as directed graphs so that graph-theoretic methods can be applied. It makes me wonder whether we could also apply vector space methods to parsing, that is, increase the dimensions of the graph-based representation and parse based on vectors. This may be difficult, since in addition to what we do when constructing the semantic space, we would also have to consider the POS of words in sentences, so I'm not sure whether it is feasible.

For "Vector Semantics and Embeddings"

My question is about the section "Embeddings and Historical Semantics", specifically the visualization in Figure 6.14 (p. 24). It is mentioned that:

The modern sense of each word, and the grey context words, are computed from the most recent (modern) time-point embedding space. Earlier points are computed from earlier historical embedding spaces.

Since the figure puts target words from different time periods into one visualization, I wonder how to project a word from a historical semantic space into the current semantic space, especially without changing the relative positions of the grey context words. It would make sense if the figure were stitched together from separate figures for different periods, but that does not seem to be the case based on the description quoted above.
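For reference, the usual trick for this kind of figure (e.g., Hamilton et al. 2016) is to align each historical embedding space to the modern one with an orthogonal Procrustes rotation, which preserves distances within each space before everything is projected into two dimensions. A minimal sketch, assuming the two matrices have already been restricted to their shared vocabulary (the data here are random stand-ins):

```python
# Hedged sketch: aligning a historical embedding space to the modern one.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
modern_vecs = rng.normal(size=(1000, 100))      # stand-in for the modern space
historical_vecs = rng.normal(size=(1000, 100))  # stand-in, same row order (shared vocabulary)

# Find the rotation R that best maps the historical space onto the modern one.
R, _ = orthogonal_procrustes(historical_vecs, modern_vecs)
aligned_historical = historical_vecs @ R

# After alignment, a word's historical vectors can be plotted (e.g., via PCA/t-SNE)
# in the same coordinate system as the modern vectors and the grey context words.
```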

Also, should we read Chapter 6, "Vector Semantics and Embeddings", instead of Chapters 15 and 16? Chapter 15 is about dependency parsing, and Chapter 16 about logical representations of sentence meaning, which seem unrelated to this week's topic.

nwrim commented 4 years ago

I think word2vec is awesome and really look forward to trying it out in my final project.

I was wondering what could be a good way to deal with a large set of short documents (perhaps tweets, or even shorter than that). For example, you can have a lot of documents that are only 4~5 words long, so that most of the words do not have enough context words to fill their "window". In addition, what happens to words at the very beginning or end of a document? Do they just use fewer context words?
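For what it's worth, here is a small sketch of how skip-gram (target, context) pairs are typically generated; at the start and end of a document the window is simply truncated, so those words contribute fewer pairs (toy example, not gensim's actual implementation):

```python
# Hedged sketch: context windows shrink at document boundaries.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)               # truncated at the left edge
        end = min(len(tokens), i + window + 1)   # truncated at the right edge
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tweet = ["great", "coffee", "this", "morning"]   # a 4-word "document"
for pair in skipgram_pairs(tweet, window=2):
    print(pair)
# "great" gets only 2 context words, "coffee" gets 3, and so on.
```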

Yilun0221 commented 4 years ago

It is surprising that neural networks can be applied to NLP as well! My question is about the hidden layers. How should we choose the hidden layers? And how should we interpret these hidden layers in an NLP model, or relate them back to the text?

WMhYang commented 4 years ago

I also read Chapter 6 as mentioned by @timqzhang.

My first question is similar to @wanitchayap's. In my coding practice, I did not really see the difference between first-order co-occurrence and second-order co-occurrence. I wonder how we could use word2vec to capture this distinction.
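A tiny illustration of the distinction, using an invented co-occurrence matrix: "car" and "automobile" never co-occur (no first-order co-occurrence), but they share the same neighbors, so their count vectors are nearly identical (second-order co-occurrence), which is what word2vec-style similarity picks up:

```python
# Hedged sketch: first-order vs second-order co-occurrence with toy counts.
import numpy as np

vocab = ["car", "automobile", "drive", "engine"]
counts = np.array([
    [0,  0, 10, 8],   # "car"        never co-occurs with "automobile"...
    [0,  0,  9, 7],   # "automobile" ...but shares its neighbors
    [10, 9,  0, 2],   # "drive"
    [8,  7,  2, 0],   # "engine"
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(counts[0], counts[1]))  # close to 1: strong second-order similarity
print(counts[0, 1])                  # 0: no first-order co-occurrence
```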

My second question concerns Figure 6.14 on page 117. Since my project may use this method to track how words change over time, I am wondering how we could build separate embedding spaces with different models and then combine them into one figure for visualization.

harryx113 commented 4 years ago

Tuning the settings for gradient descent or any other algorithm is often a trial-and-error process for newcomers to neural networks. What rules of thumb do senior engineers follow to adjust the hyper-parameters?
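There is no single rule; one common practice is a systematic grid or random search over a few candidate values, keeping whatever does best on held-out data. A hedged sketch with scikit-learn (the data and parameter grid are made up for illustration):

```python
# Hedged sketch: grid search over hyper-parameters instead of pure trial and error.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "learning_rate_init": [1e-2, 1e-3],
    "alpha": [1e-4, 1e-2],   # L2 regularization strength
}

search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```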

linghui-wu commented 4 years ago

I would like to thank @WMhYang for mentioning the methodology in Chapter 6 for investigating the dynamic change of word usage, which gives me an approach other than the dynamic topic modeling we learned last week.

For this week’s reading, I noticed that it is necessary to generate training data for transition-based dependency parsing. I wonder what the “appropriate” size of the training set is in order to obtain a reliable model. Would this algorithm be robust if we are not able to provide enough training data?

tianyueniu commented 4 years ago

Related to @harryx113's question: Chapter 7 mentions that we use an optimization algorithm like gradient descent to train our neural network. In a previous homework for another class, I basically tried different optimization algorithms and chose the one that gave me the most 'appropriate' result. I know this is probably not the ideal way to train a model. What factors should we consider when picking the best optimization algorithm for a neural network?

Lesopil commented 4 years ago

Chapter 7 discussed neural networks and deep learning specifically. From what I understand, we have little insight into how the hidden layers of a neural network interact with one another to produce the result; we basically only see the input and the output. I am wondering a bit more about those layers and exactly how they interact to produce accurate predictions.

minminfly68 commented 4 years ago

It is interesting to learn more about word2vec through this approach. I wonder whether we can use this model for topic extraction and for training a classifier alongside other features. Also, what are the similarities and differences between this representation and the hierarchical clustering mentioned in the reading?
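One common way to do the classifier part is to average a document's word vectors and feed them, possibly alongside other features, into any standard classifier. A rough sketch, assuming a trained gensim model named `w2v` and tokenized documents `docs` with `labels` (those names are placeholders, not from the reading):

```python
# Hedged sketch: averaged word vectors as document features for a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, w2v):
    # Average the vectors of in-vocabulary tokens; zero vector if none match.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# docs: list of token lists, labels: list of classes (assumed to exist)
# X = np.vstack([doc_vector(d, w2v) for d in docs])
# X = np.hstack([X, other_features])   # optionally append other features
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
```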

jsgenan commented 4 years ago

Is there a rule of thumb for choosing between word2vec's CBOW and skip-gram architectures? For example, do some kinds of corpora suit skip-gram better? Also, can we talk more about the details of de-biasing? It is mentioned in both readings, but many details are missing.
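On the first question, the usual heuristic is that skip-gram tends to do better on smaller corpora and for rare words, while CBOW is faster and works well on large corpora. In gensim the choice is a single flag (sketch with a toy corpus):

```python
# Hedged sketch: CBOW vs skip-gram is just the `sg` flag in gensim.
from gensim.models import Word2Vec

sentences = [["word", "embeddings", "are", "fun"],
             ["skip", "gram", "handles", "rare", "words", "well"]]

cbow_model = Word2Vec(sentences, sg=0, vector_size=50, min_count=1)      # CBOW
skipgram_model = Word2Vec(sentences, sg=1, vector_size=50, min_count=1)  # skip-gram
```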

pdiazm commented 4 years ago

In the GloVe word embedding model, focusing on word co-occurrence probabilities might lead to bias. For example, if the corpus of analysis were feminist literature, the algorithm might still find patterns of gender bias by analyzing co-occurrence. In this case, however, the context matters, since these co-occurrences likely come from a sustained and comprehensive critique of the bias itself.

Is there a way to modify the algorithm to prevent or correct for this bias?
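One family of corrections (e.g., Bolukbasi et al. 2016's "hard de-biasing") post-processes the trained vectors by removing their component along an estimated bias direction, though whether that is appropriate for a corpus that is itself critiquing the bias is exactly the judgment call raised above. A minimal sketch with made-up vectors:

```python
# Hedged sketch: removing the component of a vector along a bias direction.
import numpy as np

def debias(vec, bias_direction):
    b = bias_direction / np.linalg.norm(bias_direction)
    return vec - (vec @ b) * b   # subtract the projection onto the bias axis

# Toy example: the bias direction is often estimated as, e.g., vec("he") - vec("she").
rng = np.random.default_rng(0)
he, she, doctor = rng.normal(size=(3, 100))
gender_direction = he - she
doctor_debiased = debias(doctor, gender_direction)

# The de-biased vector is now (numerically) orthogonal to the bias direction.
print(doctor_debiased @ (gender_direction / np.linalg.norm(gender_direction)))  # ~0
```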

lyl010 commented 4 years ago

Word embeddings are a practical tool that is not hard to train and captures rich meaning. As a response to @nwrim's post: training word embeddings on short texts can be tricky because the results can be unstable, and I think initializing with pre-trained vectors such as the Google News embeddings could help your training converge quickly and also make it easier to compare changes in the embeddings.

My question is also about the stability of embeddings: is there a common validation method used in empirical studies that helps us establish that an embedding actually makes sense?
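On both points, a small sketch of what is often done in practice: load pre-trained vectors via gensim's downloader and validate them intrinsically on the word-similarity and analogy benchmarks that ship with gensim (the model name below is one of the downloader's identifiers; fetching it is a one-off download):

```python
# Hedged sketch: pre-trained vectors plus standard intrinsic validation in gensim.
import gensim.downloader as api
from gensim.test.utils import datapath

wv = api.load("glove-wiki-gigaword-100")   # or e.g. "word2vec-google-news-300"

# Word-similarity benchmark: correlation with human judgments (WordSim-353).
pearson, spearman, oov = wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(spearman)

# Analogy benchmark ("man : king :: woman : ?").
score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(score)
```

These intrinsic scores do not guarantee the embedding captures what your corpus-specific study needs, but they are the most common sanity checks reported in empirical work.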