bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Choosing the right keyword detection technique through udpipe. #28

Closed: adjoshi81 closed this issue 5 years ago

adjoshi81 commented 5 years ago

Hi @jwijffels,

Hope you are doing well. Thanks for building udpipe in R; it is really useful for POS tagging and keyword detection.

Below is a rather lengthy query I encountered while using udpipe's English model on a column of text.

Using the link https://bnosac.github.io/udpipe/docs/doc7.html, I was testing two approaches for keyword detection: (a) RAKE and (b) dependency parsing. During this comparison, I found that the dependency parsing approach reports a lower count for certain phrases than the RAKE approach.
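For reference, both approaches start from the same annotation step, roughly as sketched below (the openxlsx reader and the column name `feedback` are assumptions here; the attached script has the exact code):

```r
library(udpipe)
library(openxlsx)  # assumed reader for the attached Book1.xlsx

## Read the attached data; "feedback" stands in for the actual text column
comments <- read.xlsx("Book1.xlsx")

## Download and load the English UDPipe model, then annotate the text
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- as.data.frame(udpipe_annotate(ud_model, x = comments$feedback))
```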

The input data is attached here: Book1.xlsx

The R Code (in .txt format) is attached here: POS_Viz_and_Dep_Parsing_v1.txt

Lines 125 to 127 of this code run the RAKE algorithm and identify the keyword phrases containing only nouns or adjectives. One of the keyword phrases is "good product", which has a count of 139, as seen in the object top_phrases_noun_adj.
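Those lines follow the RAKE recipe from doc7, roughly like this (object names are illustrative; the attached script has the exact code):

```r
## RAKE on lemmas, keeping only nouns and adjectives as candidate terms
rake <- keywords_rake(x, term = "lemma", group = "doc_id",
                      relevant = x$upos %in% c("NOUN", "ADJ"))

## Multi-word phrases such as "good product" and their frequencies
top_phrases_noun_adj <- subset(rake, ngram > 1 & freq > 1)
head(top_phrases_noun_adj)
```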

Lines 133 to 151 compute the nominal subjects through dependency parsing, using the filter dep_rel == "nsubj" & upos %in% c("NOUN") & upos_parent %in% c("ADJ") to identify such phrases. Looking at the same keyword phrase, i.e. "good product", in the object dep_parse_nsubj3 gives a count of 18.
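This mirrors the nominal-subject recipe from the udpipe documentation; a sketch of what those lines presumably do:

```r
## Attach each token's parent by joining the token table to itself on
## head_token_id within the same sentence
deps <- merge(x, x,
              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
              all.x = TRUE, all.y = FALSE,
              suffixes = c("", "_parent"), sort = FALSE)

## Keep nominal subjects whose parent is an adjective, e.g. "product" -> "good"
nsubj <- subset(deps, dep_rel %in% "nsubj" &
                      upos %in% c("NOUN") &
                      upos_parent %in% c("ADJ"))
nsubj$term <- paste(nsubj$lemma_parent, nsubj$lemma, sep = " ")
dep_parse_nsubj3 <- txt_freq(nsubj$term)
head(dep_parse_nsubj3)
```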

This is definitely lower than RAKE, as there are a couple of additional filter conditions related to the POS of the parent word and dep_rel = nsubj, which seems to be the expected behavior.

Next, I modified the dependency parsing code: lines 154 to 171 compute the nominal subjects in a different way than lines 133 to 151, this time including only the filter condition upos %in% c("NOUN", "ADJ"). This is the same condition used in the lines that compute the keywords through the RAKE approach.
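In other words, the variant keeps the same join but drops the dep_rel and parent-POS conditions, i.e. something like:

```r
## Variant: only restrict the token's own POS, matching the RAKE filter;
## drop tokens without a parent (the sentence root)
nsubj2 <- subset(deps, upos %in% c("NOUN", "ADJ") & !is.na(lemma_parent))
nsubj2$term <- paste(nsubj2$lemma_parent, nsubj2$lemma, sep = " ")
head(txt_freq(nsubj2$term))
```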

The same keyword "good product", as seen in the rewritten object dep_parse_nsubj3, now has a count of 21, while its reverse term, i.e. "product good", has a count of 156.

I am unsure why the second variation of the dependency parsing code gives a lower count for this keyword term than the RAKE approach.

Moreover, could you suggest which approach is better for identifying keywords, as these can be further used for other tasks like low-level theme detection/bucketing of sentences and so on?

Thank you for your help in advance and best regards.

jwijffels commented 5 years ago

I think this question is better answered on Stack Overflow; this does not seem to be a bug.

The difference between your two approaches is that you set dep_rel %in% c("nsubj") in the first approach, while in the other dependency parsing approach you do not use that filter: you basically paste nouns or adjectives together with the head word of that noun or adjective. RAKE looks at a contiguous sequence of terms (as in words next to each other). Dependency parsing can link words which are not necessarily right next to each other.

Regarding the question on methodology: I'm sure you'll find this out yourself when you try out different keyword extraction techniques for your domain. This type of question should be asked on Stack Exchange. I tend to use a combination of manual input, frequencies, and statistics, sometimes in combination with embeddings, to validate keywords. This is really a matter of personal taste and being comfortable with the techniques. There is only one golden rule: inspect your data. In your case, if I were you, I would be asking myself: if I do the selection upos %in% c("NOUN") & upos_parent %in% c("ADJ"), what is the dep_rel when the word is "product", and what do these dep_rel elements mean as defined at http://universaldependencies.org/u/dep/index.html?
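For instance, assuming x is your annotated data frame, a quick way to run that inspection is something like:

```r
## Which dependency relations does the word "product" actually occur with?
product <- subset(x, tolower(lemma) == "product")
table(product$dep_rel)
```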

adjoshi81 commented 5 years ago

Hi @jwijffels,

Thanks for your response. I had asked this question on Stack Overflow (https://stackoverflow.com/questions/52205992/understanding-rs-udpipe-output-related-to-phrase-detection), but since I didn't attach data and code for that query, I thought it would be appropriate to ask it here. I agree with your statement that RAKE will output contiguous words, whereas dependency parsing might associate words which are separated by 'x' number of words in between, where 'x' itself varies. Hence, comparing counts for the same keyword directly will not make sense.

However, I was looking to see whether using the same base/filter conditions for RAKE (lines 125 to 127 of the code) and dependency parsing (lines 154 to 171 of the code) gives the same count for each phrase. It doesn't seem to be the case. This was slightly confusing, hence I wanted to know whether one approach is superior to the other or, better, whether there are any guidelines that would help me choose one method over the other (if such exist).

I will try to check for similar questions on the Stack sites to see if I can get some answers in a couple of days.

Thanks.

jwijffels commented 5 years ago

@adjoshi81 You can ask methodological questions about RAKE versus dependency parsing on https://stats.stackexchange.com. This GitHub repository is for issues (bugs/feature requests), not for methodological questions. RAKE looks for words following one another; dependency parsing can find how words are linked to one another even if they are not next to each other. That is the reason why you have different counts.

adjoshi81 commented 5 years ago

@jwijffels: Thank you for this clarification. As you rightly mentioned, RAKE works on contiguous or sequential words, unlike dependency parsing; shouldn't it then be the case that the count of occurrences of a keyword identified through dependency parsing is at least the same as the one reported by RAKE (or maybe even higher)? I was prompted to ask this question taking a cue from issue #11, which was also not a bug but related to package usage. Thank you for your inputs; I will check some of the other websites for this answer. Regards.

jwijffels commented 5 years ago

Why would it be at least as high? If the words are next to each other (which is what RAKE looks for), that does not mean that the dependency parse links them through the head word. Neither does it work the other way around: words that are linked through the head are not necessarily next to each other.
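To illustrate both directions with two toy sentences (a sketch assuming the English model loaded as ud_model earlier; exact parses can vary by model version):

```r
toy <- as.data.frame(udpipe_annotate(ud_model, x = c(
  "This is a good product.",                    # good/product adjacent: RAKE pairs them,
                                                # but the link is typically amod, not nsubj
  "The product we bought yesterday is good.")))  # product is typically the nsubj of good,
                                                 # although the two words are far apart
toy[, c("doc_id", "token", "upos", "head_token_id", "dep_rel")]
```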

adjoshi81 commented 5 years ago

That would be true if one uses nsubj as the dependency relationship along with the POS tags ADJ or NOUN; that is where the head word etc. should come into play. But even though it wouldn't make much sense, that's what I did when I used only the POS tags and not the dependency relationship, which I referred to in my original post as the second variation of dependency parsing. Even then, the counts didn't match.

jwijffels commented 5 years ago

I've written above: in the other dependency parsing approach, you do not use that filter; you basically paste nouns or adjectives together with the **head word of that noun or adjective**. The head word could have any POS tag, not necessarily a noun or adjective like "product" or "good".