chankoo / BOAZ-projects

BOAZ Adv

Papers #1

Open chankoo opened 6 years ago

chankoo commented 6 years ago

word2vec Parameter Learning Explained.pdf

chankoo commented 6 years ago

Deep Neural Networks for YouTube Recommendations_2016_google.pdf

chankoo commented 6 years ago

Deep Learning based Recommender System; A Survey and New Perspectives.pdf

chankoo commented 6 years ago

Tag-based recommendation: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions (Gediminas Adomavicius and Alexander Tuzhilin, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, June 2005)

chankoo commented 6 years ago

Content-Based Collaborative Filtering for News Topic Recommendation.pdf

chankoo commented 6 years ago

기계학습_기반의_뉴스_추천_서비스_구조와_그_효과에_대한_고찰.pdf (A study of the structure of machine-learning-based news recommendation services and their effects)

Gangsss commented 6 years ago

github : https://opentutorials.org/course/2708

chankoo commented 5 years ago

Aspect extraction for opinion mining with a deep convolutional neural network (a multi-level CNN for aspect detection): http://sentic.net/aspect-extraction-for-opinion-mining.pdf

Skip-Thought Vectors (sentence-level embeddings): https://arxiv.org/pdf/1506.06726.pdf

chankoo commented 5 years ago

Reference: a deep learning guide for NLP http://docs.likejazz.com/deep-learning-for-nlp/

chankoo commented 5 years ago

LARA https://www.cs.virginia.edu/~hw5x/paper/rp166f-wang.pdf https://www.cs.virginia.edu/~hw5x/paper/p618.pdf

chankoo commented 5 years ago

[What Airbnb Reviews can Tell us?] Summary

Data Pre-processing

First, since they "crawled" tweets online, they obtained some useless information (such as URL links), and the Repackage program was used to remove it. Next, they converted all of the posts into "bags" of words by splitting sentences into separate words with a tokenization function. As mentioned in the first section, not all of the words in a sentence (such as "I", "am", "is", and "are") are useful for sentiment analysis; these are referred to as stop words. Pak and Paroubek (2010) removed stop words to reduce the overall number of words. To enhance the accuracy of the results, the negation problem was addressed with n-grams, which keep negated bigrams such as "do not" and "not like" together instead of treating each word separately.

Words reflecting guest experience aspects were retained in the dictionary, except for: (a) stop words like "about", "can", "does", and "a/an"; (b) abbreviations like "LA", "CA", "I'm", and "aren't"; (c) highly ambiguous words such as "go" and "do"; and (d) words related to the Airbnb location, such as "Los Angeles", "California", and "United States".
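A rough Python sketch of these preprocessing steps (the regex tokenizer, the small stop-word list, and the underscore-joined negation bigrams are illustrative choices of mine, not the authors' exact pipeline):

```python
import re

STOP_WORDS = {"i", "am", "is", "are", "about", "can", "does", "a", "an"}
NEGATIONS = {"not", "no", "never"}

def preprocess(post: str) -> list[str]:
    # Strip URL links left over from crawling
    post = re.sub(r"https?://\S+", "", post)
    # Tokenize into a bag of lower-cased words
    tokens = re.findall(r"[a-z']+", post.lower())
    # Merge a negation word with the following token ("do not like"
    # -> "not_like") so negation survives the bag-of-words step
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # Drop stop words
    return [t for t in merged if t not in STOP_WORDS]
```

For example, `preprocess("I do not like it. See http://x.com")` keeps the negation as a single token `not_like` while discarding the URL and the stop words.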

Information Extraction

As a result, besides the frequently employed nouns and noun phrases, verbs can also carry genuine and salient information hidden in customers' reviews.

Topic modeling is a valid method for extracting meaningful information from a large amount of textual data. This study applied the most common topic modeling method, LDA, to extract valuable information, which was then used as the seed words in the later bootstrapping procedure.
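As an illustration of how LDA can surface such seed words, here is a toy collapsed Gibbs sampler in pure Python; the paper does not specify its LDA implementation, so the function name, hyperparameters, and "top 3 words per topic" cutoff below are all my assumptions:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    n_tw = defaultdict(int)   # (topic, word) counts
    n_dt = defaultdict(int)   # (doc, topic) counts
    n_t = [0] * n_topics      # tokens assigned to each topic
    # Random topic initialization per token
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            n_tw[(t, w)] += 1
            n_dt[(d, t)] += 1
            n_t[t] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment, resample, re-add
                n_tw[(t, w)] -= 1; n_dt[(d, t)] -= 1; n_t[t] -= 1
                weights = [
                    (n_tw[(k, w)] + beta) / (n_t[k] + V * beta)
                    * (n_dt[(d, k)] + alpha)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_tw[(t, w)] += 1; n_dt[(d, t)] += 1; n_t[t] += 1
    # Top words per topic become candidate seed words
    return [
        sorted(vocab, key=lambda w: n_tw[(k, w)], reverse=True)[:3]
        for k in range(n_topics)
    ]
```

Running it on a handful of tokenized reviews returns one small list of candidate seed words per topic, which is the form the bootstrapping stage below expects.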

Information categorization

Topic modeling methods actually perform both information-expression discovery and categorization at the same time in an unsupervised manner, because topic modeling classifies the contents of a document collection (Andrzejewski, Zhu, & Craven, 2009). Since the present study is innovative in applying LDA for information extraction, the algorithm automatically classifies the extracted information into different aspects.

Aspect Segmentation

Based on their proposed annotation of aspects, the dependency between each aspect and each word was calculated using the Chi-Square (χ2) statistic proposed by Wang et al. (2010), and phrases with high dependency scores were added to the corresponding aspect keyword lists obtained in the previous step. This calculation was repeated until the keyword list of each aspect stopped changing or the iteration limit was reached. In this study, both verbs and nouns were generated in the information extraction stage using LDA, and these words were fed into aspect segmentation as the seed words for obtaining the keywords. Thus, the keywords after this stage include nouns, verbs, adjectives, etc.
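The bootstrapping loop described above might look like this in Python. The χ² counts follow the usual 2×2 word/aspect contingency table; the sentence-labeling rule, the 3.84 threshold (χ² at p ≈ 0.05), and the iteration limit are simplifying assumptions of mine, not the paper's exact settings:

```python
def chi_square(c1, c2, c3, c4):
    """Chi-square dependency between a word and an aspect.
    c1: sentences containing the word, labeled with the aspect
    c2: sentences containing the word, labeled with other aspects
    c3: sentences without the word, labeled with the aspect
    c4: sentences without the word, labeled with other aspects
    """
    n = c1 + c2 + c3 + c4
    denom = (c1 + c3) * (c2 + c4) * (c1 + c2) * (c3 + c4)
    return n * (c1 * c4 - c2 * c3) ** 2 / denom if denom else 0.0

def bootstrap_aspects(sentences, seeds, threshold=3.84, max_iter=10):
    """Grow each aspect's keyword set from the LDA seed words.
    A sentence is labeled with the aspect whose keywords it matches most."""
    keywords = {a: set(ws) for a, ws in seeds.items()}
    for _ in range(max_iter):
        labels = []
        for s in sentences:
            best = max(keywords, key=lambda a: len(keywords[a] & set(s)))
            labels.append(best if keywords[best] & set(s) else None)
        changed = False
        vocab = {w for s in sentences for w in s}
        for a in keywords:
            for w in vocab - keywords[a]:
                c1 = sum(1 for s, l in zip(sentences, labels) if w in s and l == a)
                c2 = sum(1 for s, l in zip(sentences, labels) if w in s and l not in (a, None))
                c3 = sum(1 for s, l in zip(sentences, labels) if w not in s and l == a)
                c4 = sum(1 for s, l in zip(sentences, labels) if w not in s and l not in (a, None))
                if chi_square(c1, c2, c3, c4) > threshold:
                    keywords[a].add(w)
                    changed = True
        if not changed:
            break  # keyword lists unchanged -> converged
    return keywords
```

The loop terminates exactly as the text describes: either the keyword lists stop changing or `max_iter` is hit.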

Sentiment Detection

Sentiment/opinion words and their targets (aspects) are related, so sentiment words can be determined via their identified aspects, and aspects can be determined via known sentiment words. To do so, sentiment words and aspects are propagated back and forth, hence the term "double propagation" for this process. Specific dependency relationships between sentiment words and aspects are used to develop extraction rules. The results of a study by Tesnière (1965) revealed that adjectives can be considered sentiment words.

The NRC Sentiment and Emotion Lexicons, an adjective dictionary, supplies the sentiment polarities used in the later aspect rating prediction. Since the keywords from the previous stage are a mix of nouns, verbs, adjectives, etc., the sentiment value of the adjective nearest to each keyword is used in further calculation. If the keyword is itself an adjective, its polarity from the NRC Sentiment and Emotion Lexicons is used directly.
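A minimal sketch of this nearest-adjective rule, using a toy polarity lexicon in place of the real NRC Sentiment and Emotion Lexicons (the POS tags are assumed to be given by an upstream tagger):

```python
# Toy polarity lexicon standing in for the NRC Sentiment and Emotion Lexicons
LEXICON = {"great": 1, "clean": 1, "dirty": -1, "noisy": -1}

def keyword_sentiment(tokens, pos_tags, keyword):
    """Sentiment polarity for one keyword in a POS-tagged sentence.
    If the keyword is itself an adjective, use its own lexicon polarity;
    otherwise use the polarity of the nearest adjective."""
    idx = tokens.index(keyword)
    if pos_tags[idx] == "ADJ":
        return LEXICON.get(keyword, 0)
    adjs = [i for i, t in enumerate(pos_tags) if t == "ADJ"]
    if not adjs:
        return 0  # no adjective nearby -> neutral
    nearest = min(adjs, key=lambda i: abs(i - idx))
    return LEXICON.get(tokens[nearest], 0)
```

So for "the room was dirty", the noun keyword "room" inherits the polarity of the nearest adjective "dirty", while an adjective keyword is scored directly.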

chankoo commented 5 years ago

[What Airbnb Reviews can Tell us?] Summary (continued)

Latent Rating Regression

The latent rating regression (LRR) model was designed to formally capture the generation process described above. After aspect segmentation, a word frequency matrix Wd was generated for each listing d, which provided the normalized frequency of the words belonging to each aspect. In this model, Wd was treated as the independent variables (that is, the features of listing d), while the listing rating was treated as the dependent variable (the variable to be predicted). The LRR assumed that overall listing ratings were not directly determined by the word-frequency features. Instead, the model used latent aspect ratings that were determined by aspect word frequencies in combination with their corresponding weights. The work of Wang et al. (2010) was adapted to show that the review-level aspect rating vector sd was a linear sum of Wdi and β, where β indicated the sentiment polarities on aspect Ai obtained from the sentiment detection. The weighted sum of the aspect ratings sd and the aspect weights αd determined the overall rating. Specifically, it was assumed that the overall rating was a sample drawn from a Gaussian distribution, indicating that overall rating predictions are uncertain. Wang et al. (2010) further discovered that reviewer emphasis on various aspects can be a complex issue due to the various factors involved. For instance, reviewers may show different preferences for the aspects at hand (that is, business travelers may place an emphasis on Internet service, while honeymooning couples may be more concerned with the listings themselves). Furthermore, aspects may not be independent, particularly when certain aspects overlap (that is, reviewers interested primarily in cleanliness are most likely more interested in the listings themselves as well). Wang et al. (2010) accommodated reviewer preference diversity by treating the aspect weight αd of each listing d as a free variable drawn from a prior distribution shared across the listings as a whole. A multivariate Gaussian distribution was used as the prior distribution for the aspect weights to capture the dependencies among aspects. Under the LRR model in the study, the probability of the observed overall rating was given as follows, where rd and Wd were observed for listing d in the previous analysis; μ, Σ, δ2, and β were model parameters; and αd was the latent aspect weight of listing d. μ, Σ, and δ2 did not depend on individual reviewers and were deemed aspect-level parameters. The LRR model is graphically represented in Figure 3.

$$s_{d,i} = \beta_i^{\top} W_{d,i}, \qquad r_d \sim \mathcal{N}\!\left(\alpha_d^{\top} s_d,\ \delta^2\right), \qquad \alpha_d \sim \mathcal{N}(\mu, \Sigma) \tag{1}$$

$$p\left(r_d \mid W_d\right) = \int \mathcal{N}\left(\alpha_d \mid \mu, \Sigma\right)\, \mathcal{N}\!\left(r_d \mid \alpha_d^{\top} s_d,\ \delta^2\right)\, d\alpha_d \tag{2}$$
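The generative story can be sketched numerically as follows; the variable names mirror the text (Wd, β, αd, δ²), but the shapes and function signature are my assumptions, not the paper's code:

```python
import numpy as np

def lrr_overall_rating(W_d, beta, alpha_d, delta2, rng=None):
    """One draw from the LRR generative story for a single listing d.
    W_d[i]  : normalized word frequencies of aspect i (one vector per aspect)
    beta[i] : word sentiment polarities for aspect i (same length as W_d[i])
    alpha_d : aspect weights for listing d
    delta2  : variance of the overall-rating Gaussian
    Returns (aspect ratings s_d, sampled overall rating r_d)."""
    rng = rng or np.random.default_rng(0)
    # Latent aspect ratings: linear combination of word frequencies
    # and sentiment polarities
    s_d = np.array([W_i @ b_i for W_i, b_i in zip(W_d, beta)])
    # Overall rating drawn from a Gaussian centred on the weighted sum
    mean = alpha_d @ s_d
    return s_d, rng.normal(mean, np.sqrt(delta2))
```

With `delta2 = 0` the draw collapses to the deterministic weighted sum, which makes the role of the Gaussian noise term easy to see.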

LRR Model Estimation

This section discusses how the model parameters were estimated in the present study using the maximum likelihood (ML) algorithm. In other words, the ML estimator was employed to obtain the optimal Θ = (μ, Σ, δ2) that maximizes the likelihood of the overall ratings. Adapted from Wang et al. (2010), the log-likelihood was computed over all the customer reviews:

$$\mathcal{L}(\Theta) = \sum_{d \in D} \log p\left(r_d \mid W_d, \mu, \Sigma, \delta^2, \beta\right) \tag{3}$$

For ML estimation, all of the parameter values were first randomly initialized to obtain an initial Θ(0). An EM-style algorithm was then used to iteratively update the parameters by alternately performing the E-step and the M-step in each iteration as follows:

  1. E-Step: For each listing d, the authors inferred the aspect rating sd and aspect weight αd under the current parameters Θ(t) (where t denotes the iteration), based on the discussion above.
  2. M-Step: Based on the aspect ratings sd and aspect weights αd obtained under the existing parameters Θ(t), the update rules adapted from Wang et al. (2010) were applied, and Θ(t+1) was obtained by maximizing the "complete likelihood" of the overall ratings rd, the aspect ratings sd, and the aspect weights αd of each listing d. Here, the goal was to maximize the probability of observing all the αd obtained in the current step. Thus, Wang et al. (2010) derived the Gaussian distribution parameter updates from the ML estimation:

$$\mu^{(t+1)} = \frac{1}{|D|}\sum_{d} \alpha_d, \qquad \Sigma^{(t+1)} = \frac{1}{|D|}\sum_{d} \left(\alpha_d - \mu^{(t+1)}\right)\left(\alpha_d - \mu^{(t+1)}\right)^{\top}, \qquad \left(\delta^{2}\right)^{(t+1)} = \frac{1}{|D|}\sum_{d} \left(r_d - \alpha_d^{\top} s_d\right)^2 \tag{4}$$
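Given the E-step estimates, the closed-form M-step updates for μ, Σ, and δ² described above can be sketched as follows (a simplification of mine that treats the inferred αd and sd as point estimates rather than full posteriors):

```python
import numpy as np

def m_step(alpha, r, s):
    """Closed-form M-step updates for the aspect-level parameters,
    given E-step estimates of each listing's aspect weights and ratings.
    alpha : (D, k) inferred aspect weights per listing
    r     : (D,)   observed overall ratings
    s     : (D, k) inferred aspect ratings
    Returns the updated (mu, Sigma, delta2)."""
    D = alpha.shape[0]
    mu = alpha.mean(axis=0)                                   # prior mean
    diff = alpha - mu
    sigma = diff.T @ diff / D                                 # prior covariance
    delta2 = np.mean((r - np.sum(alpha * s, axis=1)) ** 2)    # rating noise
    return mu, sigma, delta2
```

Alternating this with the E-step until the log-likelihood stops improving is exactly the EM loop the two numbered steps above describe.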

chankoo commented 5 years ago

[What Airbnb Reviews can Tell us?] Summary (continued)

(equation images 5 and 6 from the paper)