alperyilmaz / dav-exercises

Exercise questions submitted by Data Analysis and Visualization with R course students at YTU
GNU General Public License v3.0

sentiment analysis of my project (LEVENT KILIÇ 1405A015) #54

Open KilicLevent opened 6 years ago

KilicLevent commented 6 years ago

Please fill in the form to submit an exercise question. Please state the question under the Question section. Please try to be as specific as possible when describing the problem. In the hint chunk, you can provide a statement (which function to use, or which columns to join, etc.) or you can provide the first 1-2 lines of the expected result. Please refer to the GitHub markdown table instructions if you need to include a table. In the solution chunk, please provide the code to solve the problem. Your solutions should be runnable on anybody's computer. Thus, please don't include file locations on your own computer while importing data. The data should come from an R package or from an online source.

Question

Using sentiment analysis methods, find the most frequently used words and the most important words in the project abstract.

my_project <- c("ABSTRACT",
                "Heavy metals are preferred as metals whose are heavier five times than water molecule and show toxemic effect event at low concentrations.", 
                "These metals consist of zinc, silver, lead, iron, chromium, copper, arsenic, cadmium and nickel metals.", 
                "The pollution in drinkable water caused by these heavy metals possesses a great threat to the environment, peoples, and other living organisms in recent years.", 
                "This type of pollution can be observed significantly in areas where industrialization is intense.", 
                "A lot of factories in various sectors such as food, pharmaceutical, chemistry, cosmetic and beverage, consumes a big amount of water during their productions and the used water, unless handled carefully, mixed with crude water in river, lakes, seas etc. in turn, causing pollution on a big scale of water.", 
                "Polluted waters cause various diseases like cancer Alzheimer, Parkinson and heart dysfunctions mainly.", 
                "Also, underground water is commonly used for agriculture.", 
                "Therefore, heavy metals can be taken into the body by consuming food watered with contaminated water, since these foods took these heavy metals along with water.")
my_project
library(dplyr)
# tibble() replaces the deprecated data_frame(); one row per element of my_project.
data_project <- tibble(line = seq_along(my_project), source = "my_project", my_project = my_project)
data_project
library(tidytext)
library(stringr)
library(tidyverse)
tidy_project <- data_project %>%
  unnest_tokens(word, my_project)
str(data_project)
tidy_project %>%
  count(source, word, sort = TRUE)
library(ggplot2)
tidy_project %>%
  count(word, sort = TRUE) %>%
  filter(n > 1) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
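word_project <- tidy_project %>%
  count(source, word, sort = TRUE)

# Add the total number of words in the abstract to every row.
totalword_project <- word_project %>%
  mutate(total_word = sum(n))
totalword_project

project_A <- left_join(word_project, totalword_project, by = c("source", "word", "n"))

# Term frequency: the share of all words accounted for by each word.
project_A %>%
  mutate(tf = n / total_word)
# bind_tf_idf() takes the term column first, then the document column, then the count.
project_tf_idf <- project_A %>%
  bind_tf_idf(word, source, n)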
project_bigrams <- data_project %>%
  unnest_tokens(bigram, my_project, token = "ngrams", n = 2)
project_bigrams
project_bigrams %>%
  count(bigram, sort = TRUE)
bigrams_separated <- project_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_counts
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
# The text column of data_project is named my_project, not text.
data_project %>%
  unnest_tokens(trigram, my_project, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
bigram_tf_idf <- bigrams_united %>%
  count(source, bigram) %>%
  bind_tf_idf(bigram, source, n) %>%
  arrange(desc(tf_idf))
bigram_tf_idf
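
One caveat about the tf-idf steps above: the whole abstract is stored under a single `source` value, so every term occurs in the only document, `idf = ln(1/1) = 0`, and every `tf_idf` score comes out as 0. A minimal sketch of one possible workaround, assuming each sentence (the `line` column) is treated as its own document. The three-sentence `sample_text` vector is hypothetical and only stands in for `my_project`:

```r
library(dplyr)
library(tidytext)

# Hypothetical sample standing in for the my_project vector.
sample_text <- c(
  "Heavy metals pollute drinking water.",
  "Polluted water causes diseases.",
  "Underground water is used for agriculture."
)

line_tf_idf <- tibble(line = seq_along(sample_text), text = sample_text) %>%
  unnest_tokens(word, text) %>%
  # Drop common stop words before counting.
  anti_join(stop_words, by = "word") %>%
  count(line, word) %>%
  # Argument order: term column, document column, count column.
  bind_tf_idf(word, line, n) %>%
  arrange(desc(tf_idf))

line_tf_idf
```

In this sample, "water" occurs in every sentence, so it still scores 0, while words that appear in only one sentence get a positive score. Whether sentences are a sensible document unit for the abstract is a judgment call.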

About images: If your question or solution contains an image, please attach necessary images by dragging them here or copy/pasting from clipboard. After the upload, a markdown style link to image will be generated for you.

Additional information

A specific result is not required; a brief analysis of the results will be enough.
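
The code in the question counts words and n-grams but does not assign a sentiment to any of them. A minimal sketch of one way to add that step, assuming the Bing lexicon returned by tidytext's `get_sentiments("bing")` is an acceptable choice. The two-sentence `sample_text` vector is hypothetical and only stands in for `my_project`:

```r
library(dplyr)
library(tidytext)

# Hypothetical sample standing in for the my_project vector.
sample_text <- c(
  "The pollution caused by heavy metals is a great threat to the environment.",
  "Polluted waters cause various diseases."
)

sample_words <- tibble(line = seq_along(sample_text), text = sample_text) %>%
  unnest_tokens(word, text)

# Keep only words found in the Bing lexicon, then count them per sentiment.
sentiment_counts <- sample_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, word, sort = TRUE)

sentiment_counts
```

Words missing from the lexicon are silently dropped by `inner_join()`, so the result describes only the part of the abstract that the lexicon covers.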

Originality

Please mark relevant information with x, (ex. [x])

Is this question

If you select Inspired or Paraphrased please provide the links in markdown format ( [link](http://example.com) ). Please provide all relevant links. You can refer to DataCamp course pages if you're inspired by them.

Difficulty Level

According to you, what is the level of difficulty of the question (note: this can be modified by instructor after submission)

Tags (optional)

Please provide a comma-separated list of dplyr verbs (e.g. summarize, left join) or concepts (e.g. text mining) that you think are relevant to the question: `group_by, count, unnest_tokens, filter, sentiment analysis, bind_tf_idf, bigram`

Before submitting