Live Training Session: Brand Analysis using Social Media Data in R

Notebook Review #3

adelnehme opened this issue 4 years ago

adelnehme commented 4 years ago


Hi @vivekv73y and Sowmya :wave:

Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.


Key

:wrench: This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.

:mag: This wasn't exactly clear. This will need rephrasing or elaborating to fully and clearly convey your meaning.

πŸ“£ This is something you should take a lookout for during the session and verbally explain once you arrive at this point.

:star: Excellent work!


General Feedback

Notebook Review

1. Compare brand popularity by extracting and comparing follower counts

"....The followers count for a twitter account indicates the popularity of that account and is a measure of social media influence.

To extract user data directly from Twitter, we usually load the rtweet package, obtain Twitter API access tokens according to the instructions in this article, and extract user data with the following code:


```R
# Load rtweet for the Twitter API helpers and data.table for fwrite()
library(rtweet)
library(data.table)

# Store names of users to extract data on
users <- c("caranddriver", "motortrend", "autoweekUSA", "roadandtrack")

# Extract user data for the Twitter accounts stored in users
users_twts <- lookup_users(users)

# Save extracted data as a CSV file using fwrite() from the data.table library
fwrite(users_twts, file = "users_twts.csv")
```

> To avoid setting up individual API access tokens, we will be directly using a CSV file. 

![image](https://user-images.githubusercontent.com/48436758/85121405-ff4ca000-b224-11ea-8562-c8db167b3f59.png)

### 2. Promote a brand by identifying popular tweets using retweet counts

- [x] 🔧 **2.** Following up on **1** and **B** here - I think we can make the markdown cells here more deliberate, concise, and constructive so they explain the point without too much cognitive overhead or time spent on markdown content. For example, the following chunk of code/text can be replaced with:

![image](https://user-images.githubusercontent.com/48436758/85141588-45b2f680-b247-11ea-8bb7-bfaa5fb45d57.png)

> To extract tweet data for a particular term, we can use the `search_tweets()` function from `rtweet` which has the following arguments:
>
> - `q`: The query being used, for example `"tesla"`.
> - `n`: The number of tweets to extract
> - `lang`: The language of the tweets - here set to `"en"`
> - `include_rts`: A boolean value indicating whether retweets should be included in the resulting data
>
> In this notebook we will be using a CSV file to import the tweets, but extracting tweets on `"tesla"` with `search_tweets()` can be done as follows:
>
```R
# Extract 18000 tweets on Tesla
tweets_tesla = search_tweets("tesla", n = 18000, lang = "en", include_rts = FALSE)
fwrite(tweets_tesla, "tesladf.csv")
```

Hence, the resulting markdown cell would look something like this: *(screenshot)*

3a) Visualizing frequency of tweets using time series plots

```R
# Convert the created_at column from a timestamp string to a POSIXct date-time
tesladf$created_at <- as.POSIXct(tesladf$created_at, format = "%Y-%m-%dT%H:%M:%SZ", tz = "GMT")
```

3b) Compare brand salience for two brands using time series plots and tweet frequencies


πŸ” The rest of the notebook is in good shape for a first draft πŸš€ I'll have more thorough feedback on it once all this feedback is implemented in the second round review πŸ˜„

vivekv73y commented 4 years ago

Hi @adelnehme ,

Thanks for the review.

We have implemented all feedback based on your suggestions/inputs above.

Please let us know if you have further feedback.

Thanks Vivek

adelnehme commented 4 years ago

Notebook Review V2

General Feedback

1. Compare brand popularity by extracting and comparing follower count

We can compare the popularity of competing products by looking at their screen names and follower counts.

Note:

  • screen_name: The screen name or Twitter handle that a user identifies themselves with
  • followers_count: The number of followers a Twitter account currently has

The follower count for a Twitter account indicates the popularity of that account and is a measure of social media influence.

To extract user data directly from Twitter, we usually load the rtweet package, obtain Twitter API access tokens according to the instructions in this article, and extract user data with the lookup_users() function, which takes screen names as input and returns user data for those accounts.


```R
# Store names of users to extract data on: Twitter accounts of 4 auto magazines
users <- c("caranddriver", "motortrend", "autoweekUSA", "roadandtrack")

# Extract user data for the Twitter accounts stored in users
users_twts <- lookup_users(users)

# Save extracted data as a CSV file using fwrite() from the data.table library
fwrite(users_twts, file = "users_twts.csv")
```

2. Promote a brand by identifying popular tweets using retweet counts

To extract tweet data for a particular term, we can use the search_tweets() function from the rtweet library, which has the following arguments:

  • q: The query being used, for example "tesla"
  • n: The number of tweets
  • lang: The language of the tweet - here set to "en"
  • include_rts: A boolean value indicating whether retweets should be included in the resulting data

In this notebook, we will be using a CSV file to import the tweets, but extracting tweets on "tesla" with search_tweets() can be done as follows.


```R
# Extract 18000 tweets on Tesla
tweets_tesla = search_tweets("tesla", n = 18000, lang = "en", include_rts = FALSE)

fwrite(tweets_tesla, "tesladf.csv")
```


![image](https://user-images.githubusercontent.com/48436758/85270894-4b445280-b47a-11ea-8e04-ab917c32fb30.png)

- [x] 🔧 **5.** Replace this text in this markdown cell:

> The `text` column usually contains duplicate tweets. We can retain just one version of such tweets by applying the `unique()` function on the `text` column.
> 
> This function takes two arguments:the data frame and the column `text` for removing duplicate tweets.

**with**

> The `text` column usually contains duplicate tweets. To get unique tweets, we can use the `unique()` function which has 2 arguments:
> - The data frame being used
> - `by`: Which columns to search for unique values in 
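
For illustration, here is a minimal sketch of that call, assuming the tweets have been read into a `data.table` (reusing the `tesladf.csv` file saved earlier):

```R
library(data.table)

# Read the saved tweets into a data.table so unique() accepts a `by` argument
tweets_tesla <- fread("tesladf.csv")

# Keep only one row per distinct tweet text
unique_tweets <- unique(tweets_tesla, by = "text")
```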

- [x] 🔧 **6.** Delete this markdown text "View the top 6 unique tweets that got the most number of retweets according to the retweets count"

#### 3. Evaluate brand salience and compare the same for two brands using tweet frequencies

- [x] 🔧 **7.** The **3** needs to be bolded 😄

- [x] 🔧 **8.** Suggest reframing the very first markdown cell here from "In this exercise, we will be analyzing.." to the following:

> Brand salience is the extent to which a brand is continuously talked about. Monitoring tweets on a certain brand over time is an excellent proxy for brand salience. Here we will compare tweets mentioning Tesla vs Toyota over time.

#### 3a).  Visualizing frequency of tweets using time series plots

- [x] 🔧 **9.** At the risk of sounding redundant with the earlier markdown cell, I recommend **deleting** the markdown cell with the contents below - you can always verbally mention that time-series plots are plots over time 😄

> Time series represents a series of data points sequentially indexed over time. Analyzing time series data helps visualize the frequency of tweets over time.
>
> Twitter data can help monitor engagement for a product, indicating levels of interest. Visualizing tweet frequency provides insights into this interest level.
> 
> Let's visualize tweet frequency on the automobile brand "Tesla". We will be using the tweet dataframe created for Tesla in the previous exercise.

- [ ] 🔧 **10.** For the `created_at` column, do we need to use `as.POSIXct()` or can we use `as.Date()`?
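
For reference, a minimal sketch contrasting the two on an illustrative timestamp string (expected outputs shown as comments):

```R
ts <- "2020-06-19T14:05:32Z"

# as.POSIXct() keeps the time of day, which hour-level aggregation needs
as.POSIXct(ts, format = "%Y-%m-%dT%H:%M:%SZ", tz = "GMT")
#> [1] "2020-06-19 14:05:32 GMT"

# as.Date() truncates to the day, so hourly plots would no longer work
as.Date(ts, format = "%Y-%m-%dT%H:%M:%SZ")
#> [1] "2020-06-19"
```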

- [x] 🔧 **11.** I recommend changing the contents of this markdown cell (see image below) to
![image](https://user-images.githubusercontent.com/48436758/85272048-f0135f80-b47b-11ea-8593-c07fc553e8a8.png)

> We see the `created_at` column has the timestamp that we'd need to convert to the correct date format using `as.POSIXct()` which takes in:
> - The column being converted
> - `format`: The date format - here `"%Y-%m-%dT%H:%M:%SZ"`
> - `tz`: The time-zone of the conversion

- [x] 📣 **12.** When discussing date formats - make sure you mention that these are easily searchable and students shouldn't waste time memorizing them - feel free to add this as well
![image](https://user-images.githubusercontent.com/48436758/85272637-b858e780-b47c-11ea-8ef6-043fd0f0dc28.png)

- [x] 🔧 **13.** Please change the markdown seen below to
![image](https://user-images.githubusercontent.com/48436758/85272754-dfafb480-b47c-11ea-8ae9-a69b321c93f8.png)

> To visualize tweets over time, we will use the `rtweet` library's `ts_plot()` function which takes in:
> - The data frame being plotted
> - `by`: The time interval - here `'hours'`
> - `color`: The color of the line
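
A minimal sketch of the call described above, assuming `tesladf` has already had its `created_at` column converted (the `color` argument is passed through to the underlying line layer):

```R
library(rtweet)

# Plot the number of Tesla tweets per hour as a single blue line
ts_plot(tesladf, by = "hours", color = "blue")
```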

#### 3b) Compare brand salience for two brands using time series plots and tweet frequencies

- [x] 🔧 **14.** Please change the very first markdown cell of this section to:

> Let's compare how tweets mentioning `"Toyota"` compare against `"Tesla"` - here is the `search_tweets()` code used to get tweets on `"Toyota"`

```R
# Extract tweets for Toyota using `search_tweets()`
tweets_toyo = search_tweets("toyota", n = 18000, lang = "en", include_rts = FALSE)

fwrite(tweets_toyo, file = "toyotadf.csv")
```

To visualize the number of tweets over time, we aggregate both toyotadf and tesladf into time-series objects using ts_data(), which takes in 2 arguments (see the sketch after this list):

  • The data frame being converted
  • by: The time interval of frequency counting (here 'hours').
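
A minimal sketch of this aggregation and a comparison plot, assuming `ggplot2` for the final chart (the notebook's own merge/plot code may differ):

```R
library(rtweet)
library(ggplot2)

# Aggregate each set of tweets into hourly counts (columns: time, n)
tesla_ts  <- ts_data(tesladf,  by = "hours")
toyota_ts <- ts_data(toyotadf, by = "hours")

# Label each series and stack them for plotting
tesla_ts$brand  <- "Tesla"
toyota_ts$brand <- "Toyota"
both_ts <- rbind(tesla_ts, toyota_ts)

# Compare tweet frequency for the two brands over time
ggplot(both_ts, aes(x = time, y = n, color = brand)) +
  geom_line()
```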



4. Understand brand perception through text mining and by visualizing key terms

I recommend re-imagining the markdown in this section and how it's divided up into small subsections. Here's a collection of points aimed at addressing this:


One of the most important and common tasks in social media data analysis is understanding what users are tweeting about the most and how they perceive a particular brand. In this section, we will process the tweets mentioning "Tesla" and build a word-cloud that showcases their most common words.

4a) Text mining by processing twitter text


Tweets are unstructured, noisy, and raw, and properly processing them is essential to accurately capture useful brand-perception information. Here are some processing steps we'll be performing:

  • Step 1: Remove redundant information including URLs, special characters, punctuation and numbers.
  • Step 2: Convert the text to a Corpus (i.e. large document of text)
  • Step 3: Convert all letters in the Corpus to lower case.
  • Step 4: Trim leading and trailing spaces from Corpus (Adel: I felt like it makes more sense to have the white space trimming after the lower case conversion - does that make sense to you?)
  • Step 5: Remove common words (the, a, and ...), also called stop words, from the Corpus.
  • Step 6: Remove custom stop words from the Corpus.

i) Remove URLs and characters other than letters

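A minimal sketch of this step, assuming the `qdapRegex` helper `rm_twitter_url()` (the notebook may use a different function; `twt_txt_url` matches the object referenced in the next subsection):

```R
library(qdapRegex)

# Strip Twitter short URLs (https://t.co/...) from the raw tweet text
twt_txt_url <- rm_twitter_url(tesladf$text)
```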

ii) Replace special characters, punctuation and numbers

To remove special characters, punctuation, and numbers, we will use the gsub() function (sketched after this list), which takes in:

  • The pattern to search for - for example, to match anything that is not a letter (including numbers and punctuation), we can use the regular expression "[^A-Za-z]".
  • The character to replace each match with.
  • The text source - here twt_txt_url.
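
A minimal sketch of the call, assuming each match is replaced with a space (the output name `twt_txt_chrs` is illustrative):

```R
# Replace everything that is not a letter with a space
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
```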

iii) Build a corpus


A Corpus is a list of text documents and is often used in text processing functions. To create a corpus, we will be using the tm library and the functions VectorSource() and Corpus(). VectorSource() converts the tweet text to a vector of texts, and Corpus() takes the output of VectorSource() and converts it to a Corpus. An example on a tweets object would be:

```R
library(tm)
library(magrittr)  # provides the %>% pipe used below

# Convert the tweet text vector into a corpus of documents
my_corpus <- tweets %>%
  VectorSource() %>%
  Corpus()
```

The end result would look like this: *(screenshot)*

iv) Convert corpus to lowercase

To make all words in our corpus uniform, we will convert the Corpus to lower case ('Tesla' vs 'tesla'). To do this, we will use the tm_map() function, which applies a transformation to the corpus. In this case, it takes in 2 arguments (sketched below):

  • The corpus being transformed
  • The transformation to apply - here the tolower() function
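
A minimal sketch, reusing the `my_corpus` object built above (wrapping `tolower` in `content_transformer()` is an addition here so the transformation is applied to the text content of each document):

```R
library(tm)

# Convert every document in the corpus to lower case
corpus_lwr <- tm_map(my_corpus, content_transformer(tolower))
```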

v) Remove stop words from the Corpus

Stop words are commonly used words like "a", "an", and "the". They tend to dominate frequency counts and skew your analysis if left in the corpus. We will remove English stop words from the Corpus by using tm_map(), which in this case takes in 3 arguments (sketched below):

  • The corpus being transformed
  • The transformation itself, removeWords()
  • The English stop words to be removed, stored in stopwords("english")
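
A minimal sketch, continuing from the lower-cased corpus in the sketch above:

```R
# Remove English stop words ("a", "an", "the", ...) from the corpus
corpus_stop <- tm_map(corpus_lwr, removeWords, stopwords("english"))
```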

vi) Remove additional spaces from the Corpus
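
A minimal sketch of this step, assuming the `tm` `stripWhitespace` transformation and the object names from the sketches above:

```R
# Collapse the extra whitespace left behind by the removals above
corpus_spc <- tm_map(corpus_stop, stripWhitespace)
```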

vii) Remove custom stop words from the Corpus
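
A minimal sketch; the custom stop words listed here are purely illustrative:

```R
# Remove domain-specific terms that dominate the corpus but add little meaning
custom_stop <- c("tesla", "car", "cars")
corpus_final <- tm_map(corpus_spc, removeWords, custom_stop)
```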

4b) Understand brand perception by visualizing key terms in the corpus
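
A minimal sketch of one way to visualize the key terms, assuming the `wordcloud` package (which can compute term frequencies directly from a `tm` corpus); the colours and limits are illustrative:

```R
library(wordcloud)

# Plot up to 50 of the most frequent terms in the processed corpus
wordcloud(corpus_final, max.words = 50, colors = "blue",
          scale = c(3, 0.5), random.order = FALSE)
```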

5. Sentiment analysis of tweets to understand customers' feelings and sentiments about a brand
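
As a forward-looking sketch for this section, assuming the `syuzhet` package (installed later in this thread) and its `get_nrc_sentiment()` scorer:

```R
library(syuzhet)

# Score each tweet against the NRC lexicon (anger, joy, trust, ..., negative, positive)
sentiment_scores <- get_nrc_sentiment(tesladf$text)

# Total each emotion across all tweets to summarise overall sentiment about the brand
colSums(sentiment_scores)
```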

vivekv73y commented 4 years ago

Hi @adelnehme ,

I have updated almost all the points. A few are only partially updated or not updated; I have given my responses for those below:

1) Transferring contents to a new notebook: Response: Can you please review this updated notebook first for any final comments? I will create a new version as soon as we have (almost) finalized any major corrections. Please just let me know when we are ready to create a new version without executing the cells and I will do it immediately.

10) For the created_at column, do we need to use as.POSIXct() or can we use as.Date()? Response: I am not sure myself so I have just retained the POSIX code. Is that ok?

22) Step 4: Trim leading and trailing spaces from Corpus (Adel: I felt like it makes more sense to have the white space trimming after the lower case conversion - does that make sense to you?) Response: The latter steps, i.e. removing common words and removing custom stop words, leave a lot of additional white space in the corpus. So, I feel it is better to keep the original version and retain white space trimming as the last step. Hope this is ok. All other suggestions under this point have been carried out.

31 & 32) Check out my feedback on 22 and where I think this should fall in the outline of the session. Does that work for you? Response: See response above for point 22.

Thanks and please let me know if you have any questions or comments.

Best wishes Vivek

vivekv73y commented 4 years ago

Hi @adelnehme , Unfortunately, the serial numbers got auto-numbered. Points 1, 2 & 3 above correspond to points 1, 10 & 22 in your feedback, respectively.

Thanks Vivek

adelnehme commented 4 years ago

Hi @vivekv73y :wave:

Everything's looking great from my end! The notebook's in great shape 😄 I have a few points of feedback before we transition to a new notebook:

1) Adding a Q&A section between each major section - feel free to add it as a markdown cell with this code:

```
---
<center><h1> Q&A 1</h1> </center>

---
```


2) In section 3a) - can you use the following HTML <img> snippet for displaying the image with the different time formats? The align and width attributes let you play around with the position and the size. I recommend keeping width at 50% 😄

<p align="left">
<img src="https://github.com/datacamp/Brand-Analysis-using-Social-Media-Data-in-R-Live-Training/blob/master/data/striptime.png?raw=true" alt = "" width="50%">
</p>

3) For package installations - would it be possible to install CRAN packages from binaries, meaning that the package installation cell would look like the following? This will make it much faster (~60s) to install the packages 😄

```R
system('apt-get install r-cran-httpuv r-cran-rtweet r-cran-reshape r-cran-qdap r-cran-tm r-cran-qdapregex')
install.packages('syuzhet')
```

vivekv73y commented 4 years ago

Hi @adelnehme ,

Thanks for reviewing the notebook again and providing final comments.

Point 1: updated.
Point 2: updated. I have set the width to 40% as even 50% looks large. I hope that is ok; otherwise, please let me know and I can update it to 50%.
Point 3: updated. Can you please check that the code is correct as suggested above?

Finally, I have copied all cells into a new notebook without executing those cells. Can you review using the notebook called live_session_solution.ipynb? We will use this file going forward.

Thanks again and please let me know if you have any comments.

Best wishes Vivek

vivekv73y commented 4 years ago

Hi @adelnehme ,

I hope you are well.

I looked at your presentation on Python for spreadsheet users and noticed that you did not use animation to click through the points on any of the slides. Is there a particular reason for that, and do you advise me to follow the same setup in my session?

Thanks Vivek

vivekv73y commented 4 years ago

Hi @adelnehme ,

Another question: do you recommend having the Google Colab environment in light mode (white background with black text) or the "dark" setting (black background with white text)? I saw your presentation and it had a white background with black text. Please advise. << Just to add: I like and prefer the "dark" setup as it highlights the arguments well in the markdown cells, but I can work with the light background as well :)

Thanks Vivek

adelnehme commented 4 years ago

Hi @vivekv73y :wave:

Thanks for reaching out!

re: slides

I personally did not find a use-case for animations during my webinars - but if you like them, please feel free to approach it this way 😄

re: notebook dark mode

If you prefer to use the notebook in dark mode during the session, I do not see a problem in trying it out. However, I'd keep the link opening in the default (light mode); as you open up the notebook, you can state that you like dark mode and will be setting it that way, and that students should feel free to follow suit or not 😄

Hope this helps!

vivekv73y commented 4 years ago

These inputs definitely help.

Thanks a lot, @adelnehme. Vivek