adelnehme commented 4 years ago

Hi @vivekv73y and Sowmya :wave:

Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.

Key

:wrench: This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.

:mag: This wasn't exactly clear. This will need rephrasing or elaborating to fully and clearly convey your meaning.

📣 This is something you should take a lookout for during the session and verbally explain once you arrive at this point.

:star: Excellent work!

General Feedback

[x] 🔧 A. I recently learned that tidyverse is pre-loaded in our colabs environment - so make sure to install non-tidyverse packages only 😄
[x] 🔧 B. Check out points 1 and 2 and the usage of markdown to explain and showcase lookup_users(). I recommend doing the same for each time you introduce anrtweet function we will not be using because of API limitations and to find opportunities to make markdown slightly more concise 😄
[x] 🔧 C. In order to give you more time to go deeper into the first 5 sections - I recommend dropping section 6 and instead showcase that this could be done in the final slides - and that students should take your course to figure it out 😄
[x] 🔧 D. Make sure all sections # - subsections ## and sub-subsections ### are all in bold font.

Notebook Review

1. Compare brand popularity by extracting and comparing follower counts

[x] 🔧 1. To avoid students getting distracted with code that will generate an error during your session, I recommend just setting this entire section in the image below to a markdown cell. For example, could be something along the lines of:

"....The followers count for a twitter account indicates the popularity of that account and is a measure of social media influence.

To extract user data directly from twitter, we usually load the rtweet package - obtain and create Twitter API access tokens according to the instructions in this article - and extract user data with the following code:
# Store name of users to extract data on
users <- c("caranddriver", "motortrend", "autoweekUSA", "roadandtrack")

Extract user data for the twitter accounts stored in users

users_twts <- lookup_users(users)

Save extracted data a CSV file using fwrite from data.table library

fwrite(users_twts, file = "users_twts.csv")

> To avoid setting up individual API access tokens, we will be directly using a CSV file. 

![image](https://user-images.githubusercontent.com/48436758/85121405-ff4ca000-b224-11ea-8562-c8db167b3f59.png)

### 2. Promote a brand by identifying popular tweets using retweet counts

- [x] 🔧 **2.** Following up on **1** and **B** here - I think we can have much more deliberate/concise/constructive markdown cells here that explain the point without risking too much cognitive overhead and time spent on markdown content. For example, the following chunk of code/text can be replaced with:

![image](https://user-images.githubusercontent.com/48436758/85141588-45b2f680-b247-11ea-8bb7-bfaa5fb45d57.png)

> To extract tweet data for a particular term, we can use the `search_tweets()` function from `rtweet` which has the following arguments:
>
> - `q`: The query being used, for example `"tesla"`.
> - `n`: The number of tweets
> - `lang`: The language of the tweet - here set to `"en"`
> - `include_rts`: A boolean value that either accepts the inclusion of retweets or resulting data
> In this notebook we will be using a CSV file to import the tweets, but using the `search_tweets()` to extract tweets on `"tesla"` can be done as such
>
```R
# Extract 18000 tweets on Tesla
tweets_tesla = search_tweets("tesla", n = 18000, lang = "en", include_rts = FALSE)
fwrite(tweets_tesla, "tesladf.csv")

Hence looking something like this:

3a) Visualizing frequency of tweets using time series plots

[x] 🤔 3. I think it would reduce cognitive overload slightly if we abandon the use of a function to convert the date - and immediately go for using the code below - this would be applied to any subsequent use of the format.git.date() function.

tesladf$created_at <- as.POSIXct(tesladf$created_at, format = "%Y-%m-%dT%H:%M:%SZ", tz = "GMT")

3b) Compare brand salience for two brands using time series plots and tweet frequencies

[x] 🔧 4. Consider abandoning the code cell with search_tweets() for Toyota - you can always verbally mention that it has been extracted the same way we extracted tesla earlier 😄
[x] 🔧 5. When you break down a function with its arguments like in the photo below - make sure to always use bullet points for arguments. This is applicable for all sections

🔍 The rest of the notebook is in good shape for a first draft 🚀 I'll have more thorough feedback on it once all this feedback is implemented in the second round review 😄

vivekv73y commented 4 years ago

Hi @adelnehme ,

Thanks for the review.

We have implemented all feedback based on your suggestions/inputs above.

Please let us know if you have further feedback.

Thanks Vivek

adelnehme commented 4 years ago

Notebook Review V2

General Feedback

[ ] 🔧 1. I noticed this bug with R Notebooks is that usually R Notebooks with output cells sometimes bug and become corrupted when we create links out of them for students. Do you mind transferring all the contents of this notebook to new one without running the cells please? I seem to be suffering from extremely long install times on the install packages section and my hunch is because of that 🤔

1. Compare brand popularity by extracting and comparing follower count

[x] 🔧 1. In the markdown section, can you please link the rtweet article using [ ]()(see image below) to hyperlink the documentation to the word "article"
[x] 🔧 2. I recommend using the following text in the markdown:

We can compare followers count for competing products by using their screen names and follower counts.

Note:

screen_name: The screen name or twitter handle that an user identifies themselves with

followers_count: The number of followers a twitter account currently has.

The followers count for a twitter account indicates the popularity of that account and is a measure of social media influence.

To extract user data directly from twitter, we usually load the rtweet package, obtain and create Twitter API access tokens according to the instructions in this article and extract user data with the lookup_users() function which takes screen names as input and extracts user data from twitter accounts.

[x] 🔧 3. Can you please use inline formatting with triple backticks like this to format R code:

# Store name of users to extract data on twitter accounts of 4 auto magazines
users <- c("caranddriver", "motortrend", "autoweekUSA", "roadandtrack")

# Extract user data for the twitter accounts stored in users
users_twts <- lookup_users(users)

# Save extracted data as a CSV file using `fwrite()` from`data.table` library
fwrite(users_twts, file = "users_twts.csv")

2. Promote a brand by identifying popular tweets using retweet counts

[x] 🔧 4. Please use inline formatting with ` to format argument names (likeq,n,lang`...) as well as triple backticks similar to 3 for formatting the code here. The final product should like this

To extract tweet data for a particular term, we can use the search_tweets() function from rtweet library which has the following arguments:

q: The query being used, for example "tesla"

n: The number of tweets

lang: The language of the tweet - here set to "en"

include_rts: A boolean value that either accepts the inclusion of retweets or not on resulting data

In this notebook, we will be using a CSV file to import the tweets but using search_tweets() to extract tweets on "tesla" can be done as such.
# Extract 18000 tweets on Tesla
tweets_tesla = search_tweets("tesla", n = 18000, lang = "en", include_rts = FALSE)

fwrite(tweets_tesla, "tesladf.csv")


![image](https://user-images.githubusercontent.com/48436758/85270894-4b445280-b47a-11ea-8e04-ab917c32fb30.png)

- [x] 🔧 **5.**  Replace this text in this markdown here:

> The `text` column usually contains duplicate tweets. We can retain just one version of such tweets by applying the `unique()` function on the `text` column.
> 
> This function takes two arguments:the data frame and the column `text` for removing duplicate tweets.

**with**

> The `text` column usually contains duplicate tweets. To get unique tweets, we can use the `unique()` function which has 2 arguments:
> - The data frame being used
> - `by`: Which columns to search for unique values in 

- [x] 🔧 **6.** Delete this markdown text "View the top 6 unique tweets that got the most number of retweets according to the retweets count"

#### 3. Evaluate brand salience and compare the same for two brands using tweet frequencies

- [x] 🔧 **7.** The **3** needs boldening 😄 

- [x] 🔧 **8.** Suggest reframing the absolute first markdown here from "In this exercise, we will be analyzing.." to the following: 

> Brand salience is the extent to which a brand is continuously talked about. Monitoring tweets on a certain brand over time is an excellent proxy to brand salience. Here we will compare how tweets mentioning Tesla vs Toyota are over time. 

#### 3a).  Visualizing frequency of tweets using time series plots

- [x] 🔧 **9.** At the risk of not sounding too redundant with the earlier markdown cell, I recommend **deleting** the markdown cell with the contents below - you can always verbally mention that time-series plots are plots over time 😄 

> Time series represents a series of data points sequentially indexed over time. Analyzing time series data helps visualize the frequency of tweets over time.
>
> Twitter data can help monitor engagement for a product, indicating levels of interest. Visualizing tweet frequency provides insights into this interest level.
> 
> Let's visualize tweet frequency on the automobile brand "Tesla". We will be using the tweet dataframe created for Tesla in the previous exercise.

- [ ] 🔧 **10.** For the `created_at` column, do we need to use `as.POSIXct()` or can we use `as.Date()`?

- [x] 🔧 **11.** I recommend changing the contents of this markdown cell (see image below) to
![image](https://user-images.githubusercontent.com/48436758/85272048-f0135f80-b47b-11ea-8593-c07fc553e8a8.png)

> We see the `created_at` column has the timestamp that we'd need to convert to the correct date format using `as.POSIXct()` which takes in:
> - The column being converted
> - `format`: The date format - here to be `""%Y-%m-%dT%H:%M:%SZ"` 
> - `tz`: The time-zone of the conversion

- [x] 📣 **12.** When discussing date formats - make sure you mention that these are easily searchable and students shouldn't waste time memorizing them - feel free to add this as well 
![image](https://user-images.githubusercontent.com/48436758/85272637-b858e780-b47c-11ea-8ef6-043fd0f0dc28.png)

- [x] 🔧 **13.** Please change the markdown seen below to 
![image](https://user-images.githubusercontent.com/48436758/85272754-dfafb480-b47c-11ea-8ae9-a69b321c93f8.png)

> To visualize tweets over time, we will use the `rtweet` library's `ts_plot()` function which takes in:
> - The data frame being plotted
> - `by`: The time interval - here `'hours'`
> - `color`: The color of the line

#### 3b) Compare brand salience for two brands using time series plots and tweet frequencies**

- [x] 🔧 **14.** Please change the very first markdown cell of this section to:

> Let's compare how tweets mentioning `"Toyota"` compare against `"Tesla"` - here is the `search_tweets()` code used to get tweets on `"Toyota"`

```R
# Extract tweets for Toyota using `search_tweets()`

tweets_toyo = search_tweets("toyota", n = 18000, lang = "en",  include_rts = FALSE)

fwrite(tweets_toyo, file = "toyotadf.csv")

[x] 🔧 15. There is no need to mention in markdown pre-saved CSV ... since students already have experienced this and you can verbally say it.
[x] 🔧 16. Similar to 15 - no need to use markdown here that we need to update dates - verbally and using the code comments here are fine.
[x] 🔧 17. Consider replacing the markdown in the image below with the following

To visualize the number of tweets over time - we aggregate both toyotadf and tesladf into time-series objects using ts_data() - which takes in 2 arguments:

The data frame being converted

by: The time interval of frequency counting (here 'hours').

[x] 🔧 18. Given 17 - I recommend deleting intermediary markdown text between the code cells where we create a ts object out of tesla and toyota so that the end result looks like this

[x] 🔧 19. Remove "," at the end of each sentence for both images - (also for melt() - make sure to use inline formatting when introducing the reshape library and the melt() function).

4. Understand brand perception through text mining and by visualizing key terms

I recommend a re-imagine of the markdown in this section and how it's divided up into small subsections. Here's a collection of points aimed at addressing this

[x] 🔧 20. I recommend changing the markdown here to:

One of the most important and common tasks in social media data analysis is being able to understand what users are tweeting about the most and how they perceive a particular brand. In this section, we will visualize the most common words mentioning "Tesla" to build a word-cloud that showcases the most common words.

4a) Text mining by processing twitter text

[x] 🔧 21. Recommend changing this sub-sub section name to "Processing tweets and twitter data"
[x] 🔧 22. Recommend changing the markdown section here to:

Tweets are unstructured, noisy, and raw, and properly processing them is essentially to accurately capture useful brand-perception information. Here are some processing steps we'll be performing:

Step 1: Remove redundant information including URLs, special characters, punctuation and numbers.

Step 2: Convert the text to a Corpus (i.e. large document of text)

Step 3: Convert all letters in the Corpus to lower case.

Step 4: Trim leading and trailing spaces from Corpus (Adel: I felt like it makes more sense to have the white space trimming after the lower case conversion - does that make sense to you?)

Step 5: Remove common words (the, a, and ...), also called stop words, from the Corpus.

Step 6: Remove custom stop words from the Corpus.

i) Remove URLs and characters other than letters

[x] 🔧 23. Instead of using "#### i) Remove URLs and characters other than letters" for each step's section - recommend just using bold sentences like this Step 1: Remove URLs

ii) Replace special characters, punctuations and numbers

[x] 🔧 24. Same as 23 - making it Step 2: Remove special characters, punctuations and numbers - also, recommend using this markdown to explain gsub()

To remove special characters, punctuation, and numbers, we will use the gsub() function which takes in:

The pattern to search for - for example if we are searching for non-numbers and non-letters, the regular expression "[^A-Za-z]" is a pattern.

The character to replace it with.

The text source here twt_txt_url.

[x] 📣 25. Make sure to remind students here what regular expressions are (I like control + F but on steroids 😄 )

iii) Build a corpus

[x] 🔧 26. Same as 23 and 24 - making it Step 3: Building a Corpus
[x] 🔧 27. I recommend replacing this entire section in this image by one markdown cell - the following:

A Corpus is a list of text documents and is often used in text processing functions. To create a corpus, we will be using the tm library and the functions VectorSource() and Corpus(). The VectorSource() converts the tweet text to a vector of texts and the Corpus() function takes the output of VectorSource() and converts to a Corpus. An example on a tweets object would be:

library(tm)
my_corpus <- tweets %>% 
                        VectorSource() %>% 
                         Corpus()

The end result would look like this

iv) Convert corpus to lowercase

[x] 🔧 28. Make sure to use the Step 4: Convert Corpus to lower case
[x] 🔧 29. I recommend changing the markdown here to focus on the tm_map() function. For example, the markdown section would become:

To have all words in our corpus being uniform, we will lower all words in the Corpus to lower case ('Tesla' vs 'tesla'). To do this, will use the tm_map() function which applies a transformation to the corpus. In this case, it takes in 2 arguments:

The corpus being transformed

The transformation itself, stored in the tolower() function

v) Remove stop words from the Corpus

[x] 🔧 30. Given that 29 explained tm_map() - recommend changing the markdown here to:

Stop words are commonly used words like "a", "an", "the" etc... They are often the most common and tend to skew your analysis if left in the corpus. We will remove English stop words from the Corpus by using tm_map() - which takes in this case 3 arguments:

The corpus being transformed

The transformation itself, stored in removeWords().

The English stop words to be removed, stored in stopwords("english").

vi) Remove additional spaces from the Corpus

[ ] 🔧 31. Check out my feedback on 22 and where I think this should fall in the outline of the session. Does that work for you? (Also make sure to change its name to Step ...

vii) Remove custom stop words from the Corpus

[ ] 🔧 32. Check out my feedback on 22 and where I think this should fall in the outline of the session. Does that work for you? (Also make sure to change its name to Step ...

4b) Understand brand perception by visualizing key terms in the corpus

[x] 🔧 33. I recommend having in this section only 1 or 2 wordlcloud examples (without barplots) to give you more time to focus deeply on certain areas instead of having too much breadth.
[x] 🔧 34. Recommend renaming this section to "Visualizing brand perception"

5.Sentiment analysis of tweets to understand customer's feelings and sentiments about a brand

[x] 🔧 35. Consider changing this section to "Further understanding brand perception by analyzing tweet sentiment"
[x] 🔧 36. Check out 22 and how we used the Step 1:.. for steps in building a corpus - I recommend using the same for sentiment analysis as well as 29 on how functions have been presented - I recommend doing the same approach for this section here 😄

vivekv73y commented 4 years ago

Hi @adelnehme ,

I have updated almost all the points except a few which are partially updated or not updated for which I have given responses below:

1) transferring contents to a new notebook: Response: Can you please review this updated notebook first for any final comments and I will create a new version as soon as we (almost) finalize for any major corrections. Please just let me know when we are ready to create a new version without executing the cells and I will do it immediately.

10) For the created_at column, do we need to use as.POSIXct() or can we use as.Date()? Response: I am not sure myself so I have just retained the POSIX code. Is that ok?

22) Step 4: Trim leading and trailing spaces from Corpus (Adel: I felt like it makes more sense to have the white space trimming after the lower case conversion - does that make sense to you?) Response: The latter steps i.e. removing common words and removing custom stop words leave lot of additional white spaces in the corpus. So, I feel it is better to keep the original version and retain white space trimming as the last step. Hope this is ok. All other suggestions under this point have been carried out.

31 Check out my feedback on 22 and where I think this should fall in the outline of the session. 32 Does that work for you? 32: Check out my feedback on 22 and where I think this should fall in the outline of the session. Does that work for you? Response: See response above for point 22.

Thanks and please let me know if you have any questions or comments.

Best wishes Vivek

vivekv73y commented 4 years ago

Hi @adelnehme , Unfortunately, the serial numbers got auto-numbered. The points 1, 2 & 3 are for 1, 10 & 22 in your feedback respectively.

Thanks Vivek

adelnehme commented 4 years ago

Hi @vivekv73y :wave:

Everything's looking great from my end! The notebook's looking great 😄 I have 2 points off feedback before we transition to a new notebook:

1) Adding a Q&A section between each major section - feel free to add it as a markdown cell with this code:

---
<center><h1> Q&A 1</h1> </center>

---

2) In section 3a) - can you use the following HTML src function for visualizing the image with the different time formats - the align and width arguments let you play around with the position and the size. I recommend keeping width to 50% 😄

<p align="left">
<img src="https://github.com/datacamp/Brand-Analysis-using-Social-Media-Data-in-R-Live-Training/blob/master/data/striptime.png?raw=true" alt = "" width="50%">
</p>

3) For package installations - would it be possible to install cran packages from binaries? Meaning that the installation package cell should look like the following? This will make it much faster (~60s) to install the packages 😄

system('apt-get install r-cran-httpuv r-cran-rtweet r-cran-reshape r-cran-qdap r-cran-tm r-cran-qdapregex')
install.packages('syuzhet')

vivekv73y commented 4 years ago

Hi @adelnehme ,

Thanks for reviewing the notebook again and providing final comments. Point 1: updated. Point 2: updated. I have set width to 40% as even 50% looks large. Hope it is ok or else please let me know and I can update to 50%. Point 3: updated. Can you please check if the coding is correct as suggested above.

Finally, I have copied all cells into a new notebook without executing those cells. Can you review using the notebook called, live_session_solution.ipynb. We will use this file going forward.

Thanks again and please let me know if you have any comments.

Best wishes Vivek

vivekv73y commented 4 years ago

Hi @adelnehme ,

I hope you are well.

I looked at your presentation on python for spreadsheet users and noticed that you have not used animation to click through the points on any of the slides. Is there any reason to do so and do you advise me to follow the same setup in my session?

Thanks Vivek

vivekv73y commented 4 years ago

Hi @adelnehme ,

Another question: Do you recommend having Google colab environment in white background mode with black text or "dark" setting with black background and white text. I saw your presentation and it had white background with black text. Please advise. << just to add: I like and prefer the "dark" setup as it highlights the arguments well in markdowns but can work with the light background as well :)

Thanks Vivek

adelnehme commented 4 years ago

Hi @vivekv73y :wave:

Thanks for reaching out!

re: slides

I personally did not find a use-case for using animations during my webinars - but if you like it please feel free to approach it this way 😄

re: notebook dark mode

If you prefer to use the notebook with dark mode during the session I do not see a problem in trying it out - however I'd keep the link as default (white mode) when it opens and then you can state as you open up the notebook that you like dark mode so you will be setting it this way and that students should feel free to follow suit or not 😄

Hope this helps!

vivekv73y commented 4 years ago

These inputs definitely help.

Thanks a lot, @adelnehme . Vivek

datacamp / Brand-Analysis-using-Social-Media-Data-in-R-Live-Training

Notebook Review #3

Key

General Feedback

Notebook Review

1. Compare brand popularity by extracting and comparing follower counts

Extract user data for the twitter accounts stored in users

Save extracted data a CSV file using fwrite from data.table library

3a) Visualizing frequency of tweets using time series plots

3b) Compare brand salience for two brands using time series plots and tweet frequencies

Notebook Review V2

General Feedback

1. Compare brand popularity by extracting and comparing follower count

2. Promote a brand by identifying popular tweets using retweet counts

4. Understand brand perception through text mining and by visualizing key terms

4a) Text mining by processing twitter text

i) Remove URLs and characters other than letters

ii) Replace special characters, punctuations and numbers

iii) Build a corpus

iv) Convert corpus to lowercase

v) Remove stop words from the Corpus

vi) Remove additional spaces from the Corpus

vii) Remove custom stop words from the Corpus

4b) Understand brand perception by visualizing key terms in the corpus

5.Sentiment analysis of tweets to understand customer's feelings and sentiments about a brand