adelnehme opened 4 years ago
Hi @adelnehme ,
Thanks for the review.
We have implemented all feedback based on your suggestions/inputs above.
Please let us know if you have further feedback.
Thanks Vivek
- [x] :wrench: **1.** In the markdown section, can you please link the rtweet article using `[]()` (see image below) to hyperlink the documentation to the word "article"
- [x] :wrench: **2.** I recommend using the following text in the markdown:

> We can compare followers count for competing products by using their screen names and follower counts.
>
> Note:
> - `screen_name`: The screen name or twitter handle that a user identifies themselves with
> - `followers_count`: The number of followers a twitter account currently has. The followers count for a twitter account indicates the popularity of that account and is a measure of social media influence.
>
> To extract user data directly from twitter, we usually load the `rtweet` package, obtain and create Twitter API access tokens according to the instructions in this article, and extract user data with the `lookup_users()` function, which takes screen names as input and extracts user data from twitter accounts.
```R
# Store names of users to extract data on twitter accounts of 4 auto magazines
users <- c("caranddriver", "motortrend", "autoweekUSA", "roadandtrack")

# Extract user data for the twitter accounts stored in users
users_twts <- lookup_users(users)

# Save extracted data as a CSV file using fwrite() from the data.table library
fwrite(users_twts, file = "users_twts.csv")
```
- [x] :wrench: Use `` ` `` to format argument names (like `q`, `n`, `lang`...) as well as triple backticks, similar to 3, for formatting the code here. The final product should look like this:

> To extract tweet data for a particular term, we can use the `search_tweets()` function from the `rtweet` library, which has the following arguments:
> - `q`: The query being used, for example `"tesla"`
> - `n`: The number of tweets
> - `lang`: The language of the tweet - here set to `"en"`
> - `include_rts`: A boolean value that either accepts the inclusion of retweets or not in the resulting data
>
> In this notebook, we will be using a CSV file to import the tweets, but using `search_tweets()` to extract tweets on `"tesla"` can be done as such:

```R
# Extract 18000 tweets on Tesla
tweets_tesla = search_tweets("tesla", n = 18000, lang = "en", include_rts = FALSE)
fwrite(tweets_tesla, "tesladf.csv")
```
![image](https://user-images.githubusercontent.com/48436758/85270894-4b445280-b47a-11ea-8e04-ab917c32fb30.png)
- [x] :wrench: **5.** Replace this text in this markdown here:
> The `text` column usually contains duplicate tweets. We can retain just one version of such tweets by applying the `unique()` function on the `text` column.
>
> This function takes two arguments: the data frame and the column `text` for removing duplicate tweets.
**with**
> The `text` column usually contains duplicate tweets. To get unique tweets, we can use the `unique()` function which has 2 arguments:
> - The data frame being used
> - `by`: Which columns to search for unique values in
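As a minimal sketch of that call - assuming the tweets were read in as a `data.table` (the output name `tweets_unique` is just illustrative):

```R
library(data.table)

# Keep only the first occurrence of each distinct tweet text;
# `by = "text"` restricts the uniqueness check to the text column
tweets_unique <- unique(as.data.table(tweets_tesla), by = "text")
```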
- [x] :wrench: **6.** Delete this markdown text "View the top 6 unique tweets that got the most number of retweets according to the retweets count"
#### 3. Evaluate brand salience and compare the same for two brands using tweet frequencies
- [x] :wrench: **7.** The **3** needs boldening :smile:
- [x] :wrench: **8.** Suggest reframing the absolute first markdown here from "In this exercise, we will be analyzing.." to the following:
> Brand salience is the extent to which a brand is continuously talked about. Monitoring tweets on a certain brand over time is an excellent proxy for brand salience. Here we will compare how tweets mentioning Tesla vs Toyota trend over time.
#### 3a). Visualizing frequency of tweets using time series plots
- [x] :wrench: **9.** At the risk of sounding too redundant with the earlier markdown cell, I recommend **deleting** the markdown cell with the contents below - you can always verbally mention that time-series plots are plots over time :smile:
> Time series represents a series of data points sequentially indexed over time. Analyzing time series data helps visualize the frequency of tweets over time.
>
> Twitter data can help monitor engagement for a product, indicating levels of interest. Visualizing tweet frequency provides insights into this interest level.
>
> Let's visualize tweet frequency on the automobile brand "Tesla". We will be using the tweet dataframe created for Tesla in the previous exercise.
- [ ] :wrench: **10.** For the `created_at` column, do we need to use `as.POSIXct()` or can we use `as.Date()`?
- [x] :wrench: **11.** I recommend changing the contents of this markdown cell (see image below) to:
![image](https://user-images.githubusercontent.com/48436758/85272048-f0135f80-b47b-11ea-8593-c07fc553e8a8.png)
> We see the `created_at` column has the timestamp that we'd need to convert to the correct date format using `as.POSIXct()` which takes in:
> - The column being converted
> - `format`: The date format - here to be `"%Y-%m-%dT%H:%M:%SZ"`
> - `tz`: The time-zone of the conversion
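As a minimal sketch of that conversion (assuming the data frame is named `tweets_tesla` and the timestamps are in UTC - both assumptions for illustration):

```R
# Parse the ISO-8601 timestamps in created_at into POSIXct date-times
tweets_tesla$created_at <- as.POSIXct(tweets_tesla$created_at,
                                      format = "%Y-%m-%dT%H:%M:%SZ",
                                      tz = "UTC")
```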
- [x] :mega: **12.** When discussing date formats - make sure you mention that these are easily searchable and students shouldn't waste time memorizing them - feel free to add this as well
![image](https://user-images.githubusercontent.com/48436758/85272637-b858e780-b47c-11ea-8ef6-043fd0f0dc28.png)
- [x] :wrench: **13.** Please change the markdown seen below to:
![image](https://user-images.githubusercontent.com/48436758/85272754-dfafb480-b47c-11ea-8ae9-a69b321c93f8.png)
> To visualize tweets over time, we will use the `rtweet` library's `ts_plot()` function which takes in:
> - The data frame being plotted
> - `by`: The time interval - here `'hours'`
> - `color`: The color of the line
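A minimal sketch of that call (assuming the Tesla tweets are stored in `tweets_tesla`; as I understand it, extra arguments such as `color` are forwarded to the underlying ggplot2 line layer):

```R
library(rtweet)

# Plot hourly tweet counts as a time series line chart
ts_plot(tweets_tesla, by = "hours", color = "blue")
```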
#### 3b) Compare brand salience for two brands using time series plots and tweet frequencies
- [x] :wrench: **14.** Please change the very first markdown cell of this section to:
> Let's compare how tweets mentioning `"Toyota"` compare against `"Tesla"` - here is the `search_tweets()` code used to get tweets on `"Toyota"`
```R
# Extract tweets for Toyota using `search_tweets()`
tweets_toyo = search_tweets("toyota", n = 18000, lang = "en", include_rts = FALSE)
fwrite(tweets_toyo, file = "toyotadf.csv")
```
- [x] :wrench: **15.** There is no need to mention the pre-saved CSV in markdown ... since students have already experienced this and you can verbally say it.
- [x] :wrench: **16.** Similar to 15 - no need to use markdown here to note that we need to update dates - a verbal mention and the code comments here are fine.
- [x] :wrench: **17.** Consider replacing the markdown in the image below with the following:

> To visualize the number of tweets over time - we aggregate both `toyotadf` and `tesladf` into time-series objects using `ts_data()` - which takes in 2 arguments:
> - The data frame being converted
> - `by`: The time interval of frequency counting (here `'hours'`).

`ts` object out of `tesla` and `toyota` so that the end result looks like this

`melt()` - make sure to use inline formatting when introducing the `reshape` library and the `melt()` function.
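To make the two fragments above concrete, here is a minimal sketch of the aggregate-merge-melt flow they describe (the column renames and variable names are assumptions for illustration):

```R
library(rtweet)
library(reshape)

# Aggregate each set of tweets into hourly counts (columns: time, n)
tesla_ts  <- ts_data(tesladf, by = "hours")
toyota_ts <- ts_data(toyotadf, by = "hours")

# Rename the count columns so the two brands stay distinguishable
names(tesla_ts)  <- c("time", "tesla")
names(toyota_ts) <- c("time", "toyota")

# Merge on time, then melt to long format for plotting
merged_ts <- merge(tesla_ts, toyota_ts, by = "time")
long_ts   <- melt(merged_ts, id.vars = "time")
```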
I recommend re-imagining the markdown in this section and how it's divided into small subsections. Here's a collection of points aimed at addressing this:
> One of the most important and common tasks in social media data analysis is understanding what users are tweeting about the most and how they perceive a particular brand. In this section, we will visualize the most common words in tweets mentioning `"Tesla"` by building a word-cloud that showcases them.
- [x] :wrench: **21.** Recommend changing this sub-sub section name to "Processing tweets and twitter data"
- [x] :wrench: **22.** Recommend changing the markdown section here to:

> Tweets are unstructured, noisy, and raw, and properly processing them is essential to accurately capture useful brand-perception information. Here are some processing steps we'll be performing:
> - Step 1: Remove redundant information including URLs, special characters, punctuation, and numbers.
> - Step 2: Convert the text to a Corpus (i.e. a large document of text).
> - Step 3: Convert all letters in the Corpus to lower case.
> - Step 4: Trim leading and trailing spaces from the Corpus. (Adel: I felt like it makes more sense to have the white space trimming after the lower case conversion - does that make sense to you?)
> - Step 5: Remove common words (the, a, and ...), also called stop words, from the Corpus.
> - Step 6: Remove custom stop words from the Corpus.
`gsub()`

> To remove special characters, punctuation, and numbers, we will use the `gsub()` function, which takes in:
> - The pattern to search for - for example, if we are searching for non-numbers and non-letters, the regular expression `"[^A-Za-z]"` is a pattern.
> - The character to replace it with.
> - The text source, here `twt_txt_url`.
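For example, a minimal sketch of that call (the output name `twt_txt_chrs` is an assumption):

```R
# Replace every character that is not a letter with a space
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
```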
- [x] :wrench: **26.** Same as 23 and 24 - making it "Step 3: Building a Corpus"
- [x] :wrench: **27.** I recommend replacing this entire section in the image below with one markdown cell - the following:
> A Corpus is a list of text documents and is often used in text processing functions. To create a corpus, we will be using the `tm` library and the functions `VectorSource()` and `Corpus()`. `VectorSource()` converts the tweet text to a vector of texts, and the `Corpus()` function takes the output of `VectorSource()` and converts it to a Corpus. An example on a `tweets` object would be:
```R
library(tm)
my_corpus <- tweets %>%
  VectorSource() %>%
  Corpus()
```
The end result would look like this
- [x] :wrench: **28.** Make sure to use the "Step 4: Convert Corpus to lower case" heading format here.
- [x] :wrench: **29.** I recommend changing the markdown here to focus on the `tm_map()` function. For example, the markdown section would become:

> To have all words in our corpus be uniform, we will convert all words in the Corpus to lower case (`'Tesla'` vs `'tesla'`). To do this, we will use the `tm_map()` function, which applies a transformation to the corpus. In this case, it takes in 2 arguments:
> - The corpus being transformed
> - The transformation itself, stored in the `tolower()` function
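A minimal sketch of that call on the `my_corpus` object from point 27 - note that, as a deviation from the plain `tolower()` call above, `tm`'s documentation recommends wrapping base functions in `content_transformer()` to preserve the corpus structure:

```R
library(tm)

# Apply the lower-casing transformation to every document in the corpus
my_corpus <- tm_map(my_corpus, content_transformer(tolower))
```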
`tm_map()` - recommend changing the markdown here to:

> Stop words are commonly used words like `"a"`, `"an"`, `"the"`, etc. They are often the most common and tend to skew your analysis if left in the corpus. We will remove English stop words from the Corpus by using `tm_map()` - which in this case takes 3 arguments:
> - The corpus being transformed
> - The transformation itself, stored in `removeWords()`.
> - The English stop words to be removed, stored in `stopwords("english")`.
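Continuing that sketch, the 3-argument `tm_map()` call described above would look like:

```R
# Strip English stop words from every document in the corpus
my_corpus <- tm_map(my_corpus, removeWords, stopwords("english"))
```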
- [x] :wrench: **33.** I recommend having only 1 or 2 wordcloud examples in this section (without barplots) to give you more time to focus deeply on certain areas instead of having too much breadth.
- [x] :wrench: **34.** Recommend renaming this section to "Visualizing brand perception"
- [x] :wrench: **35.** Consider changing this section title to "Further understanding brand perception by analyzing tweet sentiment"
- [x] :wrench: **36.** Check out 22 and how we used the "Step 1: ..." format for the steps in building a corpus - I recommend using the same for sentiment analysis, as well as 29 on how functions have been presented - I recommend the same approach for this section here :smile:
Hi @adelnehme ,
I have updated almost all the points, except a few which are partially updated or not updated; I have given responses to those below:
1) Transferring contents to a new notebook. Response: Can you please review this updated notebook first for any final comments? I will create a new version as soon as we (almost) finalize any major corrections. Please just let me know when we are ready to create a new version without executing the cells and I will do it immediately.
10) For the `created_at` column, do we need to use `as.POSIXct()` or can we use `as.Date()`? Response: I am not sure myself, so I have just retained the POSIX code. Is that ok?
22) Step 4: Trim leading and trailing spaces from Corpus (white space trimming after the lower case conversion). Response: The latter steps, i.e. removing common words and removing custom stop words, leave a lot of additional white space in the corpus. So I feel it is better to keep the original version and retain white space trimming as the last step. Hope this is ok. All other suggestions under this point have been carried out.
31 & 32) Check out my feedback on 22 and where I think this should fall in the outline of the session. Does that work for you? Response: See response above for point 22.
Thanks and please let me know if you have any questions or comments.
Best wishes Vivek
Hi @adelnehme , Unfortunately, the serial numbers got auto-numbered. Points 1, 2 & 3 above correspond to points 1, 10 & 22 in your feedback, respectively.
Thanks Vivek
Hi @vivekv73y :wave:
Everything's looking great from my end! The notebook's in great shape :smile: I have 3 points of feedback before we transition to a new notebook:
1) Adding a Q&A section between each major section - feel free to add it as a markdown cell with this code:
```
---
<center><h1> Q&A 1</h1> </center>
---
```
2) In section 3a) - can you use the following HTML `src` snippet for visualizing the image with the different time formats? The `align` and `width` arguments let you play around with the position and the size. I recommend keeping `width` at 50% :smile:
```html
<p align="left">
<img src="https://github.com/datacamp/Brand-Analysis-using-Social-Media-Data-in-R-Live-Training/blob/master/data/striptime.png?raw=true" alt = "" width="50%">
</p>
```
3) For package installations - would it be possible to install CRAN packages from binaries? Meaning that the package installation cell would look like the following? This will make it much faster (~60s) to install the packages :smile:
```R
system('apt-get install r-cran-httpuv r-cran-rtweet r-cran-reshape r-cran-qdap r-cran-tm r-cran-qdapregex')
install.packages('syuzhet')
```
Hi @adelnehme ,
Thanks for reviewing the notebook again and providing final comments. Point 1: updated. Point 2: updated - I have set `width` to 40%, as even 50% looks large; hope that is ok, or else let me know and I can update it to 50%. Point 3: updated - can you please check that the code is correct as suggested above?
Finally, I have copied all cells into a new notebook without executing those cells. Can you review using the notebook called `live_session_solution.ipynb`? We will use this file going forward.
Thanks again and please let me know if you have any comments.
Best wishes Vivek
Hi @adelnehme ,
I hope you are well.
I looked at your presentation on Python for spreadsheet users and noticed that you did not use animation to click through the points on any of the slides. Is there a reason for that, and do you advise me to follow the same setup in my session?
Thanks Vivek
Hi @adelnehme ,
Another question: do you recommend having the Google Colab environment in white background mode with black text, or the "dark" setting with a black background and white text? I saw your presentation and it had a white background with black text. Please advise. << Just to add: I like and prefer the "dark" setup as it highlights the arguments well in markdown, but I can work with the light background as well :)
Thanks Vivek
Hi @vivekv73y :wave:
Thanks for reaching out!
re: slides
I personally did not find a use-case for animations during my webinars - but if you like them, please feel free to approach it this way :smile:
re: notebook dark mode
If you prefer to use the notebook in dark mode during the session, I do not see a problem with trying it out - however, I'd keep the link as default (white mode) when it opens; then you can state, as you open up the notebook, that you like dark mode and will be setting it that way, and that students should feel free to follow suit or not :smile:
Hope this helps!
These inputs definitely help.
Thanks a lot, @adelnehme . Vivek
Hi @vivekv73y and Sowmya :wave:
Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.
Key
:wrench: This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.
:mag: This wasn't exactly clear. This will need rephrasing or elaborating to fully and clearly convey your meaning.
:mega: This is something you should look out for during the session and verbally explain once you arrive at this point.
:star: Excellent work!
General Feedback
- [x] :wrench: **A.** I recently learned that `tidyverse` is pre-loaded in our colabs environment - so make sure to install non-tidyverse packages only :smile:
- [x] :wrench: **B.** Check out points 1 and 2 and the usage of markdown to explain and showcase `lookup_users()`. I recommend doing the same each time you introduce an `rtweet` function we will not be using because of API limitations, and finding opportunities to make markdown slightly more concise :smile:
- [x] :wrench: **C.** In order to give you more time to go deeper into the first 5 sections - I recommend dropping section 6 and instead showcasing that this could be done in the final slides - and that students should take your course to figure it out :smile:
- [x] :wrench: **D.** Make sure all sections `#`, subsections `##`, and sub-subsections `###` are all in bold font.

Notebook Review
1. Compare brand popularity by extracting and comparing follower counts
```R
# Extract user data for the twitter accounts stored in users
users_twts <- lookup_users(users)

# Save extracted data as a CSV file using fwrite() from the data.table library
fwrite(users_twts, file = "users_twts.csv")
```
Hence looking something like this:
3a) Visualizing frequency of tweets using time series plots

`format.git.date()` function.

3b) Compare brand salience for two brands using time series plots and tweet frequencies

- [x] :wrench: **4.** Consider abandoning the code cell with `search_tweets()` for Toyota - you can always verbally mention that it has been extracted the same way we extracted tesla earlier :smile:
- [x] :wrench: **5.** When you break down a function with its arguments like in the photo below - make sure to always use bullet points for arguments. This is applicable to all sections.
:star: The rest of the notebook is in good shape for a first draft :smile: I'll have more thorough feedback on it once all this feedback is implemented in the second round review :smile: