Exploring Factors of Popularity in U.S. YouTube Videos

by CharlieBitMyFinger (Jesse Solinger, Oliver Wan, and Riley Burns)

Introduction

We examine what makes a YouTube video popular, using a dataset of daily statistics for trending YouTube videos. YouTube remains a dominant force for videos on the internet. Which videos are popular matters for two main reasons: money and influence.

YouTube advertising revenue is big business. YouTube earned $28.8 billion in advertising revenue in 2021 alone (The Hollywood Reporter - 2/1/22). Furthermore, thousands of people earn income by creating YouTube videos and taking a cut of the advertising revenue their videos generate. The most popular YouTube channels can garner hefty incomes, while many other creators strive to sustain themselves on YouTube income alone but have to work side jobs to get by.

Moreover, YouTube content has influence over the public. Popular YouTube videos can significantly affect people’s perspectives on current events. YouTube videos matter to politics and foreign policy, and they can be a controversial medium for spreading misinformation. Additionally, YouTube videos have become an important part of our culture. People present music, art, and comedy on YouTube. Scientific breakthroughs are presented through YouTube. People dream of big ideas when they watch videos on YouTube.

YouTube is important because of money and influence, so we want to learn what factors, on average, make a YouTube video popular.

Methodology

Throughout this project, we use data provided by Kaggle, an online platform that hosts community-published data and code ready for use and manipulation. We specifically use the dataset called “Trending YouTube Video Statistics,” which provides daily statistics for trending YouTube videos. This dataset gives us several months’ worth of data, including the video title, channel title, publish time, tags, views, likes, dislikes, description, comment count, and category ID. It includes data from the U.S., Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India, with each region’s data separated into its own file.

This project focuses solely on data from the U.S. for two reasons. First, combining the data from every region would result in an enormous dataset, and with so many megabytes to process, our code would likely have run more slowly. Second, the U.S. is the country we know best: as American researchers, we can relate more readily to American viewers and their search habits, so designing our analysis is easier and we expect the conclusions we draw to be more accurate. In future research, however, it would be very interesting to explore the global data and make comparisons between countries.

The dataset we initially import contains 16 variables, including likes, dislikes, video title, channel title, and comment count. One variable of specific interest to us is category_id, which gives each YouTube video’s identification number for the category the video is placed in. On its own, the ID number does not tell us what each category is, so we decided to add a 17th variable to our dataset! This variable, called category, provides the category name, or genre, associated with each category ID number. The addition lets us make more connections between video categories and variables such as likes, dislikes, and comment count. From this, we are able to draw conclusions about YouTube videos’ popularity based on their category placement.
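
The step of adding a category variable amounts to a lookup from category_id to a name. The project itself was done in R and its code is not shown here, so the following is only a minimal Python sketch of the idea. The function name `add_category`, the sample rows, and the small `CATEGORY_NAMES` table (an illustrative subset of YouTube's category IDs) are assumptions made for this example.

```python
# Illustrative subset of YouTube category IDs (assumption: not the full table).
CATEGORY_NAMES = {
    10: "Music",
    23: "Comedy",
    24: "Entertainment",
    25: "News & Politics",
}


def add_category(rows, names=CATEGORY_NAMES):
    """Return copies of rows with a 'category' field derived from 'category_id'."""
    labeled = []
    for row in rows:
        new_row = dict(row)
        # CSV readers yield strings, so convert the ID before the lookup.
        new_row["category"] = names.get(int(row["category_id"]), "Unknown")
        labeled.append(new_row)
    return labeled


# Hypothetical rows standing in for records read from the U.S. file.
sample = [
    {"title": "video A", "category_id": "10"},
    {"title": "video B", "category_id": "24"},
    {"title": "video C", "category_id": "999"},
]
labeled_sample = add_category(sample)
for row in labeled_sample:
    print(row["title"], "->", row["category"])
```

IDs missing from the table fall back to "Unknown" rather than raising an error, which keeps unmatched rows visible in later plots.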

In preparing our data for manipulation and analysis, we did not need to clean or alter it much. The only necessary alteration was to change the structure of our trending_date variable from the yy.dd.mm format to the mm.dd.yy format so that R could read the values as dates. Following this step, we were able to begin plotting and finding meaningful connections between our variables!
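
The trending_date conversion can be illustrated with a short sketch. This is Python rather than the R the project used, and the helper name `reformat_trending_date` and the sample value are assumptions for illustration only.

```python
from datetime import datetime


def reformat_trending_date(raw):
    """Convert a yy.dd.mm string (e.g. '17.14.11') to mm.dd.yy."""
    # Parse year, then day, then month, matching the source layout.
    parsed = datetime.strptime(raw, "%y.%d.%m")
    return parsed.strftime("%m.%d.%y")


converted = reformat_trending_date("17.14.11")
print(converted)
```

Parsing into a real date object before re-formatting means an impossible value, such as a month of 14, raises an error instead of passing through silently.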

Returning to our main research question, “What makes a YouTube video popular?”, we focus on creating visualizations that explore this question from all angles, using all of our variables. We also try to stay creative in our approach to visualizations in order to make the project fun and interesting to read! When we think about what factors affect a YouTube video’s popularity, our thoughts go directly to the number of views, the likes, and the dislikes. However, we can also consider the comment count, the time of day a video is published, the YouTube user’s previous viewings/following, and the category a video falls under. Keeping this in mind, we created visualizations that show meaningful relationships between variables, most of which are broken down by category. These visualizations include linear regressions, scatter plots, bar graphs, faceted plots, coordinate plots, and box plots, giving us plenty of different ways to analyze our data and see which variables affect popularity the most.

Discussion

Our goal for the project was to use RStudio to create a variety of plots that demonstrate our findings about what makes a YouTube video popular.

The numerous graphs created throughout this project allowed us to develop a multi-faceted understanding of the data we used to analyze YouTube trends. Our analysis began by comparing Views vs. Likes, Views vs. Dislikes, and Views vs. Comment Count. For each pair of variables we used the same framework of three graphs: a scatter plot, a scatter plot with a linear model, and scatter plots with linear models faceted by category.

The Views vs. Likes scatter plot showed a positive relationship: videos with a greater like count tend to have a higher view count as well. The scatter plot with the linear model also displays a positive correlation between the two variables. The same patterns can be seen in the graphs for Views vs. Dislikes and Views vs. Comment Count. The similarity likely stems from the fact that a comment, like, or dislike is a form of engagement with the content, which usually means that someone has viewed the video. Therefore, from the relationships in these graphs, it could be said that more likes, comments, and dislikes will most likely go along with a higher view count as well.
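
The linear models behind these plots can be sketched as an ordinary least-squares fit plus a correlation coefficient. The snippet below is a Python illustration, not the project's R code. The function `fit_line`, the choice to put likes on the x-axis, and the numbers are all assumptions (the pairs are made up, not taken from the dataset).

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (slope, intercept, r) for a least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Made-up (likes, views) pairs, not values from the dataset.
likes = [100, 200, 300, 400, 500]
views = [1200, 1900, 3100, 4200, 4800]
slope, intercept, r = fit_line(likes, views)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```

A positive slope together with r close to 1 is what an upward-trending scatter plot with a tight linear fit looks like numerically. Neither number says which variable drives the other.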

The next portion of our project looked at the relationship between views and time-based variables. Specifically, we compared views with when videos were uploaded: the hour, the month, and the date published. In the graph for date published, it was apparent that the videos with the most views were posted from November through June. Seeing this data visually prompted conversations about how seasons may affect view counts. According to the date published and views graph, views in the U.S. tend to be higher during the transition into winter and to decrease as the year moves into summer, at least within this dataset. This theory could potentially be explained by people spending more time indoors watching YouTube content during the winter to escape cold weather, and less time doing so when the weather becomes warmer.

In addition to the date published and views graph, we created circular graphs to help with our data visualization. The views and upload time (month) graph shows that videos posted in April and July tend to receive higher views. A similar graph, with hour in place of month, shows that the upload hour with the highest view count is around 4 a.m. These graphs provided interesting insights, particularly the views and hour graph: we did not expect the videos with the most views to be posted at such an early hour, and would have thought they would be posted at times such as 5 p.m., when people are getting off work rather than asleep.
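
The views-by-upload-hour comparison boils down to grouping videos by the hour in publish time and averaging their views. Here is a minimal Python sketch of that grouping, not the project's R code. It assumes publish time is stored as an ISO-style string such as "2018-01-05T04:10:00.000Z" in a field named `publish_time`; the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp layout for the publish time field.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def mean_views_by_hour(rows):
    """Return {hour: mean views} for rows with 'publish_time' and 'views'."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of views, video count]
    for row in rows:
        published = datetime.strptime(row["publish_time"], TIME_FORMAT)
        bucket = totals[published.hour]
        bucket[0] += int(row["views"])
        bucket[1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}


# Hypothetical rows, not values from the dataset.
sample_rows = [
    {"publish_time": "2018-01-05T04:10:00.000Z", "views": "900"},
    {"publish_time": "2018-02-01T04:45:30.000Z", "views": "1100"},
    {"publish_time": "2018-03-09T17:00:00.000Z", "views": "500"},
]
hourly = mean_views_by_hour(sample_rows)
print(hourly)
```

If the timestamps are recorded in UTC, as a trailing "Z" would suggest, the hours would need to be shifted to a U.S. time zone before reading them as local upload times.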

Turning to the categorical side of this project, we found some interesting results in our analysis of category and view count. Using a bar plot, we were able to see which categories contain the most videos. From the graph, it was apparent that the entertainment category had the most views, with music coming in second. To further explore these variables, we then created a box plot of categories and views. As in the bar plot, entertainment and music were the categories with the highest view counts. Furthermore, in a segment plot, the music category also had the highest mean. Altogether, these graphs indicate that the music and entertainment categories bring in high view counts. Given the number of videos in each category, one reason music and entertainment have such high views may be the high number of videos within those categories, which creates more opportunities for a view.
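
Because a category can rank high on total views simply by containing many videos, it can help to compute the video count and a per-video summary side by side. The following Python sketch shows one way to do that. It is not the project's R code, and the function name, field names, and numbers are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def summarize_by_category(rows):
    """Return video count, total views, and median views for each category."""
    views = defaultdict(list)
    for row in rows:
        views[row["category"]].append(int(row["views"]))
    return {
        category: {
            "videos": len(values),
            "total_views": sum(values),
            "median_views": median(values),
        }
        for category, values in views.items()
    }


# Hypothetical rows, not values from the dataset.
category_rows = [
    {"category": "Entertainment", "views": "100"},
    {"category": "Entertainment", "views": "200"},
    {"category": "Entertainment", "views": "300"},
    {"category": "Entertainment", "views": "400"},
    {"category": "Music", "views": "600"},
    {"category": "Music", "views": "800"},
]
summary = summarize_by_category(category_rows)
for category, stats in summary.items():
    print(category, stats)
```

In this made-up sample, Entertainment has more videos while Music has the higher median, which shows how the two measures can point in different directions.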

Our final observation came from a graph of the relationship between YouTube views and the number of days since a video was posted. Other than a few outliers, it is apparent that videos with high view counts tend to gather those views shortly after they are published. This can indicate that many of the videos with high view counts are videos of high interest that are viewed mostly all at once, compared to videos that generate views over a longer period of time.
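
A day count like the one in this graph can be computed by subtracting two dates. The sketch below is Python, not the project's R code. It assumes the interval runs from the publish time to the trending date, and that the two fields use the layouts "2017-11-13T17:13:01.000Z" and yy.dd.mm respectively. The function name and sample values are hypothetical.

```python
from datetime import datetime


def days_to_trending(publish_time, trending_date):
    """Return whole days between the publish date and the trending date."""
    # Assumed layouts: ISO-style publish time, yy.dd.mm trending date.
    published = datetime.strptime(publish_time, "%Y-%m-%dT%H:%M:%S.%fZ").date()
    trending = datetime.strptime(trending_date, "%y.%d.%m").date()
    return (trending - published).days


gap = days_to_trending("2017-11-13T17:13:01.000Z", "17.14.11")
print(gap)
```

Comparing calendar dates rather than full timestamps keeps the result in whole days, which matches how a days-since-posting axis would typically be labeled.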

Presentation

Our presentation can be found here.

Data

The data is from Kaggle. It was posted to Kaggle by a user called Mitchell J, who built the dataset by writing a Python script that scraped the web for YouTube data (the scraper code can be found here). The Kaggle post includes YouTube statistics for the U.S., Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India (each country is posted as a separate dataset).

J.M., 2019, Trending YouTube Video Statistics, electronic dataset, Kaggle.

References

Kaggle YouTube Dataset: (https://www.kaggle.com/datasets/datasnaek/youtube-new?select=USvideos.csv)
