dangkh / DataVisu


title: "COMP4010/5120 - Project 1"
author: "Truong Tuan Vu, Nguyen Minh Tuong, Kieu Hai Dang"
output:
  html_document: default
  pdf_document: default

Introduction

Data: We chose data from the National Retail Federation (NRF) in the United States on consumer spending for Valentine's Day. We picked this dataset to find out how consumers plan to celebrate Valentine's Day, including total spending, average spending, the types of gifts planned, and spending per type of gift. It also provides demographic breakdowns by age group and gender. With this dataset, we can see gift-choosing trends across all ages, from which we can choose suitable gifts to give to our beloved women. The dataset comprises 3 distinct files containing the following details:

Question 1: How does spending vary across different aspects?

a. How does the average percentage spending vary across different gift types and between genders?

Question: How does the spending on different gift categories (e.g., Candy, Flowers, Jewelry, Greeting Cards, Evening Out, Clothing, and Gift Cards) vary across different age groups and between genders?

library(ggplot2)
library(tidyr)
library(readr) # Assuming this might be needed for read_csv if the default read.csv isn't used

# Make sure to correctly read the dataset into 'data'
data <- read.csv("./data/gifts_gender.csv") # Adjust path as needed

# Double-check the 'data' is correctly loaded
# head(data)

# Correctly apply pivot_longer to transform the data
data_long <- pivot_longer(data, cols = -Gender, names_to = "GiftType", values_to = "Percentage")

# Shared dodge so that bars and their labels are offset by the same amount
dodge <- position_dodge(width = 0.5)

# Create the grouped bar plot
ggplot(data_long, aes(x = GiftType, y = Percentage, fill = Gender)) +
  geom_bar(stat = "identity", position = dodge, width = 0.5) +
  geom_text(aes(label = sprintf("%.1f%%", Percentage), y = Percentage + 2), position = dodge, hjust = 0.5, size = 3.0) +
  theme_minimal() +
  labs(title = "Average Percentage Spending on Gift Types by Gender",
       x = "Gift Type",
       y = "Average Percentage Spending",
       fill = "Gender") +
  theme(axis.text.y = element_text(angle = 0, hjust = 1))

The illustrated figure reveals that men allocate the majority of their expenditure (56%) towards purchasing flowers for celebrating Valentine's Day. In contrast, women predominantly spend their funds (59%) on purchasing candy. The most notable distinction is observed in the flower category, where men exhibit the highest expenditure.
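To back up a statement like "the most notable distinction is in the flower category", the gap between genders can be ranked per gift type. The following is a minimal sketch, not the authors' code: `gender_gap` is a hypothetical helper, it assumes the `Gender` column holds exactly the values "Men" and "Women", and in the toy input only the 56 (men, flowers) and 59 (women, candy) come from the text above. The other two numbers are placeholders.

```r
# Hypothetical helper: rank gift types by the absolute gap between genders.
# Assumes a wide table shaped like gifts_gender.csv: one row per gender,
# a 'Gender' column with the values "Men" and "Women" (an assumption),
# and one numeric column per gift type.
gender_gap <- function(df) {
  gift_cols <- setdiff(names(df), "Gender")
  men <- unlist(df[df$Gender == "Men", gift_cols])
  women <- unlist(df[df$Gender == "Women", gift_cols])
  gap <- data.frame(
    GiftType = gift_cols,
    Men = as.numeric(men),
    Women = as.numeric(women),
    stringsAsFactors = FALSE
  )
  gap$Difference <- gap$Men - gap$Women
  # Largest absolute difference first
  gap[order(-abs(gap$Difference)), ]
}

# Toy input: 56 and 59 are quoted from the text; 50 and 20 are placeholders.
toy_gender <- data.frame(
  Gender = c("Men", "Women"),
  Candy = c(50, 59),
  Flowers = c(56, 20),
  stringsAsFactors = FALSE
)
gap_toy <- gender_gap(toy_gender)
print(gap_toy)
```

With the real file, any non-gift columns would need to be dropped before calling the helper.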

b. How does the average percentage spending vary across different age groups?

# Load necessary libraries
library(ggplot2)
library(tidyr)

# Read the dataset
gifts_age <- read.csv("./data/gifts_age.csv")

# Reshape the data from wide to long format to facilitate plotting
gifts_long <- pivot_longer(gifts_age, 
                           cols = Candy:GiftCards, 
                           names_to = "GiftCategory", 
                           values_to = "Percentage")

# Plot
ggplot(gifts_long, aes(x = Age, y = Percentage, fill = GiftCategory)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  scale_fill_manual(values = c("#FFBCB4", "#FFD64C", "#00BA38", "#55FF8A", "#00B9F6", "#B0E1DD", "#C77CFF"))+
  labs(title = "Spending on Gift Categories Across Age Groups",
       x = "Age Group",
       y = "Percentage",
       fill = "Gift Category") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

When considering the impact of age on Valentine's Day celebrations, a contrast emerges between younger individuals and the elderly. Younger people are more inclined to purchase candy than greeting cards.
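One way to check the contrast between age groups is to look up the leading gift category for each group. The sketch below is illustrative only: `top_category_by_age` is a hypothetical helper, the age labels are assumed, and every number in the toy input is a made-up placeholder rather than a value from the dataset.

```r
# Hypothetical helper: find the gift category with the highest value
# for each age group. 'gift_cols' names the columns to compare, so that
# non-gift columns in the real file can be left out.
top_category_by_age <- function(df, gift_cols, age_col = "Age") {
  values <- as.matrix(df[gift_cols])
  # Index of the largest value in each row (first one wins on ties)
  top_idx <- max.col(values, ties.method = "first")
  data.frame(
    Age = df[[age_col]],
    TopCategory = gift_cols[top_idx],
    stringsAsFactors = FALSE
  )
}

# Toy input: all values are placeholders, not taken from gifts_age.csv.
toy_age <- data.frame(
  Age = c("18-24", "65+"),
  Candy = c(70, 40),
  GreetingCards = c(30, 45),
  stringsAsFactors = FALSE
)
top_toy <- top_category_by_age(toy_age, c("Candy", "GreetingCards"))
print(top_toy)
```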

Question 2: How has overall spending changed over the years?

Question: How have the overall spending on celebrating, the per-person spending, and the spending on different gift categories (e.g., Candy, Flowers, Jewelry, Greeting Cards, Evening Out, Clothing, and Gift Cards) changed over the years, and how do these trends relate to economic factors or events?

a. How many people celebrated Valentine's Day in the period 2010-2022?

First, we explore how many people celebrated Valentine's Day in the period 2010-2022. To do this, we use the "PercentCelebrating" column from the "historical_spending.csv" file.

The figure shows that a majority of people take part in Valentine's Day celebrations. To understand the trend behind this, we look at how the share of participants changes from year to year, plotting the data from "historical_spending.csv" as a line chart. We also mark the COVID-19 outbreak on the chart to explore its impact on people's behavior.

library(ggplot2)
library(dplyr)
# Load the historical spending data
data <- read.csv("./data/historical_spending.csv")
# Average share of people celebrating across all years, and its complement
meanCele <- mean(data$PercentCelebrating)

meanNot <- 100 - meanCele
newdata <- data.frame(
  group=c("Celebrating", 'Not celebrating'),
  value=c(meanCele, meanNot)
)
newdata <- newdata |>
  arrange(desc(group)) |>
  mutate(prop = value / sum(newdata$value) *100) |>
  mutate(ypos = cumsum(prop)- 0.5*prop )

# Basic piechart
ggplot(newdata, aes(x="", y=prop, fill=group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + 
  theme(legend.position="none") +

  geom_text(aes(y = ypos, label = group), color = "white", size=6) +
  scale_fill_brewer(palette="Set1")

b. How has the percentage of people celebrating changed over the years?

The figure reveals a declining trend in the share of people celebrating Valentine's Day. However, this downward trajectory was disrupted in 2020 by the outbreak of COVID-19 and the subsequent national lockdowns. Despite a temporary resurgence during the pandemic, the decline in participation has persisted even after the pandemic was brought under control. For the most recent years in the data, we observe that more than half (53 percent) of U.S. consumers planned to celebrate the holiday in 2022, up from 52 percent in 2021.
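The year-to-year movement described above can also be computed directly rather than read off the chart. This is a small sketch under assumptions: `yoy_change` is a hypothetical helper, and the toy input uses only the two values quoted in the text (52 percent in 2021, 53 percent in 2022). With the real data, the `Year` and `PercentCelebrating` columns would be passed instead.

```r
# Hypothetical helper: change in a value from one year to the next.
# Sorts by year first so the differences are in chronological order.
yoy_change <- function(year, value) {
  ord <- order(year)
  data.frame(
    Year = year[ord][-1],          # the later year of each pair
    Change = diff(value[ord])      # value in that year minus the previous one
  )
}

# Toy input: the two percentages quoted in the text above.
toy_change <- yoy_change(c(2021, 2022), c(52, 53))
print(toy_change)
```

A positive `Change` marks a year in which participation rose; a run of negative values corresponds to the declining trend.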

library(ggplot2)
library(emojifont)
library(emoGG)
# search_emoji("heart")
# Assuming 'data' has been read from "historical_spending.csv"
# Ensure this line is correctly loading your data
data <- read.csv("./data/historical_spending.csv")
# Create a new column for labels, spacing out every two years
data$Label <- ifelse(seq_along(data$Year) %% 2 == 0, paste0(data$PercentCelebrating, "%"), NA)

# Adjust the annotation position for the COVID-19 label
annotation_y_position <- max(data$PercentCelebrating, na.rm = TRUE) * 0.95 # Adjust vertically to avoid overlap
annotation_x_position <- 2020 - 2 # Move text to the left of the vertical line
spline_int <- as.data.frame(spline(data$Year, data$PercentCelebrating))
# Create a line plot with a COVID-19 pandemic indicator
ggplot(data, aes(x = Year, y = PercentCelebrating)) + geom_point()+
  geom_emoji(emoji = "2764")+
  geom_line(data = spline_int, aes(x = x, y = y),color="#c90970", size=1.0) + # Draw the line
  geom_label(
    aes(label = Label),
    nudge_x = 0.45,
    nudge_y = 0.45
  ) + # geom_label() does not support check_overlap, so it is omitted

  geom_vline(xintercept = 2020, linetype = "dashed", color = "black") + # Add a vertical line for the pandemic start
  annotate("text", x = annotation_x_position, y = annotation_y_position, label = "COVID-19 Pandemic Start", vjust = -0.5, color = "black", angle = 0) + # Adjusted annotation
  theme_minimal() + # Use a minimal theme
  labs(title = "Percentage of People Celebrating Over the Years",
       x = "Year",
       y = "Percentage Celebrating") +
  scale_x_continuous(breaks = data$Year) # Ensure all years are shown

c. How has per-person spending changed over the years?

Similarly, we use a line chart to illustrate the trend in Valentine's Day expenditure. Using the same dataset and event indicator, we observe the opposite trend: spending on celebrations rose between 2010 and 2022. On average, consumers anticipate spending $185.81 each, an increase of nearly $8 compared to the average Valentine's Day expenditure over the past five years. In contrast, the average spending per person recorded in 2010 was only $103.


library(ggplot2)

# Assuming 'data' has been read from "historical_spending.csv"
# Ensure this line is correctly loading your data
data <- read.csv("./data/historical_spending.csv")

# Optional: Create a new column to indicate which points to label
data$Label <- ifelse(seq_along(data$Year) %% 2 == 0, as.character(data$PerPerson), NA) # Label every other point

# Adjust the annotation position based on the PerPerson spending range
annotation_y_position <- max(data$PerPerson) * 0.95 # Adjust vertically to avoid overlap
annotation_x_position <- 2020 - 2 # Move text to the left of the vertical line if needed
spline_int <- as.data.frame(spline(data$Year, data$PerPerson))
# Create a line plot focused on Per Person Spending over the years
ggplot(data, aes(x = Year, y = PerPerson)) +
  geom_line(data = spline_int, aes(x = x, y = y),color="#c90970", size = 1.0) + # Draw the line
  geom_point(color = "blue") + # Add points
  geom_text(aes(label = Label), vjust = -1, check_overlap = TRUE) + # Add labels for spaced-out points
  geom_vline(xintercept = 2020, linetype = "dashed", color = "black") + # Add a vertical line for the pandemic start
  annotate("text", x = annotation_x_position, y = annotation_y_position, label = "COVID-19 Pandemic Start", vjust = -0.5, color = "black", angle = 0) + # Adjusted annotation
  theme_minimal() + # Use a minimal theme
  labs(title = "Per Person Spending Over the Years",
       x = "Year",
       y = "Per Person Spending (US Dollar)") +
  scale_x_continuous(breaks = data$Year) # Ensure all years are shown

d. How has spending on different gift categories changed over the years?

In this section, multiple line charts are used to illustrate the trends. Notably, for the year 2022, it is projected that total spending on Jewelry and Evening Out could increase by more than $45 and $31, respectively. After the pandemic, only gift card purchases surged, while the other categories trended downward. The most significant decline was in dining out, which fell nearly to the level of expenditure on clothing. However, the costliest gift category, jewelry, more than doubled from $21.52 to $45.57 over the recorded period, while spending on the remaining categories remained relatively unchanged, resulting in an overall uptrend in spending per person. We also obtain the price of gold over the same period for more insight. Moreover, we compare the percentage of people spending on each category instead of the dollar amounts. A surprising observation is that the proportion for jewelry does not change much.
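Comparing proportions rather than dollar amounts can be sketched as follows. This is one possible reading of that comparison (each category's share of the summed category spending per year), not necessarily the authors' calculation. `category_share` is a hypothetical helper, and the toy numbers are made up to show that a category can double in dollars while its share stays the same.

```r
# Hypothetical helper: each category's share (in percent) of the total
# spending across the given categories, computed separately per year.
category_share <- function(df, category_cols) {
  amounts <- as.matrix(df[category_cols])
  totals <- rowSums(amounts)
  # Dividing a matrix by a vector of row totals works column by column,
  # so every row is divided by its own total.
  shares <- amounts / totals * 100
  data.frame(Year = df$Year, round(shares, 1))
}

# Toy input: made-up values, not taken from historical_spending.csv.
toy_spend <- data.frame(
  Year = c(2010, 2022),
  Jewelry = c(20, 40),
  Other = c(80, 160)
)
share_toy <- category_share(toy_spend, c("Jewelry", "Other"))
print(share_toy)
```

In the toy table the Jewelry amount doubles, yet its share is 20 percent in both years because the total doubles as well.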

library(ggplot2)
library(tidyr)
library(dplyr) # For data manipulation

# Assuming your data is already loaded into 'data'
data <- read.csv("./data/historical_spending.csv")

# Transform data from wide to long format
data_long <- pivot_longer(data, cols = -Year, names_to = "Category", values_to = "Spending")

# Filter out 'PerPerson' and 'PercentCelebrating' categories
data_long_filtered <- data_long %>%
  filter(!Category %in% c("PerPerson", "PercentCelebrating")) %>%
  mutate(SpendingLabel = ifelse(Year %% 6 == 0, as.character(Spending), NA)) # Label only years divisible by 6 (2010, 2016, 2022) for clarity

# Custom color palette (adjust as needed for your categories)
my_colors <- c("Candy" = "darkred", "Flowers" = "darkgreen", "Jewelry" = "#0072B2", "GreetingCards" = "darkorange", "EveningOut" = "#5D3FD3", "Clothing" = "darkmagenta", "GiftCards" = "darkcyan")

# Create the line plot with customized points
ggplot(data_long_filtered, aes(x = Year, y = Spending, color = Category)) +
  geom_line() +
  geom_point(aes(shape = Category), size = 2, stroke = 2) + # Customized points with different shapes for categories
  geom_text(aes(label = SpendingLabel), vjust = -1.5, check_overlap = TRUE) + # Add spending labels for the selected years only
  geom_vline(xintercept = 2020, linetype = "dashed", color = "red", size = 1) + # Add a vertical line for the COVID-19 pandemic start
  annotate("text", x = 2020, y = 23, label = "COVID-19 Pandemic Start", vjust = -1, color = "red", angle = 0, hjust = 1.1, size = 5) + # Annotate the line
  theme_minimal() +
  theme(
    panel.grid.major = element_line(color = "grey80"), # Darker grid lines
    panel.grid.minor = element_line(color = "grey80", size = 0.25)  # Darker and finer minor grid lines
  ) +
  labs(title = "Yearly Spending on Different Gift Categories",
       x = "Year",
       y = "Spending (US Dollar)",
       color = "Category") +
  scale_x_continuous(breaks = seq(min(data$Year), max(data$Year), by = 1)) + # Adjust the x-axis breaks if needed
  scale_y_continuous(labels = scales::comma) + # Use comma for large numbers, remove if you prefer the log scale
  scale_shape_manual(values = c(16, 17, 18, 19, 20, 21, 22)) + # Custom shapes for categories, adjust numbers as needed
  scale_color_manual(values = my_colors) # Use custom colors

Conclusion:

The analysis conducted reveals insightful trends in Valentine's Day celebrations and spending patterns over the years. Despite a general decline in the number of people participating in Valentine's Day festivities, a surge was observed in 2020 due to the COVID-19 pandemic, followed by a continuation of the downward trend post-pandemic. Conversely, expenditure on Valentine's Day activities exhibited a consistent upward trajectory between 2010 and 2022, with consumers anticipating an increase in spending per person, notably in categories such as Jewelry and Evening Dates. During the pandemic, while gift card purchases surged, other categories experienced a decline, particularly in dining out, which approached levels similar to clothing expenditure. Notably, jewelry emerged as the costliest gift category, more than doubling in average expenditure per person. Overall, the analysis indicates shifting trends in both participation and spending habits, influenced by external factors such as the pandemic.