I. Introduction
II. Data Availability and Provenance Statements
III. Requirements
V. Instructions to Replicators
Author | Contact |
---|---|
Bianca Magalotti | b.magalotti@tilburguniversity.edu |
Emanuele Franceschini | e.franceschini@tilburguniversity.edu |
Tommaso Rabino | t.rabino@tilburguniversity.edu |
Akram Benmrit | a.benmrit@tilburguniversity.edu |
Colin Tan | c.tan@tilburguniversity.edu |
The rapid growth of Airbnb, a global platform for short-term property rentals, has introduced new challenges and opportunities for hosts, creating a need for data-driven tools and insights. The objective of this study is twofold. First, we developed an XGBoost regression model to help Airbnb hosts set optimal rental prices for their listings. Second, we used insights from the regression model and other data analysis techniques to highlight the critical factors influencing listing prices. Our findings not only provide valuable decision-making tools for hosts but also contribute to the broader discourse on short-term property rentals in Italy.
The code in this replication package constructs all the files in the folders data-preparation and analysis from the data source (Milan listings.csv from Inside Airbnb) using R. Six main code files (5 in RMarkdown, 1 in plain R) produce all the data, insights, and results discussed in the final paper, which can be found in the folder gen/paper/output. The replicator should expect the code to run for about 4 hours.
The data was gathered from Inside Airbnb, an independent project that is not affiliated with Airbnb and that publishes publicly available datasets of Airbnb listings. The platform provides the most recent version of the data for many cities around the world, allowing users to download the datasets they need.
Specifically, the website provides both a downloader page, where the dataset can be freely downloaded (Inside Airbnb: Get the Data), and an explorer page, where a user-friendly tool lets users analyze and explore a specific listings dataset (Inside Airbnb: Explore).
For the purposes of this project, the dataset used was “listings.csv.gz”, which contains the detailed listings data for the city of Milan, last updated on 21 June 2023. The dataset can be downloaded and explored from the following two links: Inside Airbnb: Milan Listings Downloader; Inside Airbnb: Milan Listings Explorer.
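As a minimal sketch of how a compressed listings file can be read in R, the snippet below writes a tiny stand-in file and reads it back with base R. The helper name `read_listings`, the toy columns, and the temporary path are illustrative assumptions; the project's own download logic lives in 1_data_download and may differ.

```r
# Hypothetical helper: read a gzip-compressed CSV such as listings.csv.gz.
read_listings <- function(path) {
  con <- gzfile(path, open = "rt")   # open the archive as a text connection
  on.exit(close(con))                # always release the connection
  read.csv(con, stringsAsFactors = FALSE)
}

# Toy stand-in for the real file (two made-up rows, not Inside Airbnb data).
toy <- data.frame(id = c(1, 2),
                  room_type = c("Entire home/apt", "Private room"))
tmp <- tempfile(fileext = ".csv.gz")
out <- gzfile(tmp, open = "wt")
write.csv(toy, out, row.names = FALSE)
close(out)

listings <- read_listings(tmp)
str(listings)
```

With the real file, the same helper would be pointed at the downloaded listings.csv.gz instead of the temporary path.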
|-- data
|   |-- dataset1
|-- gen
|   |-- analysis
|   |   |-- input
|   |-- data-preparation
|   |   |-- input
|   |-- paper
|   |   |-- input
|   |   |-- output
|-- src
|   |-- analysis
|   |   |-- 5_regression_model
|   |   |-- clean-up_an
|   |-- data-preparation
|   |   |-- 0_installing_packages
|   |   |-- 1_data_download
|   |   |-- 2_data_cleaning
|   |   |-- 3_data_exploration
|   |   |-- 4_data_preparation
|   |   |-- clean-up_dp
|   |-- shinyapp
|   |   |-- 6_shinyapp
|-- save_time_objects
|-- .gitignore
|-- README.md
|-- makefile
Additionally, once all source files of the present project are run, the following datasets will be automatically stored in the described folders and formats:
Folder | Name | Format |
---|---|---|
data/dataset1 | milan_listings | .csv |
gen/data-preparation/input | raw_data | .rds |
gen/data-preparation/input | clean_data | .rds |
gen/analysis/input | regression_data | .rds |
These steps ensure that users can always inspect the dataset characteristics at each stage of the project.
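The intermediate .rds files can be inspected with base R. The sketch below shows a save-and-restore round trip on a made-up data frame; the object name `stage_data` and the temporary path are assumptions for illustration and do not touch the project's gen/ folders.

```r
# Made-up stand-in for a stage output such as clean_data.
stage_data <- data.frame(id = 1:3, price = c(80, 120, 95))

# Temporary location, used here instead of the project's gen/ tree.
rds_path <- file.path(tempdir(), "clean_data.rds")

saveRDS(stage_data, rds_path)     # write a single R object to disk
restored <- readRDS(rds_path)     # read it back, preserving column types

identical(stage_data, restored)
```

Unlike a CSV export, an .rds file keeps factor levels and column classes, which is presumably why it suits hand-offs between the numbered scripts.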
The dataset used for the present project comprises 23,142 observations, each corresponding to an individual Airbnb listing in Milan, and 75 variables that offer diverse insights into listing characteristics, management, and performance.
It's crucial to highlight that not all listed variables will be utilized in subsequent analyses. Some will be omitted due to their lack of relevance or contribution to the analytical objectives of this project. Conversely, in the course of analysis, certain feature engineering operations were applied to create new variables. This enhances the encapsulation and representation of information in the dataset, facilitating a more comprehensive understanding of trends, patterns, and insights. Finally, due to data cleaning operations, the number of observations in the dataset may vary.
Here’s an overview of what each (original) variable represents:
Variable | Description or Motivation for Removal |
---|---|
id | A unique identifier for the listing. |
listing_url | The URL of the listing on Airbnb. |
scrape_id | The unique id of the scraping session. |
last_scraped | The date when the data for the listing was last scraped. |
source | The origin of the listing data. |
name | The name of the listing. |
description | A comprehensive description of the listing. |
neighborhood_overview | An overview of the listing's neighborhood. |
picture_url | The URL of the listing's featured picture. |
host_id | A unique identifier for the host of the listing. |
host_url | The URL of the host's Airbnb profile. |
host_name | The name of the host. |
host_since | The date when the host joined Airbnb. |
host_location | The location of the host. |
host_about | Information provided by the host about themselves. |
host_response_time | The typical amount of time the host takes to respond to messages. |
host_response_rate | The host’s response rate to messages. |
host_acceptance_rate | The rate at which the host accepts booking requests. |
host_is_superhost | Indicator of whether the host is a Superhost. |
host_thumbnail_url | The URL of the host’s thumbnail picture. |
host_picture_url | The URL of the host’s profile picture. |
host_neighbourhood | The neighborhood the host is located in. |
host_listings_count | The total number of listings the host has. |
host_total_listings_count | The total number of listings the host has across all platforms. |
host_verifications | The methods the host has used to verify their identity. |
host_has_profile_pic | Indicator of whether the host has a profile picture. |
host_identity_verified | Indicator of whether the host’s identity has been verified. |
neighbourhood | The neighborhood the listing is located in. |
neighbourhood_cleansed | The cleaned name of the neighborhood the listing is located in. |
neighbourhood_group_cleansed | The cleaned name of the neighborhood group the listing is located in. |
latitude & longitude | The geographical coordinates of the listing. |
property_type | The type of property listed. |
room_type | The type of room listed. |
accommodates | The number of people the listing can accommodate. |
bathrooms | The number of bathrooms in the listing. |
bathrooms_text | Textual description of the bathrooms. |
bedrooms | The number of bedrooms in the listing. |
beds | The number of beds in the listing. |
amenities | The amenities offered by the listing. |
price | The price of the listing per night. |
minimum_nights to maximum_nights | Various restrictions and requirements related to the minimum and maximum nights a guest can book. |
calendar_updated | When the listing’s calendar was last updated. |
has_availability | Indicator of whether the listing is available. |
availability_30 to availability_365 | The number of days the listing is available over different time spans. |
calendar_last_scraped | The date when the listing’s calendar was last scraped. |
number_of_reviews to number_of_reviews_l30d | Various measures of the number of reviews the listing has received. |
first_review & last_review | Dates of the first and last reviews received. |
review_scores_rating to review_scores_value | Various scores representing the quality of the listing as rated by guests. |
license | The license number of the listing, if applicable. |
instant_bookable | Indicator of whether the listing can be booked instantly. |
calculated_host_listings_count to calculated_host_listings_count_shared_rooms | Various measures of the number of listings the host has. |
reviews_per_month | The average number of reviews the listing receives per month. |
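Some of these variables need type conversion before modeling. As a hedged illustration, the sketch below assumes that price arrives as a currency string such as "$1,250.00" and converts it to a number with base R; the sample values are made up, and the project's actual cleaning steps in 2_data_cleaning may be implemented differently.

```r
# Assumed raw format: currency strings with a symbol and thousands separators.
raw_price <- c("$85.00", "$1,250.00", NA)

# Drop everything except digits and the decimal point, then convert.
# Missing values stay missing.
price_num <- as.numeric(gsub("[^0-9.]", "", raw_price))

price_num
```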
The present project does not involve exceptionally large datasets, and the R environment is systematically cleaned by a dedicated code snippet at the end of each code file, so the project can be run on a standard PC commonly available in 2023. The whole set of code files was last run on an Apple MacBook Air (2020) with the following technical specifications: (i) CPU: Apple M1 8-core - 3.2 GHz; (ii) GPU: Apple M1 7-Core GPU; (iii) RAM: 8GB; (iv) SSD: 256GB; (v) Operating System: macOS 14.
R & RStudio --> The code was developed and executed in R (R version 4.2.2), utilizing RStudio (RStudio version 2022.12.0+353) as the integrated development environment (IDE). The software and the programming language can be installed from this link: R and RStudio Installation Guide.
R Libraries and Packages --> The source code files use the following R packages and libraries. You do not need to download or load them in advance; the source code file 0_installing_packages handles this for you.
- GENERAL PACKAGES:
library(readr)
library(tidyverse) #A "Package of Packages" for Data manipulation and visualization (includes magrittr, lubridate, purrr, tidyr, etc.).
library(dplyr) #Data frame manipulations (select, slice, etc.).
library(jsonlite) #For Amenities Columns Creation
library(moments) #Measuring the skewness.
- REGRESSION PACKAGES
library(caret) #Hyperparameters Tuning.
library(xgboost) #XGBoost Regression.
library(DALEX) #Summary of the XGBoost Regression Model ("explainer").
library(bayesforecast) #Checking Regression Assumptions.
- SHINYAPP PACKAGES
library(shiny) #For the ShinyApp
library(shinyWidgets) #For the ShinyApp
- PLOT AND FIGURES PACKAGES:
library(ggplot2) #Building fancy plots.
library(ggthemes) #Themes for ggplots (e.g. "solarized").
library(ggcorrplot) #For correlograms
library(scales) #Scaling and formatting ggplots (e.g. scale_fill_gradient()).
library(gt) #Latex tables
- WORKING DIRECTORY SETTING PACKAGES
library(here)
library(rstudioapi)
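A setup script of this kind typically checks which packages are missing before installing anything. The following is a minimal sketch of that check, not the content of 0_installing_packages; the helper name `missing_packages` is hypothetical, and the install call is left commented out so the snippet has no side effects beyond loading namespaces.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
# Note: requireNamespace() loads a namespace when the package is available.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported as missing.
to_install <- missing_packages(c("stats", "utils"))

# In a real setup script one might then call:
# if (length(to_install) > 0) install.packages(to_install)
to_install
```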
LaTeX Distribution --> To compile the final paper into a PDF document with LaTeX styling, you need to have a LaTeX distribution installed on your computer. To install a LaTeX Distribution:
LaTeX Packages --> For the same purpose, you also need to make sure that the following LaTeX packages are installed in your LaTeX distribution. You can typically install missing packages using the package manager of your LaTeX distribution.
%----------------------------------------------------------------------------------------
% FONTS, MARGINS, AND PDF STYLING
%----------------------------------------------------------------------------------------
- babel: Language settings.
- fontenc: Font encoding.
- inputenc: Required for inputting international characters.
- mathpazo: Use the Palatino font.
- microtype: Slightly tweak font spacing for aesthetics.
- mathptmx: Times New Roman font for text.
- helvet: Arial-like font for sans-serif.
- setspace: Line spacing.
- geometry: set the margin.
- amsmath: Math equations.
- amssymb: Math symbols.
- hyperref: Hyperlinks and URLs.
- enumerate: Enumerate environment.
- enumitem: Required for list customization.
- multicol: For two columns.
%----------------------------------------------------------------------------------------
% HEADERS, FOOTERS, TITLE, ABSTRACT, BIBLIOGRAPHY, CAPTIONS AND GRAPHICS
%----------------------------------------------------------------------------------------
- fancyhdr: Header and footer customization.
- titlesec: Section titles formatting.
- titling: Required for customizing the title section.
- biblatex: to style bibliography.
- natbib: Citation style.
- appendix: Appendix formatting.
- abstract: Abstract formatting.
- caption: Captions customization.
- graphicx: Graphics.
%----------------------------------------------------------------------------------------
% TABLES
%----------------------------------------------------------------------------------------
- color
- rotating
- tabularray
- booktabs
%----------------------------------------------------------------------------------------
% OTHER PACKAGES
%----------------------------------------------------------------------------------------
- etoolbox
- footmisc
- listings
RMarkdown --> RMarkdown was used to convert the code from RStudio into more comprehensible PDF documents, allowing for a seamless representation of the analysis flow. Refer to the RMarkdown Installation Guide for detailed instructions on how to install RMarkdown in your RStudio environment.
make --> The build tool make was employed to manage the automation of the compilation of all source code files and the final paper pdf document. The guide to install make can be found at this page: Make Installation Guide.
Pandoc --> Finally, to make sure that your computer is able to compile the pdf document resulting from the RMarkdown source files, you should install Pandoc by following this guide: Pandoc Installation Guide.
The whole set of code files was last run on an Apple MacBook Air (2020). On this hardware, the code took almost 5 hours to generate the whole output. Most of that time is spent producing the following R objects:
- xgb_caret.rds (4 hours)
- xgb_mod.rds (5 minutes)
- xgbcv.rds (10 minutes)
Therefore, in case you want to save almost the entire time needed to run the whole set of source code files, follow the "Save Time Instructions" in the section "Instructions to Replicators".
set.seed(999)
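Fixing the seed makes steps that rely on random numbers (for example data splitting or resampling) repeatable. A short illustration with base R, using arbitrary sample sizes chosen only for the demonstration:

```r
set.seed(999)
first_draw <- sample(1:100, 5)

set.seed(999)                      # resetting the seed replays the same random stream
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)
```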
All source code files present in this repository are described in the table below:
File Name | File Format | File Description | File Location | File Output |
---|---|---|---|---|
0_installing_packages | .R | Installs all the necessary R packages and sets the working directory to the source file location. | src/data-preparation | N/A |
1_data_download | .Rmd | Downloads the data source zip file, extracts the database, and loads it into R. | src/data-preparation | 1_data_download.pdf |
2_data_cleaning | .Rmd | Cleans the dataset from NAs, outliers, and useless or empty columns. | src/data-preparation | 2_data_cleaning.pdf |
3_data_exploration | .Rmd | Set of EDA operations, including correlograms and categorical variable visualizations. | src/data-preparation | 3_data_exploration.pdf |
4_data_preparation | .Rmd | Set of operations needed to prepare the dataset for the regression modeling, including computation of logarithm of the DV, one-hot encoding of factor variables, centering and scaling numeric variables, and dividing the dataset into a training and a testing dataset. | src/data-preparation | 4_data_preparation.pdf |
5_regression_model | .Rmd | Hyperparameter tuning, determining the optimal number of iterations, training the model and assessing its performance, checking regression assumptions. | src/analysis | 5_regression_model.pdf |
6_shinyapp | .R | Develops an interactive and user-friendly ShinyApp capable of predicting the price of an Airbnb listing located in Milan. The ShinyApp uses the previously trained and validated regression model to predict the price of a listing whose characteristics (number of rooms, beds, bathrooms, and accommodated people, location, type of apartment, etc.) can be defined a priori by the user. | src/shinyapp | ShinyApp Interface |
clean-up_dp | .R | Removes all irrelevant files, including .Rhistory and .RData, from the folder src/data-preparation. | src/data-preparation | N/A |
clean-up_an | .R | Removes all irrelevant files, including .Rhistory and .RData, from the folder src/analysis. | src/analysis | N/A |
final_paper | .pdf | PDF file with all results and insights gained from the analysis. | gen/paper/output | N/A |
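The preparation steps described for 4_data_preparation (log of the dependent variable, one-hot encoding, centering and scaling, train/test split) can be sketched in base R as follows. This is an illustration on made-up data, not the project's code: the 80/20 split ratio is an assumption, and the project may rely on caret or other helpers instead of the base functions used here.

```r
set.seed(999)

# Made-up listings; column names mirror the dataset, values do not.
toy <- data.frame(
  price = c(60, 85, 120, 200, 95, 150, 70, 310, 110, 90),
  accommodates = c(2, 2, 4, 6, 3, 4, 2, 8, 4, 3),
  room_type = factor(rep(c("Entire home/apt", "Private room"), 5))
)

toy$log_price <- log(toy$price)                        # log of the dependent variable
dummies <- model.matrix(~ room_type - 1, data = toy)   # one-hot encoding of the factor
acc_scaled <- as.numeric(scale(toy$accommodates))      # centering and scaling

prepared <- data.frame(log_price = toy$log_price,
                       accommodates = acc_scaled,
                       dummies)

# Assumed 80/20 split; the ratio used in the project is not stated here.
train_idx <- sample(seq_len(nrow(prepared)), size = floor(0.8 * nrow(prepared)))
train <- prepared[train_idx, ]
test <- prepared[-train_idx, ]
```

Removing the intercept (`- 1`) in `model.matrix()` keeps one indicator column per factor level, which is the usual input shape for tree-based models such as XGBoost.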
To run all source code files of this project automatically, please follow these instructions:
1. Copy the HTTPS URL of this GitHub repository.
2. Open your command line / terminal and move to the directory where you want to store this project's repository. The following is an example of how to change the working directory (replace "C:/Users/Admin/Desktop" with the name of your selected directory):
cd "C:/Users/Admin/Desktop"
3. Then, copy and paste the following command into your command line / terminal (you can also type git clone followed by the URL you copied in step 1):
git clone https://github.com/course-dprep/team-project-team_9_group_project.git
4. Set your working directory to the project repository using the following command (replace your_repository_path with the directory you selected in step 2):
cd "your_repository_path/team-project-team_9_group_project"
5. Type the following command in your terminal / command prompt:
make
6. When make has successfully run all the code, a message such as the following will appear in your terminal / command prompt (the port number may differ):
Listening on http://127.0.0.1:3580
7. Open the link that appears in your terminal / command prompt in your browser (e.g. Google Chrome, Safari, etc.).
8. The ShinyApp will be rendered and you will be able to start playing with it :)
In case you want to save almost the entire time needed to run the whole set of source code files of which this project is composed, follow these instructions:
The code in the 5_regression_model file is written so that it does not reprocess the R objects listed above (xgb_caret.rds, xgb_mod.rds, and xgbcv.rds) if they are already located in the mentioned folder.
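The reuse-if-present behavior described here follows a common caching pattern. A minimal sketch is shown below; the helper name `load_or_compute`, the temporary path, and the stand-in computation are assumptions for illustration, and the actual logic in 5_regression_model may be written differently.

```r
# Hypothetical helper: reuse a saved object if present, otherwise compute and save it.
load_or_compute <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))          # skip the expensive step
  }
  result <- compute()
  saveRDS(result, path)
  result
}

cache_path <- tempfile(fileext = ".rds")   # placeholder for a file such as xgb_caret.rds

# Stand-in for a long-running model fit.
slow_step <- function() {
  Sys.sleep(0.1)
  sum(1:10)
}

first <- load_or_compute(cache_path, slow_step)                     # computes and saves
second <- load_or_compute(cache_path, function() stop("not run"))   # read from disk
```

The second call never evaluates its `compute` argument because the file already exists, which is what saves the hours of hyperparameter tuning on a rerun.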
An alternative route is to run (or knit) all .R and .Rmd files in order (follow the numbers in the file names). Note: through this alternative route, the final_paper.pdf document will not be generated automatically.
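For the manual route, the numbered files can be listed in execution order programmatically. The sketch below builds a toy folder of empty files and sorts them by their numeric prefix; the helper name `ordered_sources` and the temporary folder are hypothetical, and the file names are borrowed from the table of source files.

```r
# Hypothetical helper: list numbered .R/.Rmd files under `root` in numeric order.
ordered_sources <- function(root) {
  files <- list.files(root, pattern = "^[0-9]+_.*\\.(R|Rmd)$",
                      recursive = TRUE, full.names = TRUE)
  prefix <- as.integer(sub("_.*$", "", basename(files)))
  files[order(prefix)]
}

# Toy layout in a temporary folder (empty files only).
root <- file.path(tempdir(), "src_demo")
dir.create(file.path(root, "data-preparation"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "analysis"), showWarnings = FALSE)
file.create(file.path(root, "data-preparation",
                      c("1_data_download.Rmd", "2_data_cleaning.Rmd")))
file.create(file.path(root, "analysis", "5_regression_model.Rmd"))
file.create(file.path(root, "data-preparation", "0_installing_packages.R"))

basename(ordered_sources(root))
# Each file could then be run with source() for .R or rmarkdown::render() for .Rmd.
```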