Anna Vilardell
Data Analytics-Part Time Course | Barcelona June 2020
PRO7-FinalProject
Deriving meaningful insights from data, and converting knowledge into action, is easier said than done.
Organizations face challenges in adopting analytics. For instance, two of the challenges I found during this project were:
Regardless of the above-mentioned challenges, my overall goal throughout this project is to show the importance of data and to capture the impact of analytics.
As the definition says, machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.
In my case, I've used machine learning to create a model that not only predicts Ironhack Sales but also shows the journey of that sale with a decision tree.
Besides my deep love for data, being the Admissions Manager at Ironhack Barcelona has driven me to see the importance of data and its potential. Unfortunately, most of the time we collect a lot of data without paying much attention to its quality, and without having time to look at the data we have and use it smartly. So, this project has given me the opportunity to work on it.
Dataset
The dataset covers more than 40k potential students who applied to one of the Ironhack campuses between 2016 and 2021. Note that Ironhack has 10 campuses, including the Remote campus.
Besides demographic data, the specific data we have about our students includes the dates and timings of the bootcamp, the type of bootcamp they applied to, the cost of the bootcamps...
Metadata
- bootcamp_course: course associated to the opportunity, either WD, UX or DA.
- bootcamp_format: format associated to the opportunity, either FT or PT.
- bootcamp_start_date: Start date of the course associated to the opportunity.
- stage: Name of the stage the application is in at that moment.
- lost_deal_reason: Reason why the applicant has dropped the admission process
- stage_before_lost: The stage where the opportunity is before getting lost
- created_date: Date and time when this record was created.
- close_date: Date when the opportunity is expected to close (before closing) or closed.
- drop_reason: Reason why the applicant has dropped after paying his deposit
- net_amount: Amount minus Discount or Scholarship Amount
- discount_type: Type of the discount, either Scholarship or Discount
- discount_name: Name of the discount initiative associated to the opportunity
- discount_amount: Amount of the discount that the applicant received
- scholarship_name: Name of the scholarship initiative associated to the opportunity
- scholarship_amount: Amount of the scholarship discount given to the opportunity
- financing_options: name of the financing option
- financing_option_amount: amount in local currency of the financing option
- deposit_payment: The amount of the tuition that neither the student nor Ironhack paid
- Stage Duration: The number of days the opportunity was in the stage listed in the Stage column
- ...
Main objectives
I've created a model to classify whether a potential student is going to enrol at Ironhack or not and, moreover, to understand the journey he/she is most likely to follow. See below an example of a tree showing the journey of the Paris campus prospects:
The 3 objectives that will impact Ironhack are:
Next Steps
We've been able to confirm that not all Ironhack campuses behave equally when it comes to the sales funnel. Besides, we've shown how important it is to collect quality data. Therefore, my next steps will be:
- Treat global Black Data
- Collect new data (focusing on the features of global importance, we will request that the necessary new data be entered in Salesforce)
- New ways to get data (leaving behind manual work for the sales team during the process of admission)
- Use the biggest insights from the model results, for each country, to start dedicating different resources to each type of prospect and stage
Environment
Data Preparation
2.1 EDA
2.2 Data Wrangling
- Explore Data
- Clean Data
- Remove useless columns
- Create new columns
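The wrangling steps above could look roughly like the following pandas sketch. It is only an illustration under assumptions: the sample rows are made up, the `"Closed Won"` stage label and the `internal_id` column are hypothetical, and the derived column names (`days_to_close`, `total_reduction`, `enrolled`) are not taken from the project notebooks.

```python
import pandas as pd

# Tiny made-up sample; the real data lives in the CSVs listed in this README.
df = pd.DataFrame({
    "bootcamp_course": ["WD", "UX", "DA"],
    "stage": ["Closed Won", "Closed Lost", "Closed Won"],  # stage labels are assumed
    "created_date": ["2020-01-10", "2020-02-01", "2020-03-15"],
    "close_date": ["2020-01-25", "2020-02-03", "2020-04-14"],
    "discount_amount": [500.0, None, 0.0],
    "scholarship_amount": [None, None, 1000.0],
    "internal_id": ["a1", "b2", "c3"],  # hypothetical column with no predictive value
})


def wrangle(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw table, drop unused columns and add derived ones."""
    out = raw.copy()

    # Clean data: parse dates and treat missing amounts as zero.
    for col in ["created_date", "close_date"]:
        out[col] = pd.to_datetime(out[col])
    amount_cols = ["discount_amount", "scholarship_amount"]
    out[amount_cols] = out[amount_cols].fillna(0.0)

    # Remove useless columns.
    out = out.drop(columns=["internal_id"])

    # Create new columns (names are illustrative).
    out["days_to_close"] = (out["close_date"] - out["created_date"]).dt.days
    out["total_reduction"] = out["discount_amount"] + out["scholarship_amount"]
    out["enrolled"] = (out["stage"] == "Closed Won").astype(int)
    return out


clean = wrangle(df)
print(clean[["days_to_close", "total_reduction", "enrolled"]])
```

Keeping the steps in one function makes it easier to apply the same cleaning to the per-campus files as to the all-years file.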
2.3 Feature Selection
- Feature selection was done together with the model
Select From Model (Classification Models)
- Data Preprocessing
- Classification Models
- LogisticRegression
- SVM
- KNeighborsClassifier
- LinearDiscriminantAnalysis
- GaussianNB
- RandomForestClassifier
- MLPClassifier
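As a rough illustration of this step, the sketch below chains preprocessing, scikit-learn's `SelectFromModel` and a classifier in one pipeline, then compares a few of the listed classifiers by cross-validated accuracy. The data is synthetic (`make_classification`), and using a random forest as the selector, accuracy as the metric and only three candidates are assumptions made for brevity, not necessarily what the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared applications table (not Ironhack data).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# A subset of the classifiers listed above.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, clf in candidates.items():
    pipe = make_pipeline(
        StandardScaler(),  # data preprocessing
        # Keep features whose forest importance is above the mean importance.
        SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0)),
        clf,
    )
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Putting the selector inside the pipeline means features are re-selected on each training fold, which avoids leaking information from the validation fold.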
Train Models (Machine Learning Classification Models)
- RandomForestClassifier
- Data Preprocessing
- Gridsearch
- Classifier (with best params)
- Feature Importance
- Tree Visualization
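The random forest steps listed above can be sketched as follows, again on synthetic data. The parameter grid, the placeholder feature names and the text-based tree rendering via `export_text` are assumptions for illustration; the notebook may use a different grid and a graphical tree plot.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import export_text

# Synthetic stand-in for the prepared applications table (not Ironhack data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Grid search over a small, illustrative parameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# Classifier with the best parameters (refit on the full training split).
best_rf = search.best_estimator_
print("Best params:", search.best_params_)
print("Test accuracy:", round(best_rf.score(X_test, y_test), 3))

# Feature importance, sorted from most to least important.
importances = pd.Series(
    best_rf.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)

# Tree visualization: show the top levels of one tree from the forest as text.
tree_text = export_text(
    best_rf.estimators_[0], feature_names=feature_names, max_depth=2
)
print(tree_text)
```

A single tree from the forest is only one of many voters, so its splits illustrate a possible prospect journey rather than the full model.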
Ensemble Methods
- XGBoost
- Data Preprocessing
- Gridsearch
- Classifier (with best params)
- Feature Importance
CSVs folder
CSVs raw data (will only be available to Ironhack Staff):
CSVs clean data:
apps_allYears_semiClean.csv
apps_allYears_clean_selCols.csv
apps_allYears_clean_selCols_addCols.csv
apps_BCN20201_clean_selCols.csv
apps_BCN20201_clean_Sel_addCols.csv
apps_PAR20201_clean_selCols.csv
apps_PAR20201_clean_Sel_addCols.csv
Notebooks folder
Environment
Data Preparation
Select From Model (Classification Models)
Train Models (Machine Learning Classification Models)
Ensemble Methods
Images folder
Other documents