Tanag3r / stratascratch_projects

Contains stratascratch data project notebooks
MIT License

stratascratch_projects

Ownership: The prompt and dataset are from stratascratch.com; the solution notebook and feature engineering scripts are my own work.

Work-in-Progress:

Prompt

When a consumer places an order on DoorDash, we show the expected time of delivery. It is very important for DoorDash to get this right, as it has a big impact on consumer experience. In this exercise, you will build a model to predict the estimated time taken for a delivery.

Concretely, for a given delivery you must predict the total delivery duration in seconds, i.e., the time taken from:

Start: the time consumer submits the order (created_at) to

End: when the order will be delivered to the consumer (actual_delivery_time)
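
As a minimal sketch of the target defined above, the duration is the difference between the two timestamps, expressed in seconds. The column names created_at and actual_delivery_time come from the prompt; the timestamp format and the example values are assumptions made for illustration only.

```python
from datetime import datetime

# Assumed timestamp layout; adjust if the dataset stores timestamps differently.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def delivery_duration_seconds(created_at: str, actual_delivery_time: str) -> float:
    """Return the seconds elapsed between order submission and delivery."""
    start = datetime.strptime(created_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(actual_delivery_time, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Hypothetical order: submitted at 22:24:17 and delivered at 23:27:16.
example = delivery_duration_seconds("2015-02-06 22:24:17", "2015-02-06 23:27:16")
print(example)  # 3779.0 (1 h 2 min 59 s)
```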

In addition to the system-derived data there are two values produced by other ML models for each order:

Results

The best model I have built so far uses a two-step ensemble approach:

Using this two-step, ensemble approach the best scores I have produced so far are as follows:

Although DoorDash uses RMSE to score this exercise, the MAE and RMSE-to-y_true-standard-deviation ratio provide more context:
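
For reference, the three metrics named above could be computed along the following lines. This is a sketch, not the notebook's code: the function name regression_scores and the sample values are hypothetical, and the population standard deviation is an assumed choice for the denominator of the ratio.

```python
import math
import statistics


def regression_scores(y_true, y_pred):
    """Return RMSE, MAE, and the ratio of RMSE to the spread of y_true."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    # Assumption: population standard deviation of the true durations.
    std = statistics.pstdev(y_true)
    return {"rmse": rmse, "mae": mae, "rmse_to_std": rmse / std}


# Made-up delivery durations in seconds, for illustration only.
y_true = [1800.0, 2400.0, 3000.0, 3600.0]
y_pred = [2000.0, 2300.0, 3100.0, 3300.0]
scores = regression_scores(y_true, y_pred)
print(scores)
```

Always predicting the mean of y_true yields an RMSE equal to the population standard deviation, so a ratio below 1 indicates the model does better than that constant baseline.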

To provide a benchmark for performance I found two other notebooks that work through this prompt and dataset:

Step One: Data Cleaning

Step Two: Feature Engineering

Summary of feature engineering:

Top ten features by importance for the two-step model (prep time prediction, then delivery time prediction):

Feature Score
pred_order_prep_time 0.314096
est_time_non-prep 0.050941
estimated_store_to_consumer_driving_duration 0.022149
market_id__4.0 0.012894
onshift_to_outstanding 0.012503
clean_store_primary_category__dessert 0.011970
total_items 0.010003
hour_mean_total_onshift_dashers 0.009541
estimated_order_place_duration 0.009181
clean_store_primary_category__american 0.008342
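
The table above shows the stage-one output, pred_order_prep_time, acting as the strongest input to the stage-two model. The chaining can be sketched as below. Everything here is an assumption made for illustration: the data is synthetic, ordinary least squares stands in for whatever estimators the notebook actually uses, and the helper names fit_linear and predict_linear are hypothetical. Only the feature names pred_order_prep_time and est_time_non-prep come from the table.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data; none of these values come from the real dataset.
order_features = rng.normal(size=(n, 3))
est_time_non_prep = rng.uniform(300, 1500, n)  # seconds
prep_time = (
    900
    + order_features @ np.array([120.0, -60.0, 30.0])
    + rng.normal(0, 50, n)
)
delivery_time = prep_time + est_time_non_prep + rng.normal(0, 50, n)


def fit_linear(X, y):
    """Fit least-squares coefficients with an intercept column prepended."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply coefficients returned by fit_linear to new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


train, test = slice(0, 400), slice(400, n)

# Step one: learn to predict order prep time from order-level features.
prep_coef = fit_linear(order_features[train], prep_time[train])
pred_order_prep_time = predict_linear(prep_coef, order_features)

# Step two: feed the stage-one prediction into the delivery-time model.
stage_two_X = np.column_stack([pred_order_prep_time, est_time_non_prep])
delivery_coef = fit_linear(stage_two_X[train], delivery_time[train])
pred_delivery = predict_linear(delivery_coef, stage_two_X[test])

rmse = float(np.sqrt(np.mean((pred_delivery - delivery_time[test]) ** 2)))
print(f"held-out RMSE: {rmse:.1f} seconds")
```

One design point worth noting: this sketch trains stage two on in-sample stage-one predictions. With more flexible estimators, out-of-fold predictions for the training rows are a common way to keep stage two from trusting stage one more than it should.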

For comparison, these are the top ten features for a single model approach:

Feature Score
hour_mean_total_outstanding_orders 0.242148
est_time_non-prep 0.116266
onshift_to_outstanding 0.070525
hour_busy_outs_avg 0.032786
hour_mean_total_onshift_dashers 0.049111
market_day_mean_total_outstanding_orders 0.032590
store_day_of_week_est_time_prep_per_item_mean 0.022351
busy_to_outstanding 0.020434
orders_without_dashers 0.019857
created_day_mean_total_outstanding_orders 0.016961
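
Several names in both tables suggest two kinds of engineered feature: ratios between marketplace counts (onshift_to_outstanding) and group means by time bucket (hour_mean_total_onshift_dashers). The sketch below is a guess at what such features might look like based only on their names; the raw column names total_onshift_dashers, total_outstanding_orders, and created_hour, the zero-denominator handling, and the sample rows are all assumptions, not the notebook's definitions.

```python
from collections import defaultdict

# Hypothetical rows; raw column names are inferred from the feature names.
orders = [
    {"created_hour": 18, "total_onshift_dashers": 30, "total_outstanding_orders": 45},
    {"created_hour": 18, "total_onshift_dashers": 50, "total_outstanding_orders": 25},
    {"created_hour": 2, "total_onshift_dashers": 4, "total_outstanding_orders": 0},
]


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed policy)."""
    return numerator / denominator if denominator else 0.0


# Group mean: average number of on-shift dashers for each hour of day.
sums = defaultdict(float)
counts = defaultdict(int)
for row in orders:
    sums[row["created_hour"]] += row["total_onshift_dashers"]
    counts[row["created_hour"]] += 1
hour_mean = {hour: sums[hour] / counts[hour] for hour in sums}

# Attach both engineered features to every order.
for row in orders:
    row["onshift_to_outstanding"] = safe_ratio(
        row["total_onshift_dashers"], row["total_outstanding_orders"]
    )
    row["hour_mean_total_onshift_dashers"] = hour_mean[row["created_hour"]]

print(orders[1]["onshift_to_outstanding"])           # 2.0
print(orders[0]["hour_mean_total_onshift_dashers"])  # 40.0
```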

Step Three: Dimensionality Reduction

Please note that this section of the project needs more attention and development. Two popular dimensionality reduction methods were considered for this project:

Recall that dimensionality reduction can offer two general benefits: better model accuracy and lower compute cost. In this project, both 'reduced' models were less accurate than the 'unreduced' model, and the 'top features' approach outperformed PCA.
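
To make the contrast between the two approaches concrete, here is a minimal sketch on synthetic data. The feature matrix, the importance scores, and the choice of k are all made up for illustration, and PCA is done directly with a singular value decomposition rather than with whatever library the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # synthetic feature matrix, 6 features
# Made-up importance scores, one per column of X.
importances = np.array([0.31, 0.05, 0.02, 0.013, 0.012, 0.011])
k = 3

# 'Top features': keep the k original columns with the highest importance.
top_idx = np.argsort(importances)[::-1][:k]
X_top = X[:, top_idx]

# PCA: center the data, then project onto the first k principal directions.
X_centered = X - X.mean(axis=0)
_, singular_values, vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ vt[:k].T

# Share of total variance retained by the k components.
explained = float((singular_values[:k] ** 2).sum() / (singular_values ** 2).sum())
print(X_top.shape, X_pca.shape, round(explained, 3))
```

The practical difference is that 'top features' keeps a subset of the original, interpretable columns, while PCA replaces them with linear combinations chosen without reference to the target.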

Step Four: Modeling