Telenav / osrm-backend

Open Source Routing Machine - C++ backend
http://map.project-osrm.org
BSD 2-Clause "Simplified" License

Machine Learning Experiment for ETA service #356

Open CodeBear801 opened 4 years ago

CodeBear801 commented 4 years ago

Subtask of #355

We plan to build a machine learning model based on users' GPS trace data. This issue records some experiments and proofs of concept that help in understanding the problem set.

I have done several experiments to get familiar with ML; here I record the three that I feel are most relevant:

Data and features determine the upper limit of machine learning; models and algorithms only approach that limit.

New York City Taxi Trip Duration from Kaggle

Data source

https://www.kaggle.com/c/nyc-taxi-trip-duration/data

| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |

OSRM features

| id | total_distance | total_travel_time | number_of_steps |
|---|---|---|---|
| id2875421 | 2009.1 | 164.9 | 5 |
| id2377394 | 2513.2 | 332.0 | 6 |
| id3504673 | 1779.4 | 235.8 | 4 |
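As a minimal sketch of how these OSRM features could be attached to the trip records, the rows can be joined on `id`. The helper name `join_osrm_features` and the derived `duration_ratio` column are hypothetical and only for illustration; the sample values are copied from the tables above. With pandas, the same join would normally be a `DataFrame.merge` on `id`.

```python
# Trip rows and OSRM rows copied from the samples above (subset of columns).
trips = [
    {"id": "id2875421", "trip_duration": 455},
    {"id": "id2377394", "trip_duration": 663},
]
osrm_rows = [
    {"id": "id2875421", "total_distance": 2009.1, "total_travel_time": 164.9, "number_of_steps": 5},
    {"id": "id2377394", "total_distance": 2513.2, "total_travel_time": 332.0, "number_of_steps": 6},
    {"id": "id3504673", "total_distance": 1779.4, "total_travel_time": 235.8, "number_of_steps": 4},
]


def join_osrm_features(trips, osrm_rows):
    """Left-join OSRM route features onto trips by id (hypothetical helper)."""
    by_id = {row["id"]: row for row in osrm_rows}
    joined = []
    for trip in trips:
        # Trips without an OSRM row keep only their own columns.
        features = by_id.get(trip["id"], {})
        joined.append({**features, **trip})
    return joined


joined = join_osrm_features(trips, osrm_rows)
for row in joined:
    # Observed duration relative to OSRM's estimate (hypothetical derived feature).
    row["duration_ratio"] = row["trip_duration"] / row["total_travel_time"]
```

A ratio well above 1 suggests that the OSRM estimate alone underestimates the real trip time, which is the gap the learned model is meant to close.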

Weather feature

I believe the weather features were crawled from an open data website; the related data for this Kaggle competition can be found here. For more information, see here -> 6.1 Weather reports

Feature extracting
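The issue does not list which features were extracted, so the following is only a sketch of typical ones for this dataset, under the assumption that time-of-day and straight-line distance are useful: pickup hour, pickup weekday, and the haversine distance between pickup and dropoff. The function names and the choice of features are assumptions; the sample row is the first trip from the data above.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def extract_features(trip):
    """Derive simple time and distance features from one raw trip row."""
    pickup = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "pickup_hour": pickup.hour,
        "pickup_weekday": pickup.weekday(),  # Monday == 0
        "haversine_km": haversine_km(
            trip["pickup_latitude"], trip["pickup_longitude"],
            trip["dropoff_latitude"], trip["dropoff_longitude"],
        ),
    }


trip = {
    "pickup_datetime": "2016-03-14 17:24:55",
    "pickup_longitude": -73.982155, "pickup_latitude": 40.767937,
    "dropoff_longitude": -73.964630, "dropoff_latitude": 40.765602,
}
features = extract_features(trip)
```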

Training

XGBoosting

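import xgboost as xgb

# dtrain (an xgb.DMatrix) and watchlist (a list of (DMatrix, name) pairs) are assumed to be built beforehand.
# 'reg:squarederror' replaces the deprecated 'reg:linear'; 'verbosity': 0 replaces the deprecated 'silent': 1.
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'booster' : 'gbtree', 'verbosity': 0,
            'eval_metric': 'rmse', 'objective': 'reg:squarederror'}

# Train for at most 60 rounds and log the evaluation metric every 10 rounds.
model = xgb.train(xgb_pars, dtrain, num_boost_round=60, evals=watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)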

Parameter Tune

Most parameters in XGBoost are about the bias-variance tradeoff. When we allow the model to get more complicated (e.g. more depth), the model has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit. (See: XGBoost Parameters)

Try with different parameters

# Candidate pool for random search: every parameter combination (5*3*3*5*5*5 = 5625 in total)
xgb_pars = []
for MCW in [10, 20, 50, 75, 100]:
    for ETA in [0.05, 0.1, 0.15]:
        for CS in [0.3, 0.4, 0.5]:
            for MD in [6, 8, 10, 12, 15]:
                for SS in [0.5, 0.6, 0.7, 0.8, 0.9]:
                    for LAMBDA in [0.5, 1., 1.5,  2., 3.]:
                        xgb_pars.append({'min_child_weight': MCW, 'eta': ETA,
                                         'colsample_bytree': CS, 'max_depth': MD,
                                         'subsample': SS, 'lambda': LAMBDA,
                                         'nthread': -1, 'booster' : 'gbtree', 'eval_metric': 'rmse',
                                         'verbosity': 0, 'objective': 'reg:squarederror'})

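An exhaustive search over this grid takes an extremely large amount of resources and time: the loops above produce 5 × 3 × 3 × 5 × 5 × 5 = 5,625 parameter combinations, each requiring a full training run.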
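One common way to keep the cost bounded is to train on only a random sample of the combinations. The sketch below is an assumption about how that sampling could look, not the exact procedure used in the experiment; `sample_params`, `n_trials` and `seed` are hypothetical names, and the value lists are the ones from the loops above.

```python
import itertools
import random

# Same candidate values as in the nested loops above.
grid = {
    "min_child_weight": [10, 20, 50, 75, 100],
    "eta": [0.05, 0.1, 0.15],
    "colsample_bytree": [0.3, 0.4, 0.5],
    "max_depth": [6, 8, 10, 12, 15],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9],
    "lambda": [0.5, 1.0, 1.5, 2.0, 3.0],
}


def sample_params(grid, n_trials, seed=0):
    """Draw n_trials distinct parameter combinations from the full grid."""
    rng = random.Random(seed)
    keys = list(grid)
    combos = list(itertools.product(*(grid[k] for k in keys)))
    picked = rng.sample(combos, n_trials)  # sampling without replacement
    return [dict(zip(keys, combo)) for combo in picked]


# Size of the full grid, for comparison with the number of sampled trials.
total = 1
for values in grid.values():
    total *= len(values)

candidates = sample_params(grid, n_trials=20)
```

Each entry of `candidates` can then be merged with the fixed settings (booster, objective, evaluation metric) and passed to a training run, so the budget is 20 runs instead of `total`.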

Cross Validation

http://blog.mrtz.org/2015/03/09/competition.html
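As a reminder of the basic mechanics, here is a minimal k-fold split written in plain Python. It is only a sketch: the function name is hypothetical, the folds are contiguous, and in practice the rows would usually be shuffled first or a library splitter would be used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample lands in exactly one validation fold, so averaging the validation error over the folds uses all of the data for both training and evaluation without scoring a model on rows it was trained on.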

Flight Delay Estimation(gcloud)

Input Data

|summary|   FL_DATE|UNIQUE_CARRIER|        AIRLINE_ID|CARRIER|            FL_NUM| ORIGIN_AIRPORT_ID|ORIGIN_AIRPORT_SEQ_ID|ORIGIN_CITY_MARKET_ID|ORIGIN|   DEST_AIRPORT_ID|DEST_AIRPORT_SEQ_ID|DEST_CITY_MARKET_ID|DEST|       CRS_DEP_TIME|           DEP_TIME|         DEP_DELAY|          TAXI_OUT|         WHEELS_OFF|          WHEELS_ON|          TAXI_IN|       CRS_ARR_TIME|           ARR_TIME|         ARR_DELAY|           CANCELLED|CANCELLATION_CODE|            DIVERTED|         DISTANCE|   DEP_AIRPORT_LAT|   DEP_AIRPORT_LON|DEP_AIRPORT_TZOFFSET|   ARR_AIRPORT_LAT|   ARR_AIRPORT_LON|ARR_AIRPORT_TZOFFSET|EVENT|NOTIFY_TIME|

y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes

The machine learning algorithm predicts the probability that the flight is on time.
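The label definition above can be written as a small function. This is a sketch with a hypothetical name (`on_time_label`); it assumes the delay is given in minutes, as in the `ARR_DELAY` column, with early arrivals represented as negative values.

```python
ON_TIME_LIMIT_MINUTES = 15


def on_time_label(arr_delay_minutes):
    """Return 1 if the flight arrived less than 15 minutes late, else 0."""
    return 1 if arr_delay_minutes < ON_TIME_LIMIT_MINUTES else 0


# Example delays in minutes (made-up values for illustration).
labels = [on_time_label(d) for d in [-5.0, 0.0, 14.9, 15.0, 42.0]]
```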

Logistic Regression via Spark

more info

After writing all the data to CSV, load it into a DataFrame or an RDD (difference), generate a DataFrame containing the result of feature engineering, then call train:

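from pyspark.mllib.classification import LogisticRegressionWithLBFGS
# udf maps each feature-engineered row to a LabeledPoint(label, features)
examples = traindata.rdd.map(udf)
lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)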

prediction

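from pyspark.mllib.classification import LogisticRegressionModel

lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
lrmodel.setThreshold(xxx)  # xxx: decision threshold (value not specified here)
lrmodel.predict(features)  # features: vector of independent features for one flight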

evaluation

def eval(labelpred):
    # labelpred: RDD of (label, pred) pairs, where pred is the predicted on-time probability.
    # Tuple-unpacking lambdas are Python 2 only, so index into the pair instead.
    cancel = labelpred.filter(lambda lp: lp[1] < 0.7)
    nocancel = labelpred.filter(lambda lp: lp[1] >= 0.7)
    corr_cancel = cancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    corr_nocancel = nocancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()

    # Guard against division by zero when a group is empty.
    cancel_denom = cancel.count()
    nocancel_denom = nocancel.count()
    if cancel_denom == 0:
        cancel_denom = 1
    if nocancel_denom == 0:
        nocancel_denom = 1
    return {'total_cancel': cancel.count(),
            'correct_cancel': float(corr_cancel)/cancel_denom,
            'total_noncancel': nocancel.count(),
            'correct_noncancel': float(corr_nocancel)/nocancel_denom
           }
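To make the metric concrete without a Spark cluster, the same computation can be sketched on a plain Python list of `(label, pred)` pairs. The function name `eval_pairs` and the sample pairs are made up for illustration; the 0.7 threshold and the result keys are taken from the function above.

```python
def eval_pairs(labelpred, threshold=0.7):
    """Plain-Python version of the evaluation above, for (label, pred) pairs."""
    cancel = [(label, pred) for label, pred in labelpred if pred < threshold]
    nocancel = [(label, pred) for label, pred in labelpred if pred >= threshold]
    # A "cancel" decision is correct when the flight was in fact late (label 0);
    # a "no cancel" decision is correct when it was on time (label 1).
    corr_cancel = sum(1 for label, _ in cancel if label == 0)
    corr_nocancel = sum(1 for label, _ in nocancel if label == 1)
    return {
        "total_cancel": len(cancel),
        "correct_cancel": corr_cancel / max(len(cancel), 1),
        "total_noncancel": len(nocancel),
        "correct_noncancel": corr_nocancel / max(len(nocancel), 1),
    }


# Made-up (label, predicted on-time probability) pairs.
pairs = [(1, 0.9), (1, 0.8), (0, 0.75), (0, 0.2), (1, 0.6), (0, 0.1)]
result = eval_pairs(pairs)
```

Reporting the two groups separately shows whether errors concentrate on the "cancel" side or the "no cancel" side, which a single overall accuracy number would hide.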

Tensorflow via Cloud Dataproc

More info

Flight Delay Estimation(open source stack)

Classification

more info

How to improve prediction model and how to evaluate

more info

CodeBear801 commented 4 years ago

Discussion on 07182020

more resources: https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/team_discussion_for_06182020.md

Brief walk-through: https://github.com/Telenav/osrm-backend/issues/356

New York Taxi duration

Flight ETA Via SparkML

https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/making_predictions_sample.ipynb

Flight ETA Via GCloud AI Platform

CodeBear801 commented 4 years ago

Draft Machine Learning diagram, related with https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582

Context Diagram

image


Note: the input here is the output of the flow in https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582

Container Diagram

To do

Component Diagram

level 1 image

level 2 image

Training data/Test data sample format

trace_id, userid, start_position, end_position, duration, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...

ETA service query format

userid, start_position, end_position, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...
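Comparing the two formats, a query carries the same fields as a training sample except `trace_id` and `duration`, with `duration` being the value to predict. The sketch below makes that relationship explicit. The field lists stop where the original lists trail off with "...", and the helper name and sample values are hypothetical.

```python
# Field names copied from the two formats above (both lists are open-ended in the original).
TRAINING_FIELDS = [
    "trace_id", "userid", "start_position", "end_position", "duration", "distance",
    "osrm_legs", "avg_speed", "osrm_distance", "osrm_duration", "osrm_edge_list",
    "spatial_index_cell_list",
]
QUERY_FIELDS = [
    "userid", "start_position", "end_position", "distance",
    "osrm_legs", "avg_speed", "osrm_distance", "osrm_duration", "osrm_edge_list",
    "spatial_index_cell_list",
]


def to_query_and_label(record):
    """Split one training record into (query features, duration label)."""
    query = {name: record[name] for name in QUERY_FIELDS}
    return query, record["duration"]


# Hypothetical record: every value here is made up for illustration.
record = dict.fromkeys(TRAINING_FIELDS)
record.update({"trace_id": "t-001", "userid": "u-42", "duration": 455.0, "osrm_duration": 164.9})
query, label = to_query_and_label(record)
```

Deriving the query from the training record this way keeps training and serving features aligned, since the model never sees a field at training time that the ETA service cannot supply at query time.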

Notes: