Machine Learning Experiment for ETA service

CodeBear801 commented 4 years ago

Subtask of #355

We plan to build a machine learning model based on users' GPS trace data. This issue records some experiments and proofs of concept done to understand the problem space.

I have done several experiments to get familiar with ML; here I record the three that I feel are most relevant:

New York City Taxi Trip Duration from Kaggle
Flight Delay Estimation(gcloud)
Flight Delay Estimation(open source stack)

Data and features determine the upper limit of machine learning; models and algorithms only approach that limit.

New York City Taxi Trip Duration from Kaggle

Keyword: XGBoost, PCA, data visualization, osrm
Popular notebooks: NYC Taxi EDA - Update: The fast & the curious, Strength of visualization-python visuals tutorial, From EDA to the Top (LB 0.367)
My experiment: From EDA to the Top (LB 0.367), Strength of visualization-python visuals tutorial
Summary:
- Kaggle solutions are good for inspiration and hands-on experiments but are far from production. Kaggle competitions follow certain patterns: most winners use XGBoost, or artificial neural networks for unstructured data.
- Still, it helps me think like an applied machine learning engineer.
- Kaggle provides a convenient environment for ML: the Python notebooks hosted on the website help generate live statistics, and we can also download the Docker image and deploy it on another cloud (Kaggle Python docker image, hub, instruction)
Background: discussion in OSRM's community

Data source

https://www.kaggle.com/c/nyc-taxi-trip-duration/data

    id  vendor_id   pickup_datetime dropoff_datetime    passenger_count pickup_longitude    pickup_latitude dropoff_longitude   dropoff_latitude    store_and_fwd_flag  trip_duration
0   id2875421   2   2016-03-14 17:24:55 2016-03-14 17:32:30 1   -73.982155  40.767937   -73.964630  40.765602   N   455
1   id2377394   1   2016-06-12 00:43:35 2016-06-12 00:54:38 1   -73.980415  40.738564   -73.999481  40.731152   N   663

No GPS traces
The scenario is limited to New York: training data and test data are both located in NY
1458644 trip records in train.csv and 625134 trip records in test.csv
- If we cluster origin and destination points, each cluster pair (origin-destination location pair) is covered by many trips
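The coverage argument above can be sketched as follows. This is only an illustration: a fixed lat/lon grid stands in for a real clustering step, and `cell_of`, `od_pair_counts`, and the 0.01-degree cell size are assumptions, not part of the original experiment. The two trips are the sample rows shown above.

```python
from collections import Counter

CELL_DEG = 0.01  # assumed cell size in degrees (roughly 1 km around NYC)

def cell_of(lon, lat, size=CELL_DEG):
    """Map a coordinate to an integer grid cell (hypothetical stand-in for a cluster id)."""
    return (int(lon // size), int(lat // size))

def od_pair_counts(trips):
    """Count trips per (origin cell, destination cell) pair."""
    counts = Counter()
    for pickup_lon, pickup_lat, dropoff_lon, dropoff_lat in trips:
        pair = (cell_of(pickup_lon, pickup_lat), cell_of(dropoff_lon, dropoff_lat))
        counts[pair] += 1
    return counts

# (pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)
trips = [
    (-73.982155, 40.767937, -73.964630, 40.765602),
    (-73.980415, 40.738564, -73.999481, 40.731152),
]
pair_counts = od_pair_counts(trips)
```

On the full train.csv, the size of each counter entry would show how many trips back each origin-destination pair.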

OSRM features

id	total_distance	total_travel_time	number_of_steps
id2875421	2009.1	164.9	5
id2377394	2513.2	332.0	6
id3504673	1779.4	235.8	4

The OSRM route is calculated from the orig/dest points and yields distance, duration, and number of steps to represent the route
- When we have GPS traces, we could do spatial index mapping to generate a list of spatial index cells that represents the route (more info, Google S2)
- Or we could do map matching, snapping points to a list of navigable edges in the graph, and then extract more features (more info)
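A minimal sketch of the spatial index mapping idea, assuming a plain lat/lon grid in place of Google S2 cells. The function name `trace_to_cells`, the 0.005-degree cell size, and the trace points are all hypothetical.

```python
def trace_to_cells(points, cell_deg=0.005):
    """Convert an ordered list of (lon, lat) GPS points into the ordered list of
    grid cells the trace passes through, dropping consecutive repeats."""
    cells = []
    for lon, lat in points:
        cell = (int(lon // cell_deg), int(lat // cell_deg))
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells

# Hypothetical trace: four GPS fixes along one trip
trace = [
    (-73.98215, 40.76793),
    (-73.98190, 40.76790),  # falls into the same cell as the previous fix
    (-73.97480, 40.76700),
    (-73.96463, 40.76560),
]
cell_list = trace_to_cells(trace)
```

The resulting cell list could then serve as a route-level feature, or as a key for attaching per-cell speed data.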

Weather feature

I think the weather feature was crawled from an open data website; the related data for this Kaggle competition can be found here. For more information, see -> 6.1 Weather reports

Feature extraction

PCA to transform longitude and latitude, which helps decision tree splits
Distance
Normalize Datetime
Speed
Clustering orig and dest
Temporal and geospatial aggregation
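Some of the features listed above can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's code: the helper names are made up, the PCA is written out by hand for the 2-D case, and the row is the first sample record from the data source section.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points in kilometres."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def time_features(pickup_datetime):
    """Split a 'YYYY-MM-DD HH:MM:SS' timestamp into numeric features."""
    dt = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")
    return {"weekday": dt.weekday(), "hour": dt.hour,
            "week_hour": dt.weekday() * 24 + dt.hour}

def pca_rotate(points):
    """Rotate 2-D points onto their principal axes (PCA for the 2-D case)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the first principal axis
    c, s = math.cos(theta), math.sin(theta)
    return [((p[0] - mx) * c + (p[1] - my) * s,
             -(p[0] - mx) * s + (p[1] - my) * c) for p in points]

row = {"pickup_datetime": "2016-03-14 17:24:55",
       "pickup": (-73.982155, 40.767937),    # (longitude, latitude)
       "dropoff": (-73.964630, 40.765602),
       "trip_duration": 455}                 # seconds

distance_km = haversine_km(*row["pickup"], *row["dropoff"])
features = time_features(row["pickup_datetime"])
avg_speed_kmh = distance_km / row["trip_duration"] * 3600
rotated = pca_rotate([row["pickup"], row["dropoff"]])
```

Speed is derived from trip_duration, which is the target and is absent from test.csv, so it is only usable in aggregated form (for example, average speed per cluster or per hour).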

Training

XGBoost

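import xgboost as xgb

# dtrain (an xgb.DMatrix) and watchlist (a list of (DMatrix, name) pairs) are assumed to be defined beforehand.
# Note: newer XGBoost releases replace 'silent' with 'verbosity' and 'reg:linear' with 'reg:squarederror'.
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'booster' : 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}

# Train for at most 60 boosting rounds, logging the evaluation metric every 10 rounds.
model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)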

Parameter Tune

Most XGBoost parameters are about the bias-variance tradeoff. When we allow the model to become more complex (e.g. more depth), it fits the training data better, resulting in a less biased model. However, such a complex model requires more data to fit. See: XGBoost Parameters

Try with different parameters

# random search over different parameter combinations
xgb_pars = []
for MCW in [10, 20, 50, 75, 100]:
    for ETA in [0.05, 0.1, 0.15]:
        for CS in [0.3, 0.4, 0.5]:
            for MD in [6, 8, 10, 12, 15]:
                for SS in [0.5, 0.6, 0.7, 0.8, 0.9]:
                    for LAMBDA in [0.5, 1., 1.5,  2., 3.]:
                        xgb_pars.append({'min_child_weight': MCW, 'eta': ETA, 
                                         'colsample_bytree': CS, 'max_depth': MD,
                                         'subsample': SS, 'lambda': LAMBDA, 
                                         'nthread': -1, 'booster' : 'gbtree', 'eval_metric': 'rmse',
                                         'silent': 1, 'objective': 'reg:linear'})

This takes an extremely large amount of resources and time.
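The nested loops above enumerate 5 × 3 × 3 × 5 × 5 × 5 = 5625 combinations, so training every one of them is rarely affordable. A minimal sketch of sampling a fixed budget from that grid instead; `N_TRIALS` and the seed are assumptions, and the training call itself is left out.

```python
import itertools
import random

# Same value lists as in the loops above
search_space = {
    'min_child_weight': [10, 20, 50, 75, 100],
    'eta': [0.05, 0.1, 0.15],
    'colsample_bytree': [0.3, 0.4, 0.5],
    'max_depth': [6, 8, 10, 12, 15],
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9],
    'lambda': [0.5, 1., 1.5, 2., 3.],
}

keys = list(search_space)
grid = [dict(zip(keys, values))
        for values in itertools.product(*search_space.values())]

N_TRIALS = 20            # assumed budget, not taken from the original experiment
rng = random.Random(0)   # fixed seed so the sample is reproducible
candidates = rng.sample(grid, N_TRIALS)
```

Each entry in `candidates` can be merged with the fixed settings (booster, objective, eval_metric) and passed to `xgb.train`, keeping the combination with the best validation score.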

Cross Validation

http://blog.mrtz.org/2015/03/09/competition.html

Flight Delay Estimation(gcloud)

Keyword: SparkML, Logistic Regression, Tensorflow, Wide-and-Deep, Cloud Dataproc
My experiment: notes
Summary:
- Cloud Dataproc makes development easy and is easy to scale. It launches a pre-built container image that contains TensorFlow, Python 3, etc.
- Google's Pub/Sub system can be used to simulate live streaming with batch data
- Dataflow, Cloud Bigtable, and Data Studio help a lot with building a streaming system, which is discussed further in #357
- During testing, we use batch data (such as one month of flight data) as input to the machine learning pipeline
- In the live streaming system, Apache Beam aggregates data from Pub/Sub -> results are recorded as CSV -> data is loaded into Cloud Bigtable -> training is triggered with a checkpoint; more info in #357
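Simulating a live stream from batch data comes down to replaying records in event-time order with a compressed clock. A minimal sketch of the timing part only, assuming a hypothetical `replay_schedule` helper and a 60x speed-up; the actual publish call (for example to Pub/Sub) is not shown, and the timestamps are made up.

```python
from datetime import datetime

def replay_schedule(event_times, speedup=60.0):
    """Given event timestamps from a batch file, return how many real seconds to
    wait before publishing each event so the batch replays as a stream."""
    parsed = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in event_times)
    if not parsed:
        return []
    delays = []
    previous = parsed[0]
    for current in parsed:
        # Gap to the previous event, shrunk by the speed-up factor
        delays.append((current - previous).total_seconds() / speedup)
        previous = current
    return delays

# Hypothetical event times
delays = replay_schedule(
    ["2016-03-14 17:24:55", "2016-03-14 17:32:30", "2016-03-14 17:54:55"])
```

A publisher loop would sleep for each delay and then send the corresponding record, so one month of data can be replayed in a fraction of the time.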

Input Data

|summary|   FL_DATE|UNIQUE_CARRIER|        AIRLINE_ID|CARRIER|            FL_NUM| ORIGIN_AIRPORT_ID|ORIGIN_AIRPORT_SEQ_ID|ORIGIN_CITY_MARKET_ID|ORIGIN|   DEST_AIRPORT_ID|DEST_AIRPORT_SEQ_ID|DEST_CITY_MARKET_ID|DEST|       CRS_DEP_TIME|           DEP_TIME|         DEP_DELAY|          TAXI_OUT|         WHEELS_OFF|          WHEELS_ON|          TAXI_IN|       CRS_ARR_TIME|           ARR_TIME|         ARR_DELAY|           CANCELLED|CANCELLATION_CODE|            DIVERTED|         DISTANCE|   DEP_AIRPORT_LAT|   DEP_AIRPORT_LON|DEP_AIRPORT_TZOFFSET|   ARR_AIRPORT_LAT|   ARR_AIRPORT_LON|ARR_AIRPORT_TZOFFSET|EVENT|NOTIFY_TIME|

Explanation of attributes: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
Always keep the separation of data storage and computation in mind
Massive amount of data, and the data has fixed origins and destinations

y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes
// the machine learning algorithm predicts the probability that the flight is on time
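The labelling rule above, written as a small helper. The function name and the sample delays are illustrative only; in the data the delay would come from the ARR_DELAY column.

```python
ON_TIME_THRESHOLD_MIN = 15  # threshold from the rule above, in minutes

def on_time_label(arr_delay_minutes):
    """Return 1 if the flight counts as on time, 0 if it is delayed."""
    return 1 if arr_delay_minutes < ON_TIME_THRESHOLD_MIN else 0

# Hypothetical arrival delays in minutes (negative means early arrival)
labels = [on_time_label(d) for d in (-5.0, 14.0, 15.0, 42.0)]
```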

Logistic Regression via Spark

more info

After recording all data as CSV, we can load it into a DataFrame or RDD (difference), generate a DataFrame containing the result of feature engineering, and then call train

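from pyspark.mllib.classification import LogisticRegressionWithLBFGS
# udf must map each row of traindata to a LabeledPoint(label, features)
examples = traindata.rdd.map(udf)
lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)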

prediction

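from pyspark.mllib.classification import LogisticRegressionModel

lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
lrmodel.setThreshold(xxx)  # xxx: placeholder for the decision threshold
lrmodel.predict(features)  # features: vector of independent variables for one flight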

evaluation

def eval(labelpred):
    # labelpred: RDD of (label, predicted on-time probability) pairs.
    # Tuple-unpacking lambdas are Python 2 only, so index into the pair instead.
    cancel = labelpred.filter(lambda lp: lp[1] < 0.7)
    nocancel = labelpred.filter(lambda lp: lp[1] >= 0.7)
    corr_cancel = cancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    corr_nocancel = nocancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()

    # Guard against division by zero when a group is empty
    cancel_denom = cancel.count()
    nocancel_denom = nocancel.count()
    if cancel_denom == 0:
        cancel_denom = 1
    if nocancel_denom == 0:
        nocancel_denom = 1
    return {'total_cancel': cancel.count(),
            'correct_cancel': float(corr_cancel)/cancel_denom,
            'total_noncancel': nocancel.count(),
            'correct_noncancel': float(corr_nocancel)/nocancel_denom
           }
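To check the metric logic without a Spark cluster, the same computation can be mirrored on a plain Python list. This is a sketch with made-up data; `evaluate` and `sample` are hypothetical names, and the 0.7 threshold is the one used above.

```python
def evaluate(labelpred, threshold=0.7):
    """labelpred: list of (label, predicted on-time probability) pairs."""
    cancel = [(label, pred) for label, pred in labelpred if pred < threshold]
    nocancel = [(label, pred) for label, pred in labelpred if pred >= threshold]
    # A prediction below the threshold is correct when the true label is 0
    corr_cancel = sum(1 for label, _ in cancel if label == 0)
    # A prediction at or above the threshold is correct when the true label is 1
    corr_nocancel = sum(1 for label, _ in nocancel if label == 1)
    return {
        'total_cancel': len(cancel),
        'correct_cancel': corr_cancel / max(len(cancel), 1),
        'total_noncancel': len(nocancel),
        'correct_noncancel': corr_nocancel / max(len(nocancel), 1),
    }

# Made-up (label, probability) pairs
sample = [(1, 0.9), (1, 0.8), (0, 0.75), (0, 0.2), (1, 0.6)]
result = evaluate(sample)
```

Reporting the two groups separately shows whether the model is more reliable when it predicts "on time" or when it predicts "delayed", which a single accuracy figure would hide.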

Tensorflow via Cloud Dataproc

More info

Why wide & Deep helps source
How to implement wide and deep
How tensor flow scales more info
How to scale gcloud ai platform

Flight Delay Estimation(open source stack)

Keyword: SparkML, Scikit-Learn, MongoDB, Kafka
My experiment: note
Summary:
- During development, I built all dependencies and the Python connector into Docker images
- In the development stage, each Docker image needs to be configured so that the dependencies work together
- To scale, a container orchestration tool such as K8s is needed
- If the environment is set up well, developing locally is similar to developing on a public cloud such as gcloud or AWS, but harder to manage

Classification

more info

How to improve prediction model and how to evaluate

more info

CodeBear801 commented 4 years ago

Discussion on 07182020

more resources: https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/team_discussion_for_06182020.md

Brief walk-through: https://github.com/Telenav/osrm-backend/issues/356

New York Taxi duration

https://www.kaggle.com/liuxun801/from-eda-to-the-top-lb-0-367
Topics
- Data Input, target
- XGBoost
- PCA, Cluster, OSRM feature add-on

Flight ETA Via SparkML

https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/making_predictions_sample.ipynb

Topics
- Data Input, target
- Feature abstraction with Spark
- Classification with Spark(how to decide bucket)
- Evaluation
- Why parquet
- Why Dataframe not RDD
- Optimization(optional)

Flight ETA Via GCloud AI Platform

Topics
- ENV setup
- Feature abstraction via Tensorflow
- Training and evaluation
- Wide & Deep Learning(optional)
- Scale(How to scale with GCloud, How Tensorflow Scale)

CodeBear801 commented 4 years ago

Draft Machine Learning diagram, related to https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582

Context Diagram


Note: the input here is the output component of the flow in https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582

Container Diagram

To do

Component Diagram

level 1

level 2

Training data/Test data sample format

trace_id, userid, start_position, end_position, duration, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...

ETA service query format

userid, start_position, end_position, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...
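The two formats differ only in `trace_id` and `duration`: the query carries the same features, and `duration` is the value to predict. A minimal sketch of that relationship, assuming records are handled as dictionaries; the field lists stop where the source lists end with "...", and the sample values are made up.

```python
TRAINING_FIELDS = [
    "trace_id", "userid", "start_position", "end_position", "duration",
    "distance", "osrm_legs", "avg_speed", "osrm_distance", "osrm_duration",
    "osrm_edge_list", "spatial_index_cell_list",
]
# The query format is the training format without the trace id and the target
QUERY_FIELDS = [f for f in TRAINING_FIELDS if f not in ("trace_id", "duration")]

def split_sample(record):
    """Split a training record into (query features, target duration)."""
    features = {name: record[name] for name in QUERY_FIELDS}
    return features, record["duration"]

# Hypothetical record; every value here is illustrative
record = dict.fromkeys(TRAINING_FIELDS)
record.update({"trace_id": "t-001", "userid": "u-42", "duration": 455,
               "osrm_distance": 2009.1, "osrm_duration": 164.9})
features, target = split_sample(record)
```

Keeping the query fields as a strict subset of the training fields means the same feature pipeline can serve both training and online queries.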

Notes:

The training model accepts bounded data (e.g. the past 3 months) and unbounded data (e.g. live data)
Testing will mainly be based on bounded data, which could be a CSV of GPS trace data where each line follows the Training data/Test data sample format
Features that need heavy calculation will be moved to steps prior to Model Training, such as
- OSRM route calculation
- Map Matching
- Spatial index mapping for trace points
- Live traffic/historical speed injection for spatial cells
- Weather abstraction
The Model Training component is supposed to generate the following features (examples only; whether they are needed still has to be evaluated)
- PCA for orig and destination
- clustering
- format time
- great circle distance between orig and destination
For how to generate unbounded data from users' traces, see https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582
Different models for different purposes
- If we want to estimate ETA for a specific user's usual route (say, a daily commute that is detected by another service and passed to the ETA service with a flag), the model should be per user per pattern
- If we want to estimate ETA for a generic user's arbitrary query, we need a single model for all such requests
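The model choice described above could be sketched as a lookup with a fallback. Everything here is hypothetical: the `select_model` name, the `(user_id, pattern_id)` key, and the use of plain strings in place of trained models.

```python
def select_model(user_id, pattern_id, user_models, generic_model):
    """Pick the per-user-per-pattern model when the request is flagged as a usual
    route and such a model exists; otherwise fall back to the generic model."""
    if pattern_id is not None:
        return user_models.get((user_id, pattern_id), generic_model)
    return generic_model

# Strings stand in for trained model objects
user_models = {("u-42", "daily_commute"): "model_u42_commute"}
generic_model = "model_generic"

flagged = select_model("u-42", "daily_commute", user_models, generic_model)
unflagged = select_model("u-42", None, user_models, generic_model)
unknown = select_model("u-7", "daily_commute", user_models, generic_model)
```

The fallback keeps the service usable for users or patterns that do not yet have enough traces for a dedicated model.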
