Open CodeBear801 opened 4 years ago
Discussion on 07182020
more resources: https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/navigation/eta/team_discussion_for_06182020.md
Brief walk-through: https://github.com/Telenav/osrm-backend/issues/356
Draft Machine Learning diagram, related to https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582
Notes: the input here is the output component of the flow in https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582
To do
level 1
level 2
Training data/Test data sample format
trace_id, userid, start_position, end_position, duration, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...
ETA service query format
userid, start_position, end_position, distance, osrm_legs, avg_speed, osrm_distance, osrm_duration, osrm_edge_list, spatial_index_cell_list...
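The two formats above differ only in `trace_id` and `duration`: a query carries everything a training sample does except the trace identifier and the observed duration (the value to predict). A minimal sketch of that relationship follows, assuming each sample is one csv line in the listed field order. The helper names (`read_samples`, `to_query`) and the sample values are hypothetical, and the fields elided by `...` in the formats are left out.

```python
import csv
import io

# Field order copied from the sample format above; the trailing "..." fields are unknown.
TRAINING_FIELDS = [
    "trace_id", "userid", "start_position", "end_position", "duration",
    "distance", "osrm_legs", "avg_speed", "osrm_distance", "osrm_duration",
    "osrm_edge_list", "spatial_index_cell_list",
]

# A query is a training sample without the trace id and the observed duration.
QUERY_FIELDS = [f for f in TRAINING_FIELDS if f not in ("trace_id", "duration")]


def read_samples(csv_text):
    """Parse csv text (one sample per line, no header) into dicts."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=TRAINING_FIELDS)
    return list(reader)


def to_query(sample):
    """Project a training sample onto the ETA service query format."""
    return {name: sample[name] for name in QUERY_FIELDS}


# Made-up values, only to show the shape of one line (encodings are assumptions).
line = "t1,u1,37.38;-121.99,37.77;-122.41,3600,75000,1,20.8,74000,3300,e1|e2|e3,c1|c2"
samples = read_samples(line)
query = to_query(samples[0])
label = float(samples[0]["duration"])  # training target
```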
Notes:
Training Model accepts bounded data (e.g. past 3 months) and unbounded data (e.g. live data)
For testing, we will mainly rely on bounded data, which could be a csv containing gps trace data, where each line follows the Training data/Test data sample format
Features which need heavy calculation will be moved to steps prior to Model Training, such as
The Model Training component is supposed to generate the following features (just examples; whether they are needed or not needs evaluation)
For how to generate unbounded data based on user's trace, please go to https://github.com/Telenav/osrm-backend/issues/357#issuecomment-647820582
Different model for different purpose
Subtask of #355
We plan to build a machine learning model based on users' gps trace data. Here I record some experiments and proofs of concept for understanding the problem set.
I have done several experiments to get familiar with ML; here I record 3 of them that I feel are highly related:
Data and characteristics determine the upper limit of machine learning, and models and algorithms just approach this upper limit.
New York City Taxi Trip Duration from Kaggle
applied machine learning engineer
The python notebook provided by the website helps to generate live statistics, and we could also download the docker image and deploy it on another cloud (Kaggle Python docker image, hub, instruction)
Data source
https://www.kaggle.com/c/nyc-taxi-trip-duration/data
OSRM features
route
gps traces: we could do spatial index mapping to generate a list of spatial index boxes that represents the route (more info, Google S2)
Weather feature
I think the weather feature is crawled from an open data website; you could find related data for this Kaggle competition here. For more information, go to 6.1 Weather reports
Feature extracting
PCA to transform longitude and latitude, which helps decision tree splits
Distance
Datetime
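The three feature groups above can be sketched with the standard library only. This is a rough illustration, not the notebook's code: the function names are hypothetical, the PCA is the closed-form 2-D case (rotating coordinates onto their principal axes) rather than a library call, and haversine is assumed as the distance measure.

```python
import math
from datetime import datetime


def pca_rotate_2d(points):
    """Rotate (lon, lat) pairs onto their principal axes (closed-form 2-D PCA)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [
        ((x - mean_x) * c + (y - mean_y) * s, -(x - mean_x) * s + (y - mean_y) * c)
        for x, y in points
    ]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres, assuming a 6371 km Earth radius."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def datetime_features(ts):
    """Split a pickup timestamp into simple calendar features."""
    return {"hour": ts.hour, "weekday": ts.weekday(), "is_weekend": ts.weekday() >= 5}


rotated = pca_rotate_2d([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
features = datetime_features(datetime(2016, 3, 14, 17, 24))
```

After rotation, axis-aligned tree splits line up with the dominant direction of the data, which is the reason given above for applying PCA to longitude and latitude.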
Training
XGBoost
Parameter Tune
Most of the parameters in XGBoost are about the bias-variance tradeoff. When we allow the model to get more complicated (e.g. more depth), the model has better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit. (XGBoost Parameters)
Try with different parameters: this takes an extremely large amount of resources and time.
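To see why trying different parameters gets expensive, it helps to count the candidate settings. The sketch below only enumerates a grid; it does not call XGBoost. The parameter names are real XGBoost parameters, but the value lists and the fold count are arbitrary assumptions for illustration.

```python
from itertools import product

# Hypothetical search grid; values are illustrative only.
param_grid = {
    "max_depth": [4, 6, 8, 10],
    "eta": [0.05, 0.1, 0.3],
    "subsample": [0.7, 0.9],
    "min_child_weight": [1, 5, 10],
}


def iter_param_sets(grid):
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


candidates = list(iter_param_sets(param_grid))
n_folds = 5  # assumed number of cross-validation folds
total_fits = len(candidates) * n_folds  # each candidate is trained once per fold
```

The number of fits is the product of all list lengths times the fold count, so adding one more parameter with three values triples the cost.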
Cross Validation
http://blog.mrtz.org/2015/03/09/competition.html
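As a reminder of the mechanics, a minimal k-fold split can be written as below. This is a plain-Python sketch with a hypothetical helper name; it splits contiguous index ranges without shuffling, which may not suit trace data where samples from the same user or time window are correlated.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Every sample is used for validation exactly once, and the model is scored on data it was not trained on in each round.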
Flight Delay Estimation(gcloud)
Cloud Dataproc is easy to develop on and easy to scale. It launches a pre-built container image which contains tensorflow, python3, etc.
The pub/sub system could simulate live streaming with batch data
Dataflow, Cloud Bigtable and Data Studio help a lot with building a streaming system, which will be discussed more in #357
Use apache beam to aggregate data from pub/sub -> record result as csv -> load data into cloud bigtable -> trigger training with checkpoint, more info in #357
Input Data
fixed orig and destination
y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes // machine learning algorithm predicts the probability that the flight is on time
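The labelling rule above translates directly into code. A minimal sketch, with a hypothetical function name and delays assumed to be in minutes:

```python
ON_TIME_THRESHOLD_MIN = 15  # threshold taken from the rule above


def on_time_label(arrival_delay_min):
    """Return 1 if the flight counts as on time (delay < 15 minutes), else 0."""
    return 1 if arrival_delay_min < ON_TIME_THRESHOLD_MIN else 0


# Made-up delays, only to show the boundary behaviour.
labels = [on_time_label(d) for d in (-3, 0, 14.9, 15, 42)]
```

A classifier trained on these labels outputs the probability of `y = 1`, i.e. the probability that the flight is on time.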
Logistic Regression via Spark
more info
After recording all data into csv, we could load the data into a dataframe or rdd (difference), then generate a dataframe that contains the result of feature engineering, then call train
prediction
evaluation
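The train, prediction and evaluation steps can be illustrated without a cluster. The sketch below is a plain-Python stand-in using batch gradient descent, not Spark's logistic regression API. The single feature (departure delay in hours), the data rows and the hyperparameters are all made-up assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, lr=1.0, epochs=20000):
    """Fit weight and bias of a one-feature logistic regression by gradient descent."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    """Probability that the flight is on time (y = 1)."""
    w, b = model
    return sigmoid(w * x + b)


def evaluate(model, rows):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum((predict(model, x) >= 0.5) == (y == 1) for x, y in rows)
    return hits / len(rows)


# Made-up (departure_delay_hours, on_time) pairs.
rows = [(-5 / 60, 1), (0.0, 1), (5 / 60, 1), (10 / 60, 1),
        (20 / 60, 0), (30 / 60, 0), (45 / 60, 0), (60 / 60, 0)]
model = train(rows)
accuracy = evaluate(model, rows)
```

Evaluating on the training rows is only a smoke test here; a real evaluation would use held-out data.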
Tensorflow via Cloud Dataproc
More info
Why Wide & Deep helps (source)
How to implement Wide & Deep
How TensorFlow scales (more info)
How to scale gcloud ai platform
Flight Delay Estimation(open source stack)
Classification
more info
How to improve prediction model and how to evaluate
more info