Open rahulkulhalli opened 9 months ago
@rahulkulhalli
Interestingly, they use the Distance Matrix API from GMaps to query estimated travel time.
This is actually what Bingrong uses in her modeling as well. One challenge is the cost of querying large numbers of travel times on Google Maps. Zack tried this approach for a subset of the population, but was not able to get stable results even with it. You might want to experiment with
So we're basically doing
cost[section] = cost_factors_init[section] + (cost_factors[section] * distance[section])
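Concretely, the formula above can be sketched as follows (the factor values here are illustrative placeholders, not the project's actual cost factors):

```python
# Illustrative sketch of the per-section cost formula; factor values are
# placeholders, not the real project numbers.
cost_factors_init = {'car': 0.0, 'bus': 2.0}   # fixed per-section cost (assumed)
cost_factors = {'car': 0.6, 'bus': 0.25}       # per-mile factor (assumed)
distance = {'car': 10.0, 'bus': 4.0}           # section distance in miles

cost = {s: cost_factors_init[s] + cost_factors[s] * distance[s]
        for s in distance}
# cost == {'car': 6.0, 'bus': 3.0}
```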
the current implementation is using labels. Labels are currently at the trip level. So the current implementation is cost[trip]...
. Please review the OpenPATH data model (chapter 5 of my thesis).
In that case, could we use section_modes and section_distances (remember what @shankari said above - if working at the trip level, take the maximum, or work at the section level)?
Correct, we should not use labels while creating features. Yes, we can consider using the section_modes and section_distances at the trip level.
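A minimal sketch of the "take the maximum" suggestion at the trip level (field names assumed, not OpenPATH's exact schema):

```python
# Sketch (field names assumed): pick the trip-level mode as the section with
# the largest distance, per the "take the maximum" suggestion above.
def trip_mode_argmax(section_modes, section_distances):
    # index of the longest section; ties break toward the first occurrence
    i = max(range(len(section_distances)), key=lambda k: section_distances[k])
    return section_modes[i], section_distances[i]

mode, dist = trip_mode_argmax(['walking', 'car', 'walking'], [0.3, 8.2, 0.1])
# mode == 'car', dist == 8.2
```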
Retain the original mapping:
Why? I am not opposed to this, but in general, I want to see the thought process behind the design decisions - an outline of the pros and cons, weighing them and a final decision. I can see a con of this approach when it comes to MLOps/deploying on production, so would like to see the reasons for the pros to outweigh them.
@shankari
My main line of thought was to merge the modes with lower value counts so as to try and mitigate major class imbalance. I actually did do something along the same lines in one of my previous research endeavors (ISIC 2018) where I merged all the sparse labels into one and created a hierarchical pipeline of binary classifiers.
We could:
a) Derive inspiration from the other literature and aim to classify the replacement mode into {walk, bike, car, public transit}.
Pros: We avoid sparse labels and can drive up the volume for each existing label, resulting in more instances for the classifier to try and generalize.
Cons: We lose granularity - i.e., "gas car, drove alone" and "gas car, with others" would be treated the same, when in reality they should not be. We would also need to figure out where labels like "Uber/Lyft" should be mapped. Is it a form of public transit or car transit? If the latter, there might be a higher cost associated with the trip than using a regular car, which may throw the classifier off. Similarly, public transport vehicles generally make multiple stops before arriving at the destination; since there are rarely any stops between the source and destination when we hail an Uber/Lyft, the classifier may be thrown off.
b) So then how about {walk, bike, car, public transit, uber}? That alleviates one of the issues mentioned above, but now it introduces another issue - how many instances of uber/lyft are seen in the dataset? If they are extremely sparse, adding a label would be counter-intuitive.
This is just a rough plot, but I wished to visually see how the labels are distributed.
drove_alone 23612
no_travel 18189
shared_ride 8618
walk 7449
bike 6056
bus 3103
Unlabeled 2676
taxi 1767
I do, however, acknowledge that I'm performing this analysis at a global level when in fact, the models will be user-level.
I do, however, acknowledge that I'm performing this analysis at a global level when in fact, the models will be user-level.
Why would the models be user-level? That is not what we have discussed earlier. Note that each user has one set of demographic labels. Again, I would like to see a high level outline of the proposed modeling process, including the level and the features.
b) So then how about {walk, bike, car, public transit, uber}? That alleviates one of the issues mentioned above, but now it introduces another issue - how many instances of uber/lyft are seen in the dataset? If they are extremely sparse, adding a label would be counter-intuitive.
the con from an MLOps/production perspective is that collapsing the values would make it harder to use this to calculate the downstream metrics such as the one below because we would not have as precise a carbon intensity for the replaced modes.
Note that none of the papers that you have outlined discuss the use of the model outputs. There is a tradeoff between model accuracy and its ability to provide useful outputs. That might be an area we explore in the paper by using the same approach on collapsed/non-collapsed labels.
That call put a lot of things in perspective for me!
To recount:
- mode_confirm will NOT be a valid input variable to the model at inference time.
- replaced_mode is EXTREMELY valuable for downstream metrics and end-users because it allows us to show how much carbon emissions can be saved if they switched from their primary mode of transport to a more energy-efficient mode.
Since we're not using mode_confirm, I'd like to start computing the cost factor using our sensed modes. Here are the unique sensed modes we have at hand:
['walking', 'bicycling', 'car', 'no_sensed', 'bus', 'train',
'air_or_hsr']
The paper details costs per mile for the following modes (based on mode_confirm):
Car, Shared Car, Ridehailing, Shared Micromobility, Transit
For our case, bus and train can be mapped to transit, and car can be mapped to car. I'm assuming walking would have a cost of 0 and bicycling would also be ~0 (if not 0). @shankari Is this assumption correct?
@rahulkulhalli The replaced mode paper was never published, so I would use it as a source of sources, but not as a citable source directly. I would check the cost source (VTPI) to see if they have separate values for bus and train and use them if they do, falling back to transit if they don't. NREL's MEP tool is another source of cost/time estimates - that one does have a peer-reviewed publication that you can cite.
e-bikes do have a small cost component, but it is negligible. However, I believe that either MEP or Bingrong's energy analysis used an estimate for e-bikes that was generated by Andy Duvall. We don't distinguish between e-bikes and bikes in the sensed mode, so it may be fine to go with 0. I would check that assumption with Andy before committing to it.
Couple of expansions, hope they are not more confusing:
Why do we need demographics? We want to see OTHER people in similar situations and who think and act in the same way. This allows us to make generalized decisions.
I think that this is actually a research question. Once we have a baseline model, I think that there are a few factors to explore around sensitivity analysis:
[1] For prior mode choice models (you should check the related work), the models are typically built over a small per-user dataset (on the order of days, or at most one week). But we have data over months. So we can build an individual model (not of trips, but of individual preferences around cost/time) based on the user's revealed preferences. It is not a priori clear that this will do better than a demographic model - we will have less data per user, but we will not be subject to stereotypes from demographics either.
@shankari The following table is captured directly from VTPI's "Cost and Benefit Analysis" survey. I think this gives us a great summary of average costs/mile.
This table illustrates costs/mile for 'regional rail', 'light rail' and 'heavy rail'. The survey does not mention what the difference between them is. According to the National Transit Database's glossary:
Since our data mainly concerns Denver, I'm assuming we should be selecting Light Rail. Would that be a valid assumption?
Also, I will reach out to Andy and ask him if our assumption about the e-bikes is valid.
I'm thinking about how we could discern the type of car, since it could be any of car sharing, uber/lyft, or gas. Depending on the type of car, we may assign the cost. What could we do from the available data, then?
- Should we assign a single generalized cost factor to all car mode instances? The obvious drawback of this approach is that we generalize the cost factor across every user.
- If a trip's section modes are [walk, car, walk], could we glean any information about the type of car? Maybe, depending on how long/far the user walked before switching to the car mode. If the user walks for a long time and then switches to the car mode, we may assume that the user hailed a Taxi/Uber. If the duration/distance of the first walk instance is below some threshold, we may assume that the user walked to the driveway. However, people may also hail Ubers/Lyfts directly from their doorstep. Hmm, I'm not sure about this.
These are just rough ideas. @shankari, any feedback?
I've completed the method to compute the estimated cost. For now, the values are as follows:
mode_cost_per_mile = {
'walking': 0,
'bicycling': 0,
'car': 0.6,
'no_sensed': 0,
'bus': 1.59,
'train': 2.62
}
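Applying those per-mile rates across a trip's sensed sections might look like this (a sketch; the helper name is mine, with init costs of 0 as stated):

```python
# Sketch: apply the per-mile rates above to each sensed section of a trip,
# with init costs of 0 as stated (helper name is hypothetical).
mode_cost_per_mile = {
    'walking': 0, 'bicycling': 0, 'car': 0.6,
    'no_sensed': 0, 'bus': 1.59, 'train': 2.62,
}

def estimated_trip_cost(section_modes, section_distances_miles):
    return sum(mode_cost_per_mile[m] * d
               for m, d in zip(section_modes, section_distances_miles))

cost = estimated_trip_cost(['walking', 'bus', 'walking'], [0.2, 5.0, 0.1])
# ~7.95: only the 5-mile bus leg contributes
```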
For now, I've also treated the init_costs for all modes to be 0.
The bus and train costs have been taken from the VTPI report above. We're headed towards building the baseline model, but I need to think about the label space. Currently, we have the following unique replaced_mode labels:
array(['bike', 'walk', 'no_travel', 'Unlabeled', 'shared_ride',
'drove_alone', 'train', 'scootershare', 'taxi', 'plane',
'free_shuttle', 'skateboard', 'bikeshare', 'e_bike', 'bus',
'not_a trip', 'zip_line', 'golf_cart', 'ski', 'run', 'air',
'emergency_vehicle with others', 'ebike', 'gondola',
'snowboarding', 'call_friend', 'no_replacement', 'na',
'i_walked to the toilet', 'zoo', 'doing_nothing', 'hiking',
'pilot_bike', 'not_accurate', 'not_a_trip', 'stolen_ebike',
'pilot_ebike', 'delivery', 'e-bike',
'walk_at home and drive car alone home', 'n/a', 'testing',
'pilot_e-bike', 'must_walk 3-5 mi a day for back', 'lunch', 'meal',
'home', 'family', 'entertainment', 'not_a trip, app malfunction',
'time_spent on the clock at amazon', 'working', 'walking_at work',
'kaiser', 'walk_at work', 'sitting_on my butt doing nothing',
'nothing._delivered food for work'], dtype=object)
I could:
a) Merge the sparse labels into a single catch-all class. We would retain more data this way and add a 'catch-all' mode, but I'm unsure whether this class would be interpretable in any downstream analytics.
b) Drop the sparse labels entirely. We'd only be left with the important samples, but we will lose some data points.
w.r.t. discerning the type of car, I also tried the following: the sequence of section_modes could inform the type of car being driven. However, that too suffers from a logical pitfall: what if the user orders the Uber at the doorstep? It would still capture walk as the first mode and car as the second. Here, as we can see, the number of ['car'] instances is ~2.5x the number of ['car', 'walking'] trips, and ['walking', 'car', 'walking'] has only 1269 instances:
['car'] 23413
['walking'] 20428
['car', 'walking'] 9722
['bicycling'] 6103
['no_sensed'] 2562
['bicycling', 'walking'] 2249
['walking', 'car', 'walking'] 1269
I am currently trying this approach and I think this is the most well-informed of the bunch: using the demographic information, we can compute the total number of people eligible to drive a car as number of people in household - number of people under the age of 18. Since we also know how many motor vehicles the family has access to, we may postulate that the user used their own car to drive if number of motor vehicles >= number of people eligible to drive. This theory may also be prone to logical holes, but I feel this one is at least informed by data and is a good starting point.
Using the demographic approach, I get the following distribution of car types. This seems like a better representation to me and, unless there's an implementation issue, I think this is a pretty realistic heuristic. I shall conduct some more analysis today.
There is a third approach, which is to collapse the modes down into a set of base modes. See the "I know I'm right" paper and the energy estimation paper that you co-authored. Both of them work primarily with base modes.
We don't have a different cost estimate for taxi anyway.
Your model above doesn't distinguish between personal car and shared ride, which makes a large impact on the emission outcomes.
So what is the point of modeling uber versus car?
My main idea to differentiate Uber v/s car was for the cost estimate of the trip. According to VTPI:
Taxis typically charge $2.00 to $4.00 per mile, depending on type of service (standard or premium), trip length, location, and time. Ridehailing services typically charge 20-40% less, from $1.50 to $3.00 per mile
In contrast, the per-mile cost for personal car usage is $0.6/mile. Since this might make a significant impact in cost estimates, I figured we could try to distinguish between the type of car mode used.
I guess I am still not convinced that this is backed up by data.
This seems like a better representation to me and unless there's an implementation issue, I think this is a pretty realistic heuristic.
I see that you have a distribution of car types using this approach, but have not validated it against the labels that users provided. What is the accuracy of your heuristic-based mini-model compared to the user labels?
As a concrete counter-example, my family has one car and two people with licenses. We just take turns driving the one car. I have not taken a taxi for at least a year.
Also, I'm confused - are you trying to estimate the mode or the replaced mode? So far, your rationale:
Using the demographic information, we can compute the total number of people eligible to drive a car using number of people in household - number of people under the age of 18. Since we also know how many motor vehicles the family has access to, we may postulate that the user used their own car to drive if number of motor vehicles >= number of people eligible to drive. This theory may also be prone to logical holes, but I feel this one is at least informed using data and is a good starting point.
seems to be focused on estimating the actual mode
@shankari For the current analysis, I am trying to estimate the type of the car mode from the section_modes.
My main idea to differentiate Uber v/s car was for the cost estimate of the trip. According to VTPI:
Taxis typically charge $2.00 to $4.00 per mile, depending on type of service (standard or premium), trip length, location, and time. Ridehailing services typically charge 20-40% less, from $1.50 to $3.00 per mile
In contrast, the per-mile cost for personal car usage is $0.6/mile. Since this might make a significant impact in cost estimates, I figured we could try to distinguish between the type of car mode used.
My rationale was:
Since cost is directly dependent on the transport mode, if we are able to discern the type of car used, we can avoid assigning a generalized cost to all the instances where a car mode was used. Since we don't use mode_confirm for cost estimation anymore, I assumed that demographic information could help us get a rough idea of the type of car transport used.
ah so this was for the cost estimate of the alternatives, got it! However, if you are building a mini-model to estimate the trip travel mode, you should also validate its (mini) correctness.
It seems that the heuristics I thought of don't match the actual confirmed distribution. I'd like to also give some description of the heuristics themselves:
Car distance ratio:
compute ratio of distance traveled per mode
if the ratio of car usage is >= 0.9, label as "Car"
if it is < 0.9, label as "Uber"
if car is not present in the section modes at all, return "None"
Demographic info:
Use the number of motor vehicles owned and the number of residents with a valid license:
If n_motor_vehicles >= n_users_with_license, label as "Car"
If n_motor_vehicles < n_users_with_license, label as "Uber"
if car is not present in the section modes at all, return "None"
Demographics + car distance ratio:
If the ratio of car usage is >= 0.9 AND n_motor_vehicles >= n_users_with_license, label as "Car"
Else, label as "Uber"
if car is not present in the section modes at all, return "None"
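The three heuristics above could be sketched as follows (function names and argument names are mine, purely illustrative):

```python
# Sketch of the three car-type heuristics described above
# (function and argument names are hypothetical).
def car_distance_ratio(section_modes, section_distances):
    car = sum(d for m, d in zip(section_modes, section_distances) if m == 'car')
    if car == 0:
        return None                              # car not present at all
    return 'Car' if car / sum(section_distances) >= 0.9 else 'Uber'

def demographic(n_motor_vehicles, n_users_with_license, has_car_section):
    if not has_car_section:
        return None                              # car not present at all
    return 'Car' if n_motor_vehicles >= n_users_with_license else 'Uber'

def combined(section_modes, section_distances,
             n_motor_vehicles, n_users_with_license):
    label = car_distance_ratio(section_modes, section_distances)
    if label is None:
        return None
    if label == 'Car' and n_motor_vehicles >= n_users_with_license:
        return 'Car'
    return 'Uber'

# dominant car leg, household with as many cars as licensed drivers -> 'Car'
label = combined(['walking', 'car', 'walking'], [0.2, 9.0, 0.3], 2, 2)
```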
With your permission, I'd like to try fitting a mini logistic model for determining the car type, using mode_confirm as the target variable. Once the model is trained, we may discard the model and use the underlying coefficients for future use. I wish to only see if I can extract any sort of meaningful information from the model.
Models are training. Will report on the results when available! 🤞
The multinomial logistic model performs pretty poorly, with the best score across 3 CV folds being a weighted F1 of ~0.53.
The confusion matrix entries for the test data reveal some information about the individual classes - Gas car, drove alone has an accuracy of ~51%, Gas car, with others has an accuracy of ~35%, and Uber/Lyft has an accuracy of ~72%.
The Gradient-boosted classifier, on the other hand, does much better. Gas car, drove alone is predicted correctly ~80% of the time, Gas car, with others is predicted correctly ~84% of the time, and funnily enough, Taxi/Uber is what this model struggles the most with, with an accuracy of 50%.
Now, we come to the Random forest model. It performs on par with the gradient-boosted model and, in fact, does better on the Uber/Lyft class. Accuracy for Gas car, drove alone: 80%; accuracy for Gas car, with others: 78%; accuracy for Taxi/Uber: 60%. Admittedly, its performance on the last class isn't as good as the logistic model's (72% on Uber), but it may be a good 'in-between' model, retaining the training time of the logistic model and the performance of the gradient-boosted classifier.
One thing I observed is that Zack's notebook only uses the prepilot data for training. Is this what I should also stick to? Or should I expand this analysis to the entire Stage, 4c, fc, cc, ... prepilot data? The prepilot has 235 users, whereas the survey consists of 202 users, and they have 170 common users among them.
Gas car, drove alone is predicted correctly ~80% of the time, Gas car, with others is predicted correctly ~84% of the time, and funnily enough, Taxi/Uber is what this model struggles the most with, with an accuracy of 50%.
That is interesting. Couple of thoughts:
One thing I observed is that Zack's notebook only uses the prepilot data for training. Is this what I should also stick to? Or should I expand this analysis on the entire Stage, 4c, fc, cc,... prepilot data? The prepilot has 235 users, whereas the survey consists of 202 users and they have 170 common users among them.
I bet it only uses pilot data (not prepilot) for testing. The actual prepilot only has 13 users. If the variable is called prepilot, it is wrong. The stage users were not part of the ebike programs and did not have access to ebikes. So their replaced modes are somewhat irrelevant. That is probably why he excluded them.
Sure, @shankari.
Before sketching out my plan for using the results from these models, I would like to explain the experiment setup:
The current models are trained on all data points that have at least one car section mode reading. The independent variables that this model uses are:
'income_category', 'n_motor_vehicles', 'n_residents_with_license', 'is_male', 'age', 'sin_HOD', 'sin_DOM', 'cos_HOD', 'cos_DOM', 'car_distance_miles'
income_category is one-hot encoded
Numerical features are normalized using StandardScaler
HOD = hour of day
DOM = day of month
car_distance_miles is the total distance traveled by the car mode (in miles)
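The sin/cos HOD and DOM features come from a cyclical encoding; a minimal sketch of the idea (the helper name is mine):

```python
import math

# Sketch of the cyclical encoding behind sin_HOD/cos_HOD and sin_DOM/cos_DOM:
# map the hour (period 24) or day (period 31) onto the unit circle so that
# hour 23 and hour 0 end up adjacent instead of 23 units apart.
def cyclical(value, period):
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

sin_HOD, cos_HOD = cyclical(18, 24)   # 6 pm -> sin ~ -1.0, cos ~ 0.0
sin_DOM, cos_DOM = cyclical(15, 31)
```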
Once we obtain predictions for Gas car, drove alone, Gas car, with others, and Taxi/Uber, we can use them for weighting the costs for the cost-time baseline model.
My main inspiration for the nested model was derived from nested logit models, which are used fairly commonly in mode choice modeling - https://transp-or.epfl.ch/courses/ANTWERP07/08-nested.pdf
This is a rough illustration of my idea:
I think using a weighted average for the cost would be helpful in creating a baseline model. All our future incremental work could be compared to this model's performance. I will go ahead with the baseline per-mile car expense ($0.6/mile) and create the baseline replacement mode model.
I concede that we may be adding too many cogs to the machine at once, which might make it difficult to trace any potential point of failure or optimization. To the best of my recollection, most of the travel mode choice literature I've read assumes up front that the type of car transport is known. However, I think it is a good idea to assume a uniform cost for all car modes to start with and create the baseline model first.
For the baseline model, I chose to go with the following features (the existing replaced mode mapping is being used here; the labels are collapsed into a set of base modes: 'car', 's_car (shared car)', 's_micro (shared micromobility)', 'p_micro (personal micromobility)', 'walk', 'ridehail (Uber/taxi)', 'ebike', 'no_travel').
X = modeling_data[['estimated_cost', 'section_distance_argmax_miles', 'duration', 'distance_miles',
'age', 'n_residence_members', 'n_residents_with_license', 'is_male', 'sin_HOD', 'sin_DOM',
'cos_HOD', 'cos_DOM', 'income_category', 'n_motor_vehicles', 'is_weekend', 'section_mode_argmax'
]].copy()
Y = modeling_data[['Replaced_mode']].copy()
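A rough, self-contained sketch of this kind of pipeline (synthetic stand-in data here; the real run uses modeling_data with the columns above):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sketch with synthetic stand-in data: scale numeric features, one-hot encode
# categoricals, fit a multinomial logistic model, score with weighted F1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'estimated_cost': rng.uniform(0, 10, 300),
    'distance_miles': rng.uniform(0, 20, 300),
    'income_category': rng.choice(['low', 'mid', 'high'], 300),
    'Replaced_mode': rng.choice(['car', 'walk', 'transit'], 300),
})
X, y = df.drop(columns='Replaced_mode'), df['Replaced_mode']

pre = ColumnTransformer([
    ('num', StandardScaler(), ['estimated_cost', 'distance_miles']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['income_category']),
])
clf = Pipeline([('pre', pre), ('lr', LogisticRegression(max_iter=2000))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf.fit(X_tr, y_tr)
score = f1_score(y_te, clf.predict(X_te), average='weighted')
```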
Tried using the logistic classifier, but got the following stats:
Best validation F1 for Logistic model: 0.27480382039025203
Test F1 for Logistic model: 0.2744886353547929
That does not look promising at all. Either the logistic models are not converging (since max_iter was set to only 2000), or the model features are insufficient to model the target variable. The confusion matrix gives us some more information:
The per-class accuracies are:
{'Not a Trip': 1.0, 'Other': 0.6079136690647482, 'Unlabeled': 0.4606741573033708, 'car': 0.2076271186440678, 'ebike': 0.6923076923076923, 'no_travel': 0.16088154269972452, 'p_micro': 0.27228525121555913, 'ridehail': 0.42045454545454547, 's_car': 0.22531939605110338, 's_micro': 0.16666666666666666, 'transit': 0.25391849529780564, 'walk': 0.3602150537634409}
Clearly, the logistic model is struggling. It also struggles with the majority classes 'car' and 'no_travel'.
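For reference, the per-class accuracies reported here are just the confusion-matrix diagonal divided by the row sums, i.e. per-class recall; a tiny sketch with toy labels:

```python
from sklearn.metrics import confusion_matrix

# Per-class accuracy = confusion-matrix diagonal / row sums (per-class recall).
# Toy labels for illustration only.
y_true = ['car', 'car', 'walk', 'walk', 'walk', 'transit']
y_pred = ['car', 'walk', 'walk', 'walk', 'car', 'transit']

classes = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=classes)
per_class = dict(zip(classes, cm.diagonal() / cm.sum(axis=1)))
# per_class == {'car': 0.5, 'transit': 1.0, 'walk': 0.666...}
```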
Using the random forest gives us a remarkable jump in performance - the best weighted F1 goes up from 0.27 to 0.72!
Best validation F1 for RF model: 0.7216160907675281
Test F1 for RF model: 0.7234734296071574
The confusion matrix shows us the model's prowess - much stronger confidences per class and much lower inter-class variance compared to the logistic model.
{'Not a Trip': 1.0, 'Other': 0.6942446043165468, 'Unlabeled': 0.6179775280898876, 'car': 0.7974576271186441, 'ebike': 0.46153846153846156, 'no_travel': 0.7851239669421488, 'p_micro': 0.7358184764991896, 'ridehail': 0.6363636363636364, 's_car': 0.5598141695702671, 's_micro': 0.16666666666666666, 'transit': 0.6050156739811913, 'walk': 0.6666666666666666}
It still struggles with shared micro-mobility and e-bike (~46%), but it is still a marked improvement over the previous model. However, that could be explained by the dearth of samples in the training set:
s_micro and ebike are the two least frequently occurring labels in the dataset.
Trying the gradient-boosted model now.
@rahulkulhalli I am not sure you are using the logistic regression model correctly. I see the featurization, but can you expand on how you fill the row? Note our prior discussion around choice modeling, although you have not recorded details on the alternative trips. https://github.com/e-mission/e-mission-docs/issues/978#issuecomment-1739885897
Using the random forest gives us a remarkable jump in performance - the best weighted F1 goes up from 0.27 to 0.72!
If you are training the random forest on the existing replaced mode labels, you might want to write out the process carefully to make sure that we are not leaking information and that it is actually usable in the production scenario in terms of the data that we have available. Because that is indeed a very nice result :tada:
s_micro and ebike are the two least frequently occurring labels in the dataset.
EDIT: As an even simpler baseline, have you considered simply assigning the replaced mode in the label proportions? That is what Denver CASR suggested.
So you have 100 labeled trips with the following proportions of replaced mode: 0.5 car, 0.3 bus, 0.2 walk You have 500 unlabeled trips, you assign them randomly as 0.5 car, 0.3 bus, 0.2 walk How far off are you?
If the label distribution is very skewed (as in the case above), this may already give you good results. For example, if the replaced mode was 90% car, and you wrote an "algorithm" that just set all the predicted labels to car, you would get an accuracy of 90%
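A quick sketch of this baseline with the toy 0.5/0.3/0.2 proportions above (here the unlabeled trips are simulated as coming from the same distribution, the best case):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy sketch of the proportional baseline: draw predictions from the
# labeled-trip proportions (0.5 car / 0.3 bus / 0.2 walk) and score them
# against ground truth drawn from the same distribution.
rng = np.random.default_rng(42)
labels, probs = ['car', 'bus', 'walk'], [0.5, 0.3, 0.2]

y_true = rng.choice(labels, size=500, p=probs)
y_pred = rng.choice(labels, size=500, p=probs)   # blind proportional assignment
score = f1_score(y_true, y_pred, average='weighted')
```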
So you have 100 labeled trips with the following proportions of replaced mode: 0.5 car, 0.3 bus, 0.2 walk You have 500 unlabeled trips, you assign them randomly as 0.5 car, 0.3 bus, 0.2 walk How far off are you?
I think that would depend on whether the 500 unlabeled trips come from the same distribution. If the distribution changes, we might be adding inductive bias. However, this is an interesting experiment - I can definitely try this on our dataset and report the results.
As an even simpler baseline, have you considered simply assigning the replaced mode in the label proportions? That is what Denver CASR suggested.
I may be wrong, but are we referring to the 0.5, 0.3, 0.2 experiment you'd mentioned earlier here? If yes, I'm currently writing the code for that experiment. 😄
If you are training the random forest on the existing replaced mode labels, you might want to write out the process carefully to make sure that we are not leaking information and that it is actually usable in the production scenario in terms of the data that we have available. Because that is indeed a very nice result 🎉
Thank you! Yes, I will cross-check and verify that there is no information leakage in the code. I will design robust, scalable preprocessing and inference modules so that it can be used seamlessly in production.
This is the label distribution we're working with:
Using this, I conduct 100 experiments with the following idea: in each iteration, permute the rows of Y_test and sample labels according to the label proportions above to form a predictions vector. Using the GT and the predictions, compute the F1-score for the iteration. The average test F1 using this setup is 0.19.
The reason why I permuted the rows of Y_test was to make the predictions more stochastic - we're not guaranteed to have the same label order at test time. If we're just blindly assigning predictions to samples without looking at the samples' attributes, then order should not matter.
Interesting. So the naive "just sample from the same random distribution" approach is pretty bad, even worse than logistic regression. The logistic regression might improve after you use the alternate modes, though
I will start looking into that 😄
Logging yesterday's work since I left for classes and forgot to update the thread!
For alternate modes, I figured that it'd be a good idea to start investigating the available_modes attribute. I faced two challenges:
To these questions, I proposed two ideas to myself:
- Since we want the set of modes OTHER than the current one (set A = {current mode} and set B = {U - A}), it might be a good idea to remove the current argmax section mode from the set of modes, so that we are left with all modes OTHER than the favored mode that are available to the user.
- I created the following mapping between section_modes and available_modes:
section_mapping = {
'car': 'car',
'walking': 'walk',
'no_sensed': 'no_travel',
'bicycling': 'p_micro',
'train': 'transit',
'bus': 'transit'
}
Now, when we read the set of the user's available modes, we remove the mode that was sensed and map the remaining modes using the mapping above. However, there are a few cases where the user has no other available modes other than the sensed mode. In these cases, the set of alternate modes is empty. To counteract this, I include another entry in the alternate mode called "none".
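This removal-plus-'none'-fallback step can be sketched set-wise (mode names as in the mapping above; the real available_modes strings may differ):

```python
# Set-based sketch of removing the sensed mode from the available modes and
# falling back to 'none' when nothing remains. Mode names are assumptions.
section_mapping = {
    'car': 'car', 'walking': 'walk', 'no_sensed': 'no_travel',
    'bicycling': 'p_micro', 'train': 'transit', 'bus': 'transit',
}

def alternate_modes(available_modes, sensed_argmax_mode):
    mapped = {section_mapping.get(m, m) for m in available_modes}
    mapped.discard(section_mapping.get(sensed_argmax_mode, sensed_argmax_mode))
    return mapped or {'none'}

alts = alternate_modes(['car', 'bus', 'walking'], 'car')   # {'transit', 'walk'}
only = alternate_modes(['car'], 'car')                     # {'none'}
```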
import pandas as pd

class AlternateModeEncoder:
    def __init__(self):
        self.keyset = ['transit', 's_car', 'p_micro', 'walk', 'ridehail',
                       'no_travel', 's_micro', 'car', 'none']
        self.mapper = dict(zip(self.keyset, range(len(self.keyset))))
        self.features_out = None

    def fit(self, X: pd.DataFrame):
        if X is None:
            raise AttributeError("Null dataframe")
        if X.shape[0] == 0:
            raise AttributeError("Empty dataframe")
        mapped = list()
        for _, row in X.iterrows():
            # multi-hot encode the alternate modes for this trip
            mode_vector = [0 for _ in range(len(self.keyset))]
            for m in row['alt_modes']:
                mode_vector[self.mapper[m]] = 1
            mapped.append(dict(zip(self.keyset, mode_vector)))
        self.features_out = pd.DataFrame(mapped, index=X.index)
        return self.features_out
A lot of optimization possible here, but I reckon this is a good starting point.
This is the head of the encoded feature dataframe:
Using the new data, the logistic model improves in performance.
Confusion Matrix:
The diagonals definitely have much higher values, which shows that the true positives have gone up. Incorporating my idea of alternate modes definitely seems to work favorably for the model.
The updated logistic model's per-class accuracies:
{'Not a Trip': 1.0, 'Other': 0.7581227436823105, 'Unlabeled': 0.602996254681648,
'car': 0.24798643493005512, 'ebike': 0.9230769230769231, 'no_travel': 0.28059536934950385,
'p_micro': 0.34683954619124796, 'ridehail': 0.6931818181818182, 's_car': 0.28604651162790695,
's_micro': 0.5, 'transit': 0.43573667711598746, 'walk': 0.396505376344086}
Compared to the previous performance:
{'Not a Trip': 1.0, 'Other': 0.6079136690647482, 'Unlabeled': 0.4606741573033708,
'car': 0.2076271186440678, 'ebike': 0.6923076923076923, 'no_travel': 0.16088154269972452,
'p_micro': 0.27228525121555913, 'ridehail': 0.42045454545454547, 's_car': 0.22531939605110338,
's_micro': 0.16666666666666666, 'transit': 0.25391849529780564, 'walk': 0.3602150537634409}
There is an improvement in every class' accuracy.
The random forest's performance also increased slightly!
Confusion matrix:
The new per-class accuracies:
{'Not a Trip': 1.0, 'Other': 0.7906137184115524, 'Unlabeled': 0.7116104868913857,
'car': 0.847392963119966, 'ebike': 0.6153846153846154, 'no_travel': 0.8379272326350606,
'p_micro': 0.7909238249594813, 'ridehail': 0.6477272727272727, 's_car': 0.6337209302325582,
's_micro': 0.3333333333333333, 'transit': 0.6394984326018809, 'walk': 0.7043010752688172}
As compared to the previous per-class performance:
{'Not a Trip': 1.0, 'Other': 0.6942446043165468, 'Unlabeled': 0.6179775280898876,
'car': 0.7974576271186441, 'ebike': 0.46153846153846156, 'no_travel': 0.7851239669421488,
'p_micro': 0.7358184764991896, 'ridehail': 0.6363636363636364, 's_car': 0.5598141695702671,
's_micro': 0.16666666666666666, 'transit': 0.6050156739811913, 'walk': 0.6666666666666666}
A consistent increase in accuracies is noted here as well.
Adding weather data improves the random forest's test F1 score by another 2%, which takes it to 80%!
The updated CM:
And the updated per-class test performance:
{'Not a Trip': 1.0, 'Other': 0.9028776978417267, 'Unlabeled': 0.846441947565543,
'car': 0.8732513777024162, 'ebike': 0.46153846153846156, 'no_travel': 0.86438809261301,
'p_micro': 0.7909238249594813, 'ridehail': 0.6590909090909091, 's_car': 0.6216530849825378,
's_micro': 0.25, 'transit': 0.6394984326018809, 'walk': 0.7701612903225806}
Other jumps from 0.79 -> 0.90, Unlabeled improves from 0.71 -> 0.84, and there is a slight improvement in car, but the ebike and s_micro performance drastically reduces from 0.61 -> 0.46 and 0.33 -> 0.25 respectively. walk gains a slight performance boost, going from 0.70 -> 0.77.
I can see that the model may be improving in overall F1, but we aren't uniformly improving - the improvement in some labels is coming at the cost of a decrease in performance in some other labels. The train-test splits used across both experiments were the same, and the only thing that was added was the weather attributes:
['temperature_2m (°C)', 'relativehumidity_2m (%)', 'dewpoint_2m (°C)', 'rain (mm)',
'snowfall (cm)', 'cloudcover (%)', 'windspeed_10m (km/h)']
According to RF's feature importance, dew point, temperature, and relative humidity are influential in the model's decision-making process. So what would happen if I remove those features and re-train the model?
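A sketch of that inspect-then-retrain experiment on synthetic data (sklearn's make_classification standing in for our dataset; feature names are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Sketch: read the RF's feature importances, drop a suspect subset of
# features, and refit. Synthetic data; 'f0'..'f7' are placeholder names.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
cols = [f'f{i}' for i in range(8)]
df = pd.DataFrame(X, columns=cols)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(df, y)
importances = pd.Series(rf.feature_importances_, index=cols)

suspects = importances.nlargest(3).index   # features suspected of spurious signal
rf_pruned = RandomForestClassifier(n_estimators=50, random_state=0)
rf_pruned.fit(df.drop(columns=suspects), y)
```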
Okay, so my intuition was somewhat on the right track - removing the unnecessary weather attributes not only maintained the F1 score, but also boosted the performance of the affected labels. 💪
{'Not a Trip': 1.0, 'Other': 0.8992805755395683, 'Unlabeled': 0.8352059925093633,
'car': 0.8753709198813057, 'ebike': 0.3076923076923077, 'no_travel': 0.8638368246968027,
'p_micro': 0.7974068071312804, 'ridehail': 0.6420454545454546, 's_car': 0.6181606519208381,
's_micro': 0.25, 'transit': 0.6394984326018809, 'walk': 0.760752688172043}
`s_micro` still stays at 0.25, but all the other affected labels now seem to be stable.
In addition to the above-mentioned weather variables, I dropped some more variables that were the least important to the model:
total_feature_set.remove('is_overnight_trip')
total_feature_set.remove('is_male')
total_feature_set.remove('start:is_weekend')
total_feature_set.remove('end:is_weekend')
New label performances:
{'Not a Trip': 1.0,
'Other': 0.8992805755395683,
'Unlabeled': 0.8426966292134831,
'car': 0.8732513777024162,
'ebike': 0.38461538461538464,
'no_travel': 0.8627342888643881,
'p_micro': 0.8022690437601296,
'ridehail': 0.6647727272727273,
's_car': 0.6204889406286379,
's_micro': 0.25,
'transit': 0.6489028213166145,
'walk': 0.7620967741935484}
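As an aside, per-class dictionaries like the ones above can be produced with scikit-learn's per-class F1 scores. This is a minimal sketch with toy labels, not the actual pipeline output:

```python
# Sketch: build a {label: F1} dict from true/predicted labels using
# sklearn's per-class F1 (average=None returns one score per label).
from sklearn.metrics import f1_score

y_true = ['car', 'walk', 'car', 'transit', 'walk', 'car']
y_pred = ['car', 'walk', 'walk', 'transit', 'walk', 'car']

labels = sorted(set(y_true) | set(y_pred))
per_class = dict(zip(labels, f1_score(y_true, y_pred, labels=labels, average=None)))
print(per_class)  # {'car': 0.8, 'transit': 1.0, 'walk': 0.8}
```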
Couple of high-level comments:
For alternate modes, I figured that it'd be a good idea to start investigating the available_modes attribute. I faced two challenges:
This may be fine for the random forest model, but I am not sure you are using the alternate modes correctly for the logistic regression model. It is not 100% clear from the description above how you incorporate the alternate modes into the feature set. Can you clarify? What are the coefficients of the logistic regression model, for example?
but the ebike and s_micro performance drastically reduces from 0.61 -> 0.46 and 0.33 -> 0.25 respectively. Walk gains a slight performance boost, going from 0.7 -> 0.79
The number of labels for `s_micro` as a replaced mode is likely very small (you should verify). I am not sure that you will ever get a great result for it. Does combining small% labels into something like "other" help? With random forest, I am not sure that it will, since then the "other" labels will just have more complex rules for their prediction. But it may help with logistic regression.
That brings up a higher-level question on the use of this model. As I am sure you know, we deploy OpenPATH in multiple locations. The dataset you are using is from only one of those locations. Since this is a behavior/choice model and not a sensor-based model, we need to think through how we plan to train and deploy this model for different locations.
Today, I plan on restructuring the pipeline for ease of reproducibility. Specifically, I will add documentation and comments wherever necessary and commit the code to my forked repository, allowing for ease of review.
@rahulkulhalli can we address the first two questions above before restructuring and committing?
Definitely, Dr. Shankari. I am formulating my responses to your comments right now. I will not start restructuring without receiving a go-ahead from you.
This may be fine for the random forest model, but I am not sure you are using the alternate modes correctly for the logistic regression model. It is not 100% clear from the description above how you incorporate the alternate modes into the feature set. Can you clarify? What are the coefficients of the logistic regression model, for example?
Definitely, Dr. Shankari. I use the `available_modes` feature for determining alternate modes. I will try to explain using some code and my rationale behind the implementation:
mode_mapping = {
'Public transportation (bus, subway, light rail, etc.)': 'transit',
'Get a ride from a friend or family member': 's_car',
'Bicycle': 'p_micro',
'Walk/roll': 'walk',
'Taxi (regular taxi, Uber, Lyft, etc)': 'ridehail',
'None': 'no_travel',
'Shared bicycle or scooter': 's_micro',
'Rental car (including Zipcar/ Car2Go)': 'car',
'Skateboard': 'p_micro',
'Do not have vehicle ': 'no_travel'
}
First, I map the available modes into our target feature labels. There are some reasons why I don't directly map from the available mode to the section mode:
Instead, what I chose to do was to map both the section modes as well as the available modes into the same label space as the target labels. This allows for a closer mapping for the available modes.
section_mapping = {
'car': 'car',
'walking': 'walk',
'no_sensed': 'no_travel',
'bicycling': 'p_micro',
'train': 'transit',
'bus': 'transit'
}
Similarly, the argmax-ed section modes are also mapped to the target label space.
Once both the features are mapped to a normalized space, I remove the current argmax-ed mapped feature from the mapped available modes and return them as a binary feature vector.
As a concrete example,
argmax_section_mode = "car"
mapped_section_mode = ["car"]
available_modes = ['Public transportation (bus, subway, light rail, etc.)', 'Get a ride from a friend or family member', 'Shared bicycle or scooter', 'Walk/roll', 'Taxi (regular taxi, Uber, Lyft, etc)']
mapped_available_modes = ['transit', 's_car', 's_micro', 'walk', 'ridehail']
# Remove the mapped_section_mode from the mapped_available_modes. What we are left with would be the modes that this user would use if their current mode was not available.
# Convert the mapped modes to a feature vector, where 1 indicates the presence of a mode and 0 indicates the absence.
An example of what this vector would look like: [0, 1, 0, 1, 1, 0, 0, 0, 0]
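The steps above can be sketched as follows. The label ordering in `TARGET_LABELS` and the helper name `encode_alternate_modes` are assumptions for illustration; the actual ordering in the pipeline may differ:

```python
# Sketch: map available modes into the target label space, drop the
# (mapped) sensed section mode, and one-hot encode the remainder.
# TARGET_LABELS ordering is an illustrative assumption.
TARGET_LABELS = ['car', 's_car', 'ridehail', 'transit', 'p_micro',
                 's_micro', 'walk', 'no_travel', 'ebike']

mode_mapping = {
    'Public transportation (bus, subway, light rail, etc.)': 'transit',
    'Get a ride from a friend or family member': 's_car',
    'Bicycle': 'p_micro',
    'Walk/roll': 'walk',
    'Taxi (regular taxi, Uber, Lyft, etc)': 'ridehail',
    'None': 'no_travel',
    'Shared bicycle or scooter': 's_micro',
    'Rental car (including Zipcar/ Car2Go)': 'car',
    'Skateboard': 'p_micro',
}

def encode_alternate_modes(available_modes, mapped_section_mode):
    # Normalize available modes into the target label space.
    mapped = {mode_mapping[m] for m in available_modes if m in mode_mapping}
    # Remove the current (sensed) mode; what remains are the alternates.
    mapped.discard(mapped_section_mode)
    # Binary indicator vector over the fixed label ordering.
    return [1 if label in mapped else 0 for label in TARGET_LABELS]

vec = encode_alternate_modes(
    ['Public transportation (bus, subway, light rail, etc.)',
     'Shared bicycle or scooter', 'Walk/roll'],
    'car')
print(vec)  # [0, 0, 0, 1, 0, 1, 1, 0, 0]
```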
The number of labels for s_micro as a replaced mode is likely very small (you should verify). I am not sure that you will ever get a great result for it. Does combining small% labels into something like "other" help? With random forest, I am not sure that it will, since then the "other" labels will just have more complex rules for their prediction. But it may help with logistic regression.
Yes, I agree. `Not a Trip`, `s_micro`, and `ebike` have the fewest occurrences in the dataset. I will try combining them with `Other` and check the logistic model's performance, and report on what I observe in the parameters as well as the per-class performance.
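A minimal sketch of the label-merging step, assuming the sparse labels identified above (the set and the `merge_sparse` helper are illustrative):

```python
# Sketch: fold the sparsest replaced-mode labels into 'Other' before
# retraining, as discussed above. The SPARSE set is an assumption.
from collections import Counter

SPARSE = {'Not a Trip', 's_micro', 'ebike'}

def merge_sparse(labels, sparse=SPARSE, merged='Other'):
    return [merged if y in sparse else y for y in labels]

y = ['car', 's_micro', 'walk', 'ebike', 'Not a Trip', 'car']
print(Counter(merge_sparse(y)))  # Counter({'Other': 3, 'car': 2, 'walk': 1})
```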
Definitely, Dr. Shankari. I use the available_modes feature for determining alternate modes. I will try and explain using some code and my rationale behind the implementation:
I saw this in the previous commits as well (https://github.com/e-mission/e-mission-docs/issues/978#issuecomment-1759726679). What I want to know is how you are using them in the features after the mapping.
After training the new logistic model, the performance goes up slightly.
To find which model parameter is the most sensitive, we could try perturbing the inputs of each feature and measuring the difference in performance (`score_with_perturbation - score_without_perturbation`). The feature with the highest sensitivity is likely the one the model deems most significant.
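This perturbation idea is essentially what scikit-learn's `permutation_importance` implements: it shuffles one feature at a time and reports the resulting score drop. A minimal sketch on synthetic data (the dominant-feature setup is purely illustrative):

```python
# Sketch: permutation-based sensitivity. Feature 0 drives the label,
# so shuffling it should cause the largest drop in accuracy.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # largest value at index 0
```

Note the sign convention: `permutation_importance` reports the score *drop* (baseline minus perturbed), so an important feature gets a large positive value.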
Creating this issue to document my observations, readings, and development efforts towards building a solution for predicting the replaced mode in the absence of inferred labels.