Predicting replaced mode for trips with no inferred labels

rahulkulhalli commented 9 months ago

Creating this issue to document my observations, readings, and development efforts towards building a solution for predicting the replaced mode in the absence of inferred labels.

shankari commented 9 months ago

@rahulkulhalli

Interestingly, they use the Distance Matrix API from GMaps to query estimated travel time.

This is actually what Bingrong uses in her modeling as well. One challenge is the cost for querying large numbers of travel times on Google Maps. Zack tried this approach for a subset of the population, but was not able to get stable results even with it. You might want to experiment with

So we're basically doing cost[section] = cost_factors_init[section] + (cost_factors[section] * distance[section])

the current implementation is using labels. Labels are currently at the trip level. So the current implementation is cost[trip].... Please review the OpenPATH data model (chapter 5 of my thesis).

In that case, could we could use section_modes and section_distances (remember what @shankari said above - if working at the trip level, take the maximum or work on the section level)?

Correct, we should not use labels while creating features. Yes, we can consider using the section_modes and section_distances at the trip level.

Retain the original mapping:

Why? I am not opposed to this, but in general, I want to see the thought process behind the design decisions - an outline of the pros and cons, weighing them and a final decision. I can see a con of this approach when it comes to MLOps/deploying on production, so would like to see the reasons for the pros to outweigh them.

rahulkulhalli commented 9 months ago

@shankari

My main line of thought was to merge the modes with lower value counts so as to try and mitigate major class imbalance. I actually did do something along the same lines in one of my previous research endeavors (ISIC 2018) where I merged all the sparse labels into one and created a hierarchical pipeline of binary classifiers.

We could:

a) Derive inspiration from the other literature and aim to classify the replacement mode into {walk, bike, car, public transit} Pros: We avoid sparse labels and can drive up the volume for each existing label, resulting in more instances for the classifier to try and generalize Cons: We lose out on granularity - i.e., "gas car, drove alone" and "gas car, with others" would be treated the same, when in reality they should not. We would also need to figure out where labels like "Uber/Lyft" should be mapped. Is it a form of public transit or is it a car transit? If it's the latter, there might be a higher cost associated with the trip than using a regular car, which may throw the classifier off. Similarly, public transport vehicles generally make multiple stops before arriving at the destination. Since there are rarely any stops between the source and destination when we hail an Uber/Lyft, the classifier may be thrown off.

b) So then how about {walk, bike, car, public transit, uber}? That alleviates one of the issues mentioned above, but now it introduces another issue - how many instances of uber/lyft are seen in the dataset? If they are extremely sparse, adding a label would be counter-intuitive.

rahulkulhalli commented 9 months ago

This is just a rough plot, but I wished to visually see how the labels are distributed.

drove_alone                              23612
no_travel                                18189
shared_ride                               8618
walk                                      7449
bike                                      6056
bus                                       3103
Unlabeled                                 2676
taxi                                      1767

I do, however, acknowledge that I'm performing this analysis at a global level when in fact, the models will be user-level.

shankari commented 9 months ago

I do, however, acknowledge that I'm performing this analysis at a global level when in fact, the models will be user-level.

Why would the models be user-level? That is not what we have discussed earlier. Note that each user has one set of demographic labels. Again, I would like to see a high level outline of the proposed modeling process, including the level and the features.

b) So then how about {walk, bike, car, public transit, uber}? That alleviates one of the issues mentioned above, but now it introduces another issue - how many instances of uber/lyft are seen in the dataset? If they are extremely sparse, adding a label would be counter-intuitive.

the con from an MLOps/production perspective is that collapsing the values would make it harder to use this to calculate the downstream metrics such as the one below because we would not have as precise a carbon intensity for the replaced modes.

Note that none of the papers that you have outlined discuss the use of the model outputs. There is a tradeoff between model accuracy and its ability to provide useful outputs. That might be an area we explore in the paper by using the same approach on collapsed/non-collapsed labels.

rahulkulhalli commented 9 months ago

That call put a lot of things in perspective for me!

To recount:

We are NOT building on top of the LabelAssist model!
A mode choice model is something that models human choice-making behavior, not just transport modes.
What we're trying to do here is to model a user's preference (cost and time) given their travel history
Why do we need demographics? We want to see OTHER people in similar situations and who think and act in the same way. This allows us to make generalized decisions.
mode_confirm will NOT be a valid input variable to the model at inference time.
Point 4 is a solid reason why user-centric models are NOT to be used. You can't compare with other users in a localized model!
Referencing the graph above, replaced_mode is EXTREMELY valuable for downstream metrics and end-users because it allows us to show how much carbon emissions can be saved if they switched from their primary mode of transport to a more energy efficient mode
The cost factors in the manuscript have been taken from VTPI (if I'm not mistaken) and should be the ones used for cost calculation
Choose the sensed mode with the highest distance AND compute the cost of that mode only. It doesn't make sense to choose a single mode and compute the cost of the entire trip
Dr. @shankari is very environmentally aware!

rahulkulhalli commented 9 months ago

Since we're not using mode_confirm, I'd like to start computing the cost factor using our sensed modes. Here are the unique sensed modes we have at hand:

['walking', 'bicycling', 'car', 'no_sensed', 'bus', 'train',
       'air_or_hsr']

The paper details costs per mile for the following modes (based on mode_confirm): Car, Shared Car, Ridehailing, Shared Micromobility, Transit

For our case, bus and train can be mapped to transit and car can be mapped to car. I'm assuming walking would have a cost of 0 and bicycling would also be ~0 (if not 0). @shankari Is this assumption correct?

shankari commented 9 months ago

@rahulkulhalli The replaced mode paper was never published, so I would use it as a source of sources, but not as a citable source directly. i would check the cost source (VTPI) to see if they have separate values for bus and train and use them if they do, falling back to transit if they don't. NREL's MEP tool is another source of cost/time estimates - that does have a peer reviewed publication that you can cite.

e-bikes do have a small cost component, but it is negligible. However, I believe that either MEP or Bingrong's energy analysis used an estimate for e-bikes that was generated by Andy Duvall. We don't distinguish between e-bikes and bikes in the sensed mode, so it may be fine to go with 0. I would check that assumption with Andy before committing to it.

shankari commented 9 months ago

Couple of expansions, hope they are not more confusing:

Why do we need demographics? We want to see OTHER people in similar situations and who think and act in the same way. This allows us to make generalized decisions.

I think that this is actually a research question. Once we have a baseline model, I think that there are a few factors to explore around sensitivity analysis:

user versus generalized: [1]
basemode of the mode_confirm (where present) for training ONLY

[1] For prior mode choice models, (you should check the related work), the models are typically built over a small per-user dataset (on the order of days or at most one week). But we have data over months. So we can build an individual model (not of trips, but of individual preferences around cost/time)) based on the users revealed preferences. It is not a-priori clear that this will do better than a demographic model - we will have less data per user, but we will not be subject to stereotypes from demographics either

rahulkulhalli commented 9 months ago

@shankari The following table is captured directly from VTPI's "Cost and Benefit Analysis" survey. I think this gives us a great summary of average costs/mile.

This table illustrates costs/mile for 'regional rail', 'light rail' and 'heavy rail'. The survey does not mention what the difference between them is. According to the National Transit Database's glossary:

Since our data mainly concerns Denver, I'm assuming we should be selecting Light Rail. Would that be a valid assumption?

Also, I will reach out to Andy and ask him if our assumption about the e-bikes is valid.

rahulkulhalli commented 9 months ago

I'm thinking about how we could discern the type of car, since it could be either of car sharing, uber/lyft, or gas. Depending on the type of car, we may assign the cost. What could we do from the available data, then?

Calculate the average cost of all car modes and assign that to all car mode instances? The obvious drawback of this approach is that we generalize the cost factor across every user.
Use the section_mode information? If the section_modes for a user are [walk, car, walk], could we glean any information about the type of car? Maybe, depending on how long/far the user walked before switching to the car mode. If the user walks for a long time and then switches to the car mode, we may assume that the user hailed a Taxi/Uber. If the duration/distance of the first walk instance is below some threshold, we may assume that the user walked to the driveway. However, people may also hail Ubers/Lyfts directly from their doorstep. Hmm.., I'm not sure about this.
Look at the frequency/distance that was traveled using the car mode? Users would know that long distance travels are infeasible using taxis or Ubers, so it may be more likely that longer distances correspond to the usage of personal cars?

These are just rough ideas. @shankari, any feedback?

rahulkulhalli commented 9 months ago

I've completed the method to compute the estimated cost. For now, the values are as follows:

mode_cost_per_mile = {
    'walking': 0,
    'bicycling': 0,
    'car': 0.6,
    'no_sensed': 0,
    'bus': 1.59,
    'train': 2.62
}

For now, I've also treated the init_costs for all modes to be 0.

The bus and train costs have been taken from the VTPI report above. We're headed towards build the baseline model, but I need to think about the label space. Currently, we have the following unique replaced_mode labels:

array(['bike', 'walk', 'no_travel', 'Unlabeled', 'shared_ride',
       'drove_alone', 'train', 'scootershare', 'taxi', 'plane',
       'free_shuttle', 'skateboard', 'bikeshare', 'e_bike', 'bus',
       'not_a trip', 'zip_line', 'golf_cart', 'ski', 'run', 'air',
       'emergency_vehicle with others', 'ebike', 'gondola',
       'snowboarding', 'call_friend', 'no_replacement', 'na',
       'i_walked to the toilet', 'zoo', 'doing_nothing', 'hiking',
       'pilot_bike', 'not_accurate', 'not_a_trip', 'stolen_ebike',
       'pilot_ebike', 'delivery', 'e-bike',
       'walk_at home and drive car alone home', 'n/a', 'testing',
       'pilot_e-bike', 'must_walk 3-5 mi a day for back', 'lunch', 'meal',
       'home', 'family', 'entertainment', 'not_a trip, app malfunction',
       'time_spent on the clock at amazon', 'working', 'walking_at work',
       'kaiser', 'walk_at work', 'sitting_on my butt doing nothing',
       'nothing._delivered food for work'], dtype=object)

I could:

merge all the unnecessary labels (stolen_ebike, kaiser, zoo, etc.) into one label called 'Other'

We would retain more data this way and add a 'catch-all' mode, but I'm unsure whether this class would be interpretable in any downstream analytics

Drop the observations with the unnecessary labels

We'd only be left with the important samples, but we will lose some data points.

rahulkulhalli commented 9 months ago

w.r.t discerning the type of car, I also tried the following:

All instances where > 90% of the trip is covered by car is a personal car, Uber/lyft otherwise. The obvious drawback of using this approach is that we're making an ill-informed judgement using a heuristic that is NOT backed-up with data. Most of the observations fell under the second category. The following image is the percentage of trip distance taken by car across the entire dataset:

I thought about whether the scenario where the previous section_modes could inform the type of car being driven. However, that too suffers from a logical pitfall: what if the user orders the Uber at the doorstep? It would still capture walk as the first mode and car as the second.

Here, as we can see, the number of [car] instances is ~2.5x the number of [car. walking] trips and ['walking', 'car', 'walking'] has only 1269 instances.

['car']                                 23413
['walking']                             20428
['car', 'walking']                       9722
['bicycling']                            6103
['no_sensed']                            2562
['bicycling', 'walking']                 2249
['walking', 'car', 'walking']            1269

I am currently trying this approach and I think this is the most well-informed of the bunch:

Using the demographic information, we can compute the total number of people eligible to drive a car using number of people in household - number of people under the age of 18. Since we also know how many motor vehicles the family has access to, we may postulate that the user used their own car to drive if number of motor vehicles >= number of people eligible to drive. This theory may also be prone to logical holes, but I feel this one is at least informed using data and is a good starting point.
Another possible idea may be to look at the trip at the section level to incorporate not just the distance, but the duration.

rahulkulhalli commented 9 months ago

Using the approach using the demographic information, I get the following distribution of car types. This seems like a better representation to me and unless there's an implementation issue, I think this is a pretty realistic heuristic. I shall conduct some more analysis today.

shankari commented 9 months ago

There is a third approach, which is to collapse the modes down into a set of base modes. See the "I know I'm right" paper and the energy estimation paper that you co-authored. Both of them work primarily with base modes.

We don't have a different cost estimate for taxi anyway Your model above doesn't distinguish between personal car and shared ride, which makes a large impact on the emission outcomes.

So what is the point of modeling uber versus car?

rahulkulhalli commented 9 months ago

My main idea to differentiate Uber v/s car was for the cost estimate of the trip. According to VTPI:

Taxies typically charge $2.00 to $4.00 per mile, depending on type of service (standard or premium), trip length, location, and time. Ridehailing services typically charge 20-40% less, from $1.50 to $3.00 per mile

In contrast, the per-mile cost for personal car usage is $0.6/mile. Since this might make a significant impact in cost estimates, I figured we could try to distinguish between the type of car mode used.

shankari commented 9 months ago

I guess I am still not convinced that this is backed up by data.

This seems like a better representation to me and unless there's an implementation issue, I think this is a pretty realistic heuristic.

I see that you have a distribution of car types using this approach, but have not validated it against the labels that users provided. What is the accuracy of your heuristic-based mini-model compared to the user labels?

As a concrete counter-example, my family has one car and two people will licenses. We just take turns driving the one car. I have not taken a taxi for at least a year.

shankari commented 9 months ago

Also, I'm confused - are you trying to estimate the mode or the replaced mode? So far, your rationale:

Using the demographic information, we can compute the total number of people eligible to drive a car using number of people in household - number of people under the age of 18. Since we also know how many motor vehicles the family has access to, we may postulate that the user used their own car to drive if number of motor vehicles >= number of people eligible to drive. This theory may also be prone to logical holes, but I feel this one is at least informed using data and is a good starting point.

seems to be focused on estimating the actual mode

rahulkulhalli commented 9 months ago

@shankari For the current analysis, I am trying to estimate the type of the car mode from the section_modes.

My main idea to differentiate Uber v/s car was for the cost estimate of the trip. According to VTPI:

Taxies typically charge $2.00 to $4.00 per mile, depending on type of service (standard or premium), trip length, location, and time. Ridehailing services typically charge 20-40% less, from $1.50 to $3.00 per mile

In contrast, the per-mile cost for personal car usage is $0.6/mile. Since this might make a significant impact in cost estimates, I figured we could try to distinguish between the type of car mode used.

My rationale was:

Since cost is directly dependent on the transport mode, if we are able to discern the type of car used, we can avoid assigning a generalized cost to all the instances where a car mode was used. Since we don't use Mode_confirm for cost estimation anymore, I assumed that demographic information could help us with a rough idea of the type of car transport used.

shankari commented 9 months ago

ah so this was for the cost estimate of the alternatives, got it! However, if you are building a mini-model to estimate the trip travel mode, you should also validate its (mini) correctness.

rahulkulhalli commented 9 months ago

sad trumpet noises *

It seems that the heuristics I thought of don't match the actual confirmed distribution. I'd like to also give some description of the heuristics themselves:

Car distance ratio:
    compute ratio of distance traveled per mode
    if the ratio of car usage is >= 0.9, label as "Car"
    if it is < 0.9, label as "Uber"
    if car is not present in the section modes at all, return "None"

Demographic info:
    Use the number of motor vehicles owned and the number of residents with a valid license:
    If n_motor_vehicles >= n_users_with_license, label as "Car"
    If n_motor_vehicles < n_users_with_license, label as "Uber"
    if car is not present in the section modes at all, return "None"

Demographics + car distance ratio:
    If the ratio of car usage is >= 0.9 AND n_motor_vehicles >= n_users_with_license, label as "Car"
    Else, label as "Uber"
    if car is not present in the section modes at all, return "None"

rahulkulhalli commented 9 months ago

With your permission, I'd like to try and model a mini logistic model for determine the car type using Mode_confirm as a target variable. Once the model is trained, we may discard the model and use the underlying coefficients for future use. I wish to only see if I can extract any sort of meaningful information from the model.

rahulkulhalli commented 9 months ago

Models are training. Will report on the results when available! 🤞

rahulkulhalli commented 9 months ago

The multinomial logistic model performs pretty poorly with the best score across 3 CV folds to be a weighted F1 of about ~0.53.

The confusion matrix entries for the test data reveal some information about the individual classes - Gas car, drove alone has an accuracy of ~51%, Gas car, with others has an accuracy of ~35%, and Uber/Lyft has an accuracy of ~72%.

rahulkulhalli commented 9 months ago

The Gradient-boosted classifier, on the other hand, does much better.

Gas car, drove alone is predicted correctly ~80% of the time, Gas car, with others is predicted correctly ~84% of the time, and funnily enough, Taxi/Uber is what this model struggles the most with, with an accuracy of 50%.

rahulkulhalli commented 9 months ago

Now, we come to the Random forest model. It performs on par with the gradient-boosted model, and in fact, it does better on Uber/Lyft class.

Accuracy for Gas car, drove alone: 80%, Accuracy for Gas car, with others: 78%, Accuracy for Taxi/Uber: 60%. Admittedly, it's performance for the last class isn't as good as the logistic model (with an accuracy of 72% on Uber), but it may be a good 'in-between' model, retaining the training time of the logistic model and the performance of the gradient-boosted classifier.

One thing I observed is that Zack's notebook only uses the prepilot data for training. Is this what I should also stick to? Or should I expand this analysis on the entire Stage, 4c, fc, cc,... prepilot data? The prepilot has 235 users, whereas the survey consists of 202 users and they have 170 common users among them.

shankari commented 9 months ago

Gas car, drove alone is predicted correctly ~80% of the time, Gas car, with others is predicted correctly ~84% of the time, and funnily enough, Taxi/Uber is what this model struggles the most with, with an accuracy of 50%.

That is interesting. Couple of thoughts:

given that uber % is very small (< 5%) in the original dataset, how much difference does it likely to make to separate it out? Can we just use a weighted average for the cost instead? can you sketch out how you plan to use this feature going forward so that we can assess whether it is reasonable or not?
if we do want to use the output of this model as the input to another model (the replaced mode model), we will need to take the error rate of this model into account. As we know, propagating inferences is fairly complicated. I have not seen prior instances of such "stacked models".
- if we need to make the distinction (aka after answering the first question) it would be good to see if there are known examples of it, and how well they work. Or is this another research topic?

One thing I observed is that Zack's notebook only uses the prepilot data for training. Is this what I should also stick to? Or should I expand this analysis on the entire Stage, 4c, fc, cc,... prepilot data? The prepilot has 235 users, whereas the survey consists of 202 users and they have 170 common users among them.

I bet it only uses pilot data (not prepilot) for testing. The actual prepilot only has 13 users. If the variable is called prepilot it is wrong. The stage users were not part of the ebike programs, and did not have access to ebikes. So their replaced modes are somewhat irrelevant. That is probably why he excluded them

rahulkulhalli commented 9 months ago

Sure, @shankari.

Before sketching out my plan on using the results from these models, I would like to explain the experiment set-up:

The current models are trained on all data points that have at least one car section mode reading. The indepedent variables that this model uses are:

'income_category', 'n_motor_vehicles', 'n_residents_with_license', 'is_male', 'age', 'sin_HOD', 'sin_DOM', 'cos_HOD', 'cos_DOM', 'car_distance_miles'

income_category is one-hot encoded
Numerical features are normalized using StandardScaler
HOD = hour of day
DOM = day of month
car_distance_miles is the total distance traveled by the car mode (in miles)

Once we obtain predictions for Gas car, drove alone, Gas car, with others, and Taxi/Uber, we can use them for weighing the costs for the cost-time baseline model.

My main inspiration for the nested model was derived from nested logit models, which are used fairly commonly in mode choice modeling - https://transp-or.epfl.ch/courses/ANTWERP07/08-nested.pdf

This is a rough illustration of my idea:

rahulkulhalli commented 9 months ago

I think using a weighted average for the cost would be helpful in creating a baseline model. All our future incremental work could be compared to this model's performance. I will go ahead with the baseline hourly car expense ($0.6/mile) and create the baseline replacement mode model.

I concede that we may be adding too many cogs in the machine at once, which might make it difficult to trace any potential point of failure or optimization. To the best of my recollection, most of the literature I've read for travel mode choice preemptively has an idea of the type of car transport used. However, I think it is a good idea to assume uniform cost for all car modes for a start and create the baseline model first.

rahulkulhalli commented 8 months ago

For the baseline model, I chose to go with the following features (the existing replaced mode mapping is being used here. They are being collapsed into a set of base modes: 'car', 's_car (shared car)', 's_micro (shared micromobility)', 'p_micro (personal micromobility)', 'walk', 'ridehail (Uber/taxi)', 'ebike', 'no_travel'.

X = modeling_data[['estimated_cost', 'section_distance_argmax_miles', 'duration', 'distance_miles',
    'age', 'n_residence_members', 'n_residents_with_license', 'is_male', 'sin_HOD', 'sin_DOM',
    'cos_HOD', 'cos_DOM', 'income_category', 'n_motor_vehicles', 'is_weekend', 'section_mode_argmax'
]].copy()

Y = modeling_data[['Replaced_mode']].copy()

Tried using the logistic classifier, but got the following stats:

Best validation F1 for Logistic model: 0.27480382039025203
Test F1 for Logistic model: 0.2744886353547929

That does not look promising at all. Either the logistic models are not converging (since maxiters was set to 2000 only), or the model parameters are insufficient to model the target variable. The confusion matrix gives us some more information:

The per-class accuracies are:

{'Not a Trip': 1.0, 'Other': 0.6079136690647482, 'Unlabeled': 0.4606741573033708, 'car': 0.2076271186440678, 'ebike': 0.6923076923076923, 'no_travel': 0.16088154269972452, 'p_micro': 0.27228525121555913, 'ridehail': 0.42045454545454547, 's_car': 0.22531939605110338, 's_micro': 0.16666666666666666, 'transit': 0.25391849529780564, 'walk': 0.3602150537634409}

Clearly, the logistic model is struggling. It also struggles with the majority classes 'car' and 'no_travel'.

rahulkulhalli commented 8 months ago

Using the random forest gives us a remarkable jump in performance - the best weighted F1 goes up from 0.27 to 0.72!

Best validation F1 for RF model: 0.7216160907675281
Test F1 for RF model: 0.7234734296071574

The confusion matrix shows us the model's prowess - much stronger confidences per class and much lesser inter-class variance as compared to the logistic model.

{'Not a Trip': 1.0, 'Other': 0.6942446043165468, 'Unlabeled': 0.6179775280898876, 'car': 0.7974576271186441, 'ebike': 0.46153846153846156, 'no_travel': 0.7851239669421488, 'p_micro': 0.7358184764991896, 'ridehail': 0.6363636363636364, 's_car': 0.5598141695702671, 's_micro': 0.16666666666666666, 'transit': 0.6050156739811913, 'walk': 0.6666666666666666}

It still struggles in the shared micro-mobility and e-bike (~46%), but it is still a marked improvement over the previous model. However, that could be explained because of the dearth of the samples found in the training set:

s_micro and ebike are the two least frequently occurring labels in the dataset.

Trying the gradient-boosted model now.

shankari commented 8 months ago

@rahulkulhalli I am not sure you are using the logistic regression model correctly. I see the featurization, but can you expand on how you you fill the row? Note our prior discussion around choice modeling, although you have not recorded details on the alternative trips. https://github.com/e-mission/e-mission-docs/issues/978#issuecomment-1739885897

Using the random forest gives us a remarkable jump in performance - the best weighted F1 goes up from 0.27 to 0.72!

If you are training the random forest on the existing replaced mode labels, you might want to write out the process carefully to make sure that we are not leaking information and that it is actually usable in the production scenario in terms of the data that we have available. Because that is indeed a very nice result :tada:

s_micro and ebike are the two least frequently occurring labels in the dataset.

EDIT: As an even simpler baseline, have you considered simply assigning the replaced mode in the label proportions? That is what Denver CASR suggested.

So you have 100 labeled trips with the following proportions of replaced mode: 0.5 car, 0.3 bus, 0.2 walk You have 500 unlabeled trips, you assign them randomly as 0.5 car, 0.3 bus, 0.2 walk How far off are you?

If the label distribution is very skewed (as in the case above), this may already give you good results. For example, if the replaced mode was 90% car, and you wrote an "algorithm" that just set all the predicted labels to car, you would get an accuracy of 90%

rahulkulhalli commented 8 months ago

So you have 100 labeled trips with the following proportions of replaced mode: 0.5 car, 0.3 bus, 0.2 walk You have 500 unlabeled trips, you assign them randomly as 0.5 car, 0.3 bus, 0.2 walk How far off are you?

I think that would depend on whether the 500 unlabeled trips come from the same distribution. If the distribution changes, we might be adding inductive bias. However, this is an interesting experiment - I can definitely try this on our dataset and report the results.

As an even simpler baseline, have you considered simply assigning the replaced mode in the label proportions? That is what Denver CASR suggested.

I may be wrong, but are we referring to the 0.5, 0.3, 0.2 experiment you'd mentioned earlier here? If yes, I'm currently writing the code for that experiment. 😄

If you are training the random forest on the existing replaced mode labels, you might want to write out the process carefully to make sure that we are not leaking information and that it is actually usable in the production scenario in terms of the data that we have available. Because that is indeed a very nice result 🎉

Thank you! Yes, I will cross-check and verify that there is no information leakage in the code. I will design robust, scalable preprocessing and inference modules so that It can be used seamlessly on production.

rahulkulhalli commented 8 months ago

This is the label distribution we're working with:

Using this, I conduct 100 experiments with the following idea:

Permute the rows of the Y_test dataframe at every iteration
Sample from the label distribution and create the predictions vector. Using the GT and the predictions, compute the F1-score for the iteration.

The average test F1 using this setup is 0.19.

The reason why I permuted the rows of Y_test was to add make the predictions more stochastic - we're not guaranteed to have the same label order at test time. If we're just blindly assigning predictions to samples without looking at the samples' attributes, then order should not matter.

shankari commented 8 months ago

Interesting. So the naive "just sample from the same random distribution" approach is pretty bad, even worse than logistic regression. The logistic regression might improve after you use the alternate modes, though

rahulkulhalli commented 8 months ago

I will start looking into that 😄

rahulkulhalli commented 8 months ago

Logging yesterday's work since I left for classes and forgot to update the thread!

For alternate modes, I figured that it'd be a good idea to start investigating the available_modes attribute. I faced two challenges:

How may I incorporate this attribute into the existing training data?
How do I map the existing available modes in a meaningful way?

To these questions, I proposed two ideas to myself:

Since we're thinking about alternate modes that the person may have at their disposal (similar to this: set A = {current mode} and set B = {U - A}), it might be a good idea to remove the current argmax section mode from the set of modes, so that we are left with all modes OTHER than the favored mode that are available to the user.
I came up with the following mapping scheme to standardize the section_modes and available_modes:

section_mapping = {
    'car': 'car',
    'walking': 'walk',
    'no_sensed': 'no_travel',
    'bicycling': 'p_micro',
    'train': 'transit',
    'bus': 'transit'
}

Now, when we read the set of the user's available modes, we remove the mode that was sensed and map the remaining modes using the mapping above. However, there are a few cases where the user has no other available modes other than the sensed mode. In these cases, the set of alternate modes is empty. To counteract this, I include another entry in the alternate mode called "none".

rahulkulhalli commented 8 months ago


class AlternateModeEncoder:
    def __init__(self):
        self.keyset = ['transit','s_car','p_micro','walk','ridehail','no_travel','s_micro','car','none']
        self.mapper = dict(zip(self.keyset, range(len(self.keyset))))
        self.features_out = None

    def fit(self, X: pd.DataFrame):
        if X is None:
            raise AttributeError("Null dataframe")
        if X.shape[0] == 0:
            raise AttributeError("Empty dataframe")

        mapped = list()
        for _, row in X.iterrows():
            mode_vector = [0 for _ in range(len(self.keyset))]
            for m in row['alt_modes']:
                mode_vector[self.mapper[m]] = 1

            mapped.append(dict(zip(self.keyset, mode_vector)))

        self.features_out = pd.DataFrame(mapped, index=X.index)
        return self.features_out

A lot of optimization possible here, but I reckon this is a good starting point.

This is the head of the encoded feature dataframe:

rahulkulhalli commented 8 months ago

Using the new data, the logistic model improves in performance.

Confusion Matrix:

The diagonals definitely have much higher values, which shows that the true positives have gone up. Incorporating my idea of alternate modes definitely seems to work favorably for the model.

The updated logistic model's per-class accuracies:

{'Not a Trip': 1.0, 'Other': 0.7581227436823105, 'Unlabeled': 0.602996254681648, 
'car': 0.24798643493005512, 'ebike': 0.9230769230769231, 'no_travel': 0.28059536934950385,
 'p_micro': 0.34683954619124796, 'ridehail': 0.6931818181818182, 's_car': 0.28604651162790695, 
's_micro': 0.5, 'transit': 0.43573667711598746, 'walk': 0.396505376344086}

Compared to the previous performance:

{'Not a Trip': 1.0, 'Other': 0.6079136690647482, 'Unlabeled': 0.4606741573033708, 
'car': 0.2076271186440678, 'ebike': 0.6923076923076923, 'no_travel': 0.16088154269972452, 
'p_micro': 0.27228525121555913, 'ridehail': 0.42045454545454547, 's_car': 0.22531939605110338, 
's_micro': 0.16666666666666666, 'transit': 0.25391849529780564, 'walk': 0.3602150537634409}

There is an improvement in every class' accuracy.

rahulkulhalli commented 8 months ago

The random forest's performance also increased slightly!

Confusion matrix:

The new per-class accuracies:

{'Not a Trip': 1.0, 'Other': 0.7906137184115524, 'Unlabeled': 0.7116104868913857, 
'car': 0.847392963119966, 'ebike': 0.6153846153846154, 'no_travel': 0.8379272326350606, 
'p_micro': 0.7909238249594813, 'ridehail': 0.6477272727272727, 's_car': 0.6337209302325582, 
's_micro': 0.3333333333333333, 'transit': 0.6394984326018809, 'walk': 0.7043010752688172}

As compared to the previous per-class performance:

{'Not a Trip': 1.0, 'Other': 0.6942446043165468, 'Unlabeled': 0.6179775280898876, 
'car': 0.7974576271186441, 'ebike': 0.46153846153846156, 'no_travel': 0.7851239669421488, 
'p_micro': 0.7358184764991896, 'ridehail': 0.6363636363636364, 's_car': 0.5598141695702671, 
's_micro': 0.16666666666666666, 'transit': 0.6050156739811913, 'walk': 0.6666666666666666}

A consistent increase in accuracies is noted her as well.

rahulkulhalli commented 8 months ago

Adding weather data improves the random forest's test F1 score by another 2%, which takes it to 80%!

The updated CM:

And the updated per-class test performance:

{'Not a Trip': 1.0, 'Other': 0.9028776978417267, 'Unlabeled': 0.846441947565543, 
'car': 0.8732513777024162, 'ebike': 0.46153846153846156, 'no_travel': 0.86438809261301, 
'p_micro': 0.7909238249594813, 'ridehail': 0.6590909090909091, 's_car': 0.6216530849825378, 
's_micro': 0.25, 'transit': 0.6394984326018809, 'walk': 0.7701612903225806}

Other jumps from 0.79 -> 0.9, Unlabeled improves from 0.71 -> 0.84, there is a slight improvement in Car, but the ebike and s_micro performance drastically reduces from 0.61 -> 0.46 and 0.33 -> 0.25 respectively. Walk gains a slight performance boost, going from 0.7 -> 0.79

I can see that the model may be improving in overall F1, but we aren't uniformly improving - the improvement in some labels is coming at the cost of a decrease in performance in some other labels. The train-test splits used across both experiments were the same, and the only thing that was added was the weather attributes:

['temperature_2m (°C)', 'relativehumidity_2m (%)', 'dewpoint_2m (°C)', 'rain (mm)',
    'snowfall (cm)', 'cloudcover (%)', 'windspeed_10m (km/h)']

According to RF's feature importance, dew point, temperature, and relative humidity are influential in the model's decision-making process. So what would happen if I remove those features and re-train the model?

rahulkulhalli commented 8 months ago

Okay, so my intuition was somewhat on the right track - removing the unnecessary weather attributes not only maintained the F1 score, but also boosts the performance of the affected labels. 💪

{'Not a Trip': 1.0, 'Other': 0.8992805755395683, 'Unlabeled': 0.8352059925093633, 
'car': 0.8753709198813057, 'ebike': 0.3076923076923077, 'no_travel': 0.8638368246968027, 
'p_micro': 0.7974068071312804, 'ridehail': 0.6420454545454546, 's_car': 0.6181606519208381, 
's_micro': 0.25, 'transit': 0.6394984326018809, 'walk': 0.760752688172043}

s_micro still stays at 0.25, but all the other affected labels seem to now be stable.

rahulkulhalli commented 8 months ago

In addition to the above-mentioned weather variables, I dropped some more variables that were the least important to the model:

total_feature_set.remove('is_overnight_trip')
total_feature_set.remove('is_male')
total_feature_set.remove('start:is_weekend')
total_feature_set.remove('end:is_weekend'))

New label performances:

{'Not a Trip': 1.0,
 'Other': 0.8992805755395683,
 'Unlabeled': 0.8426966292134831,
 'car': 0.8732513777024162,
 'ebike': 0.38461538461538464,
 'no_travel': 0.8627342888643881,
 'p_micro': 0.8022690437601296,
 'ridehail': 0.6647727272727273,
 's_car': 0.6204889406286379,
 's_micro': 0.25,
 'transit': 0.6489028213166145,
 'walk': 0.7620967741935484}

shankari commented 8 months ago

Couple of high-level comments:

For alternate modes, I figured that it'd be a good idea to start investigating the available_modes attribute. I faced two challenges:

This may be fine for the random forest model, but I am not sure you are using the alternate modes correctly for the logistic regression model. It is not 100% clear from the description above how incorporate the alternate modes into the feature set. Can you clarify? What are the coefficients of the logistic regression model, for example?

but the ebike and s_micro performance drastically reduces from 0.61 -> 0.46 and 0.33 -> 0.25 respectively. Walk gains a slight performance boost, going from 0.7 -> 0.79

The number of labels for s_micro as a replaced mode is likely very small (you should verify). I am not sure that you will ever get a great result for it. Does combining small% labels into something like "other" help? With random forest, I am not sure that it will, since then the "other" labels will just have more complex rules for their prediction. But it may help with logistic regression.

That brings up a higher-level question on the use of this model. As I am sure you know, we deploy OpenPATH in multiple locations. The dataset you are using is from only one of those locations. Since this is a behavior/choice model and not a sensor-based model, we need to think through how we plan to train and deploy this model for different locations.

rahulkulhalli commented 8 months ago

Today, I plan on restructuring the pipeline for ease of reproducibility. Specifically, I will add documentation and comments wherever necessary and commit the code to my forked repository, allowing for ease of review.

shankari commented 8 months ago

@rahulkulhalli can we address the first two questions above before restructuring and committing?

rahulkulhalli commented 8 months ago

Definitely, Dr. Shankari. I am formulating my responses to your comments right now. I will not start restructuring without receiving a go-ahead from you.

rahulkulhalli commented 8 months ago

This may be fine for the random forest model, but I am not sure you are using the alternate modes correctly for the logistic regression model. It is not 100% clear from the description above how incorporate the alternate modes into the feature set. Can you clarify? What are the coefficients of the logistic regression model, for example?

Definitely, Dr. Shankari. I use the available_modes feature for determining alternate modes. I will try and explain using some code and my rationale behind the implementation:


mode_mapping = {
    'Public transportation (bus, subway, light rail, etc.)': 'transit',
    'Get a ride from a friend or family member': 's_car',
    'Bicycle': 'p_micro',
    'Walk/roll': 'walk',
    'Taxi (regular taxi, Uber, Lyft, etc)': 'ridehail',
    'None': 'no_travel',
    'Shared bicycle or scooter': 's_micro',
    'Rental car (including Zipcar/ Car2Go)': 'car',
    'Skateboard': 'p_micro',
    'Do not have vehicle ': 'no_travel'
}

First, I map the available modes into our target feature labels. There are some reasons why I don't directly map from the available mode to the section mode:

Public transport is a combined term (bus OR train), whereas we have separate train and bus section modes.
"Taxi" and "Skateboard" cannot be mapped to any of our sensed modes directly.
In my opinion, "Do not have vehicle" and "no_sensed" do not share the same meaning.
The target label space has a much richer

Instead, what I chose to do was to map both the section modes as well as the available modes into the same label space as the target labels. This allows for a closer mapping for the available modes.

section_mapping = {
    'car': 'car',
    'walking': 'walk',
    'no_sensed': 'no_travel',
    'bicycling': 'p_micro',
    'train': 'transit',
    'bus': 'transit'
}

Similarly, the argmax-ed section modes are also mapped to the target label space.

Once both the features are mapped to a normalized space, I remove the current argmax-ed mapped feature from the mapped available modes and return them as a binary feature vector.

As a concrete example,

argmax_section_mode = "car"
mapped_section_mode = ["car"]
available_modes = ['Public transportation (bus, subway, light rail, etc.)', 'Get a ride from a friend or family member', 'Shared bicycle or scooter', 'Walk/Roll', 'Taxi (regular taxi, Uber, Lyft, etc)']
mapped_available_modes = ['transit', 'walk', 's_micro', 'ridehail']

# Remove the mapped_section_mode from the mapped_available_modes. What we are left with would be the modes that this user would use if their current mode was not available.

# Convert the mapped modes to a feature vector, where 1 indicates the presence of a mode and 0 indicates the absence.
An example of how this vector would look like: [0, 1, 0, 1, 1, 0, 0, 0, 0]

rahulkulhalli commented 8 months ago

The number of labels for s_micro as a replaced mode is likely very small (you should verify). I am not sure that you will ever get a great result for it. Does combining small% labels into something like "other" help? With random forest, I am not sure that it will, since then the "other" labels will just have more complex rules for their prediction. But it may help with logistic regression.

Yes, I agree. Not a trip, s_micro, and ebike have the least occurrences in the dataset. I will try combining it with them with Other and check the logistic model's performance and report on what I observe in the parameters as well as the per-class performance.

shankari commented 8 months ago

Definitely, Dr. Shankari. I use the available_modes feature for determining alternate modes. I will try and explain using some code and my rationale behind the implementation:

I saw this in the previous commits as well (https://github.com/e-mission/e-mission-docs/issues/978#issuecomment-1759726679). What I want to know is how you are using them in the features after the mapping.

rahulkulhalli commented 8 months ago

After training the new logistic model, the performance goes up slightly.

To find which model parameter is the most sensitive, we could try perturbing the inputs of each feature and measure the difference In performance (score_with_perturbation - score_without_perturbation). The highest sensitivity might be the feature that the model deems more significant.

e-mission / e-mission-docs

Predicting replaced mode for trips with no inferred labels #978