rahulkulhalli opened this issue 1 year ago
Just cloned Zack's public-dashboard fork (replacement model branch). Surveying the notebooks right now.
(Edit) Link to repo: https://github.com/zackAemmer/em-public-dashboard/tree/replacement-model
So I'm starting with the replaced_model_data_processing notebook. In it, the code author loads two dictionaries. What are these two dictionaries? Let's check. I'm following the trail and checking the mapping_dictionaries notebook. When I print the contents of re_dict, I get this:
{'drove_alone': 'Gas Car, drove alone', 'e_car_drove_alone': 'E-car, drove alone', 'work_vehicle': 'Gas Car, drove alone', 'bus': 'Bus', 'train': 'Train', 'free_shuttle': 'Free Shuttle', 'train,_bus and walk': 'Train', 'train_and pilot e-bike': 'Train', 'taxi': 'Taxi/Uber/Lyft', 'friend_picked me up': 'Gas Car, with others', 'carpool_w/ friend to work': 'Gas Car, with others', 'friend_carpool to work': 'Gas Car, with others', 'carpool_to work': 'Gas Car, with others', 'friend/co_worker carpool': 'Gas Car, with others', 'carpool_to lunch': 'Gas Car, with others', 'carpool': 'Gas Car, with others', 'carpool_for lunch': 'Gas Car, with others', 'carpool_lunch': 'Gas Car, with others', 'shared_ride': 'Gas Car, with others', 'e_car_shared_ride': 'E-car, with others', 'bikeshare': 'Bikeshare', 'scootershare': 'Scooter share', 'pilot_ebike': 'E-bike', 'e-bike': 'E-bike', 'walk': 'Walk', 'skateboard': 'Skate board', 'bike': 'Regular Bike', 'the_friend who drives us to work was running errands after the shift before dropping me off. not a trip of mine.': 'Not a Trip', 'not_a_trip': 'Not a Trip', 'no_travel': 'No Travel', 'same_mode': 'Same Mode'}
Seems like the keys are the replaced modes and the values are the cleaned modes. Straightforward enough for now.
@shankari The notebook uses a file called "Can Do Colorado eBike Program". Seems to incorporate some socio-economic data in the model as well. It's used in the following way:
socio_data = pd.read_csv('./Can Do Colorado eBike Program - en.csv')
socio_data.rename(columns={
    'Unique User ID (auto-filled, do not edit)': 'user_id',
    'Please identify which category represents your total household income, before taxes, for last year.': 'HHINC',
    'How many motor vehicles are owned, leased, or available for regular use by the people who currently live in your household?': 'VEH',
    'In which year were you born?': 'AGE',
    'Including yourself, how many people live in your home?': 'HHSIZE',
    'How many children under age 18 live in your home?': 'CHILDREN',
    'What is your gender?': 'GENDER',
    'If you were unable to use your household vehicle(s), which of the following options would be available to you to get you from place to place?': 'available_modes',
    'Are you a student?': 'STUDENT',
    "Including yourself, how many people have a driver's license in your household?": 'DRIVERS'
}, inplace=True)
I can't see this file anywhere in the repo. Would you happen to have it?
The file is part of the dataset. As you can see from the columns, it contains user-specific information and should not be checked in as part of the repo. It will also be different for different programs, which is another reason why it should not be checked in to the repo.
I was inspecting this code block:
if "mode_confirm" in expanded_ct.columns:
    expanded_ct['Mode_confirm'] = expanded_ct['mode_confirm'].map(dic_re)
    if study_type == 'program':
        # CASE 2 of https://github.com/e-mission/em-public-dashboard/issues/69#issuecomment-1256835867
        if 'replaced_mode' in expanded_ct.columns:
            expanded_ct['Replaced_mode'] = expanded_ct['replaced_mode'].map(dic_re)
        else:
            print("This is a program, but no replaced modes found. Likely cold start case. Ignoring replaced mode mapping")
    else:
        print("This is a study, not expecting any replaced modes.")
@shankari, are all the non-null instances of 'replaced_mode' labeled by the users?
This is the count distribution for the existing replaced modes in the dataset:
Is the "Unlabeled" category the one we intend to focus on?
@rahulkulhalli,
are all the non-null instances of 'replaced_mode' labeled by the users?
Is the "Unlabeled" category the one we intend to focus on?
Somewhat. Note that the "Unlabeled" category can also have inferred labels from the user models. You may have to change the data pre-processing to fix this (again, don't rely too much on this formulation; it is a starting point, not something that I have reviewed).
What we really want to focus on are trips with no inferred labels, or with inferred replaced modes that have a low probability/confidence (e.g. under 0.25). The inferred replaced modes come from the user-specific models.
We can also consider trying to handle unlabeled trips that have a high confidence inferred replaced mode from the user-specific model, but use this more generic model as an ensemble with it. But that is a next-level enhancement.
However, we cannot use the current replaced mode labels as a feature since they are already used in the user-specific models. We cannot use any of the user labels (mode_confirm, purpose_confirm, or replaced_mode) because they are the Y labels in the training set for the user-specific models.
Understood.
It seems that the author is replacing all the NaNs in replaced_mode with "Unlabeled".
expanded_ct.replaced_mode.isna().sum() # prints 3268
# Replace the null instances with "Unlabeled"
expanded_ct['replaced_mode'] = expanded_ct['replaced_mode'].fillna('Unlabeled')
expanded_ct.loc[expanded_ct['replaced_mode'] == 'Unlabeled', 'Replaced_mode'] = "Unlabeled"
The value counts above also show 3268 values corresponding to Unlabeled.
@shankari Have we thought about incorporating weather data in our predictions? We could analyze what the correlation between the weather conditions and the replaced modes are. For instance, if it was raining, I would opt for a Gas Car instead of walking to my destination.
We have discussed weather from time to time but it has not been incorporated yet, primarily due to the lack of a local archived dataset that we could find.
The notebook that you see is everything that we have done in this space. Feel free to experiment with weather and any other features that you can find a good historical dataset for.
I know of a very reliable open-source weather API platform: https://open-meteo.com/en/docs/historical-weather-api
I will try to incorporate this info with our data and see what results we may get.
Follow-up question: OpenMeteo provides historical weather data for a particular lat-lng as well. Just confirming whether it is an acceptable parameter to pass to the API and doesn't violate our data privacy laws.
we should check their privacy policy and whether they log incoming requests. I am not sure that the lat/lon is a lot more accurate than just the city level weather, and the city-level weather also gives us a lot more opportunity for caching data and not abusing their API on production. So my vote is to start with city-level.
That's a great point - I also vote on starting with city-level weather info. I shall start writing a small notebook/script to join weather data with our existing data.
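A minimal sketch of what a cached, city-level request could look like, assuming the archive-api.open-meteo.com/v1/archive endpoint (the same one that shows up in the error traceback below); the helper name and the idea of caching one response per city are my own, not existing code:

```python
from urllib.parse import urlencode

ARCHIVE_BASE = "https://archive-api.open-meteo.com/v1/archive"

def build_archive_url(lat, lon, start_date, end_date, hourly_vars):
    """Build one historical-weather request URL.

    Dates are ISO-formatted (YYYY-MM-DD); hourly_vars are variable
    names such as 'temperature_2m' or 'precipitation'.
    """
    params = {
        "latitude": f"{lat:.4f}",
        "longitude": f"{lon:.4f}",
        "start_date": start_date,
        "end_date": end_date,
        "hourly": ",".join(hourly_vars),
    }
    return f"{ARCHIVE_BASE}?{urlencode(params)}"

# One request per city (e.g. Denver), with the response cached to
# disk, rather than one request per trip location.
url = build_archive_url(39.7392, -104.9847, "2016-08-15", "2022-12-31",
                        ["temperature_2m", "precipitation"])
```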
Started with forking Zack's code:
I've narrowed down the following historical weather variables that may be viable for analysis:
Reasoning behind the choices:
Since OpenMeteo also offers variables like solar radiation and AQI, I think we can also experiment with these additional variables.
As expected, I get the following error while trying to retrieve data from the OpenMeteo API endpoint:
requests.exceptions.SSLError: HTTPSConnectionPool(host='archive-api.open-meteo.com', port=443): Max retries exceeded with url: /v1/archive?latitude=39.7392&longitude=-104.9847&start_date=08%2F15%2F2016&end_date=12%2F31%2F2022&hourly=temperature_2m%2Crelativehumidity_2m%2Cdewpoint_2m%2Cprecipitation%2Ccloudcover%2Cwindspeed_10m%2Cwindgusts_10m (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)')))
I'm looking into the feasibility of using this to pass SSL validation.
Update: Solved. I didn't implement any SSL workarounds; I directly downloaded the CSV file from OpenMeteo instead.
Attempting to join the weather data with our dataset. Before merging, I decided to check how the start_loc was distributed:
Most of the points are concentrated in North America, but the distribution spans the whole country. I've downloaded Denver-specific weather data. I can either:
Clocking out for the day. I've decided to focus only on the Colorado data points.
I've finished merging the travel data and the weather data and I'm currently checking for visually discernible relationships between the new features and the target variable.
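The join itself can be sketched roughly like this, with toy stand-ins for the trip table and the Open-Meteo hourly CSV (the column names here are assumptions, not the actual schema):

```python
import pandas as pd

# Toy stand-ins: `trips` for the confirmed-trips table, `weather`
# for the Open-Meteo hourly CSV.
trips = pd.DataFrame({
    "start_ts": pd.to_datetime(["2021-06-01 08:14", "2021-06-01 17:50"]),
    "mode_confirm": ["bike", "drove_alone"],
})
weather = pd.DataFrame({
    "time": pd.date_range("2021-06-01", periods=24, freq="h"),
    "temperature_2m (°C)": range(24),
})

# Round each trip start down to the hour, then left-join so every
# trip picks up the matching hourly weather row.
trips["start_hour"] = trips["start_ts"].dt.floor("h")
merged = trips.merge(weather, left_on="start_hour", right_on="time",
                     how="left")
```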
This plot shows us the distribution of the replaced modes w.r.t. temperature (in C). I will look at this plot in-depth tomorrow and post my observations here.
What is the current list of features?
@shankari I'm still in the feature exploration phase, so I haven't finalized the list of features for primary modeling.
That being said, the current list of weather parameters are:
['temperature_2m (°C)', 'relativehumidity_2m (%)', 'dewpoint_2m (°C)',
'precipitation (mm)', 'cloudcover (%)', 'windspeed_10m (km/h)',
'windgusts_10m (km/h)']
Also, we may include some AQI parameters. OpenMeteo does support AQI; unfortunately, that support only starts in 2021. There is an open-source service called OpenAQ that provides historical AQI given a generic location name.
Commenting on the box-plot above:
I also have some observations about the new features' inter-correlation:
Some interesting density plots:
Part deux:
Some inferences that I draw from the distribution plots above:
Ideal traveling conditions seem to be during 10-30 km/h wind gusts
This is not very intuitive. I suspect it is a bad quality input.
Agreed. Again, the weather data that I've collected is what OpenMeteo points to when I input "Denver" (lat: 39.7392, lng: -104.9847), so the granularity might not be the best. We could:
I would vote for just dropping it from the feature list at this point, or converting it into a categorical variable with 2-3 bins (e.g. 0-30, 30-60). Intuitively, big wind gusts should affect the use of active transportation and maybe even willingness to travel, but that might also be captured in the related wind speed variable.
I also vote for dropping the feature for now and later trying a round of modeling with the feature either included or binned.
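For reference, the binned variant could look like this (bin edges and labels are illustrative, not agreed-upon values):

```python
import pandas as pd

gusts = pd.Series([5.0, 22.0, 41.0, 75.0], name="windgusts_10m (km/h)")

# Two coarse bins as suggested (0-30, 30-60), plus an open-ended
# bin to catch extreme gusts.
gust_cat = pd.cut(gusts, bins=[0, 30, 60, float("inf")],
                  labels=["calm", "breezy", "stormy"])
```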
Additionally, I like the idea of computing the cluster center of the geospatial data and using that abstracted location as the source of weather information. It shouldn't skew the pertinent information by much, and it might be a better representation of the demographic data.
The plot below is the cluster centroid as a function of the lat-lng input pairs.
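A minimal sketch of the centroid computation, on toy coordinates rather than the actual trip data:

```python
import numpy as np

# Toy coordinates standing in for the trip start locations.
lats = np.array([39.74, 39.70, 39.76, 39.68])
lngs = np.array([-104.98, -105.10, -105.02, -105.21])

# For points confined to one metro area, the arithmetic mean is a
# reasonable stand-in for a cluster centroid (k-means with k=1
# converges to the same point).
centroid = (lats.mean(), lngs.mean())
```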
Update: Changed the weather info data so that it now originates from the red-colored cross above (after reverse geocoding, that place is Conifer, CO 80433). The weather data has changed a bit, but not to the point where it makes a drastic difference.
Leaving for class now. I have the following things on my plate after getting back:
- Replace the precipitation variable with its individual components: rainfall and snow
- Remove wind gusts
- Build a v0.0.1 model with start lat-lng, weather parameters, and travel times
- Discuss with Jack tomorrow about psychological considerations for factors relevant for replaced mode predictions
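A minimal sketch of what such a v0.0.1 model could look like, on entirely synthetic data (the feature set follows the plan above; the classifier choice and all values are placeholders, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-ins for the planned v0.0.1 features:
# start lat/lng, a few weather parameters, and travel time.
X = np.column_stack([
    rng.uniform(39.5, 40.0, n),      # start latitude
    rng.uniform(-105.3, -104.7, n),  # start longitude
    rng.uniform(-10, 35, n),         # temperature (°C)
    rng.uniform(0, 10, n),           # precipitation (mm)
    rng.uniform(60, 3600, n),        # travel time (s)
])
y = rng.choice(["car", "walk", "transit"], n)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
preds = clf.predict(X)
```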
However, we cannot use the current replaced mode labels as a feature since they are already used in the user-specific models. We cannot use any of the user labels (mode_confirm, purpose_confirm or replaced_mode) because they are the Y labels in the training set for the user-specific models
@shankari With reference to the comment made above, is it valid to use section_modes and section_distances in the feature set?
@rahulkulhalli yes, sensor-based inferences are based on the location traces and don't use prior labels at all, so they are fair game. Note that, as we know from the discussions around MobilityNet, they may not be super accurate.
Build a v0.0.1 model with start lat-lng, weather parameters, and travel times
As we discussed earlier, the standard econometric mode choice models use time and cost.
Update 1: Had a very interesting talk with @JGreenlee and discussed some important factors that might be worth considering when choosing a travel mode replacement. My notes:
1. Purpose and destination may be important factors to consider (since they could be factors to indicate anticipation). I informed Jack that we aren't using the purpose attribute.
2. An aggregate gas price/public transport fee should be enough because they haven't fluctuated by a lot over the years
3. Time of day is also very important
4. Traffic conditions are also important - e.g., a user may want to skip using a car at peak rush hour and opt for the bus instead.
Update 2: I also just read this paper from students at UIC on travel mode choice modeling using econometric data. It describes their mode choice modeling approach using gradient-boosted trees and a neural network. What was more important for me was to see what data they used - as it turns out, they use a publicly available dataset from CMAP (Chicago Metropolitan Agency for Planning) and use demographic information from it. These are some of the features that they use:
While reading Zack's original analysis, I also noticed that we have some demographic information that could be used to model the cost aspect. I don't have access to this data, but I managed to find something similar to it here (thank you, Natalie!)
@shankari Is this the right dataset? If not, may I know where I can procure it from?
As for Colorado traffic data, that is available here. It is publicly released by DRCOG (Denver Regional Council of Governments) and has 24-hour traffic count data from 2010.
@rahulkulhalli as I said during our conversation on Teams, mode choice modeling has an extensive literature in the transportation world. While @JGreenlee has a background in psychology, I don't think he has worked on travel behavior modeling before.
Using demographics is standard - please see my note on Teams when I suggested this project earlier.
Predict the replaced mode for trips with no inferred labels by trying to build a mode choice model using demographics
@shankari Thank you for the clarification! Is this where I can find the demographic data from?
I am also pondering how the section_modes and section_distances can be incorporated appropriately. My initial thought is to split a single row with the combined section_modes and section_distances into multiple rows with single entries.
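That split can be done with pandas' multi-column explode, e.g. (toy data; the actual tables may differ):

```python
import pandas as pd

# Toy trip table: each row holds parallel lists of per-section
# sensed modes and distances.
df = pd.DataFrame({
    "trip_id": [1, 2],
    "section_modes": [["walking", "bus"], ["bicycling"]],
    "section_distances": [[400.0, 3200.0], [1500.0]],
})

# Exploding both columns together turns each section into its own
# row while keeping the parallel lists aligned.
sections = df.explode(["section_modes", "section_distances"],
                      ignore_index=True)
```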
if you want to use sensed modes, you can:
BTW, I noticed that you had already asked about demographics:
I can't see this file anywhere in the repo. Would you happen to have it?
I answered the question asked (it will not be in the repo) but not the one that was unasked (where is it then?!) my apologies!
@shankari, while parsing the demographics CSV, I can see that there are duplicate records for some users. Zack did the following:
socio_data = socio_data.sort_values(by=['user_id', 'Timestamp'])
socio_data.drop_duplicates(subset=['user_id'], keep='last', inplace=True)
He sorted the users by ascending order of their timestamps and removed all but their last entry. Should I also follow this methodology or is there a different strategy you'd like me to use?
@shankari Zack also mapped the discrete hour_of_day and month_of_year variables onto a cyclical sine and cosine function. I see no problem with the choice of embedding, but I do have a question - is the survey timestamp indicative of the time at which the survey was captured? If so, does it make sense to include the survey timing info in the feature set? I ask this because I'm not sure how the time of survey makes intuitive sense in predicting what the replaced mode could be.
@rahulkulhalli please check the code carefully. I would assume that Zack mapped the hour_of_day and month_of_year of the trip as a model parameter. As we discussed, the demographic survey is a one-time request while installing the app/onboarding.
Oops, I missed that. Thank you for clarifying!
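For reference, the kind of cyclical encoding discussed above can be sketched as follows (my own helper, not Zack's code):

```python
import numpy as np

def cyclical_encode(values, period):
    """Map a cyclic variable (hour 0-23, month 1-12) onto the unit
    circle, so that e.g. hour 23 and hour 0 end up adjacent rather
    than maximally far apart."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hour_sin, hour_cos = cyclical_encode([0, 6, 12, 18], period=24)
month_sin, month_cos = cyclical_encode([1, 7], period=12)
```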
data.Replaced_mode = data.Replaced_mode.replace(
['Gas Car, drove alone', 'Gas Car, with others', 'Bikeshare', 'Scooter share',
'Regular Bike', 'Skate board', 'Train', 'Free Shuttle', 'Bus', 'Walk',
'Taxi/Uber/Lyft', 'E-bike', 'No Travel'],
['car', 's_car', 's_micro', 's_micro', 'p_micro', 'p_micro', 'transit', 'transit', 'transit',
'walk', 'ridehail', 'ebike', 'no_travel']
)
@shankari Should I retain this mapping for the first baseline model? I'd like to clarify that I will NOT be using this feature in the independent feature list while training.
@rahulkulhalli what is your recommendation based on your reading of the mode choice model literature and the desired use case? What are the pros and cons of the different approaches? It would be great if you could make recommendations (backed up with justifications) that I could provide feedback on.
Since I couldn't work yesterday, I decided to do some reading. Here's an excerpt of data and modeling choices used by other authors:
The authors compare mode choice transport models using an ANN and a multinomial logit method. The data includes demographic and socioeconomic characteristics (age, gender, household, car ownership, driver license, income, education level, travel time, and distance to destination). In this study, a multinomial logit model is used to understand the commuter's mode choice of {car, bus, and vanpool}.
Chee and Fernandez argue that factors such as gender, parking availability, reliability, and personal income do not affect mode choice
Independent variables: Gender, Age, Education Employment Income, HH Size, Vehicle Ownership, Purpose of Trip, Travel distance, Travel time, Parking cost, Parking availability, Car Price, Fuel price, Toll cost, Comfort of car, Number of transfers, Reliability, Bus frequency, overall quality of bus service, Coverage, Affective motives, Instrumental Motives, Symbolic motives. Target: Mode {car, bus}
Kashifi et al (https://www.sciencedirect.com/science/article/pii/S2214367X22000746) show that trip distance, travelers' age and annual income, number of cars/bicycles owned, and trip density play crucial roles in predicting mode choice for users. They use 4 travel modes, i.e., {walk, bike, public transit, and car}, as target labels. They try 5 types of ML models and find that LightGBM gives the best results. They also use SHAP values for interpretability.
Wang et al (https://journals.sagepub.com/doi/epub/10.1177/0361198118773556) use socio-econometric survey data collected from the Delaware Valley Regional Planning Commission. Target variables: {car (driver or passenger), biking, walking, or transit}. Interestingly, they use the Distance Matrix API from GMaps to query estimated travel time. They also use geographical features such as population density, land-use density, etc.
Li et al (https://www.researchgate.net/publication/354593912_Modeling_Intercity_Travel_Mode_Choice_with_Data_Balance_Changes_A_Comparative_Analysis_of_Bayesian_Logit_Model_and_Artificial_Neural_Networks) concern themselves with only public transport and use {airplane, HSR, bus, and train} as target labels. They use features such as gender, occupation, income, travel purpose, travel mode, travel time, and safety.
Bei et al (https://www.sciencedirect.com/science/article/pii/S2214367X23000765) propose an interesting multi-task DNN approach to jointly predict the purpose and commute mode of a trip. They use the {car or walk, cycling, car/van, and rail} targets. They incorporate socio-economic data obtained from the UK National Travel Survey (NTS) from 2005-2016.
There is some very interesting theory mentioned in Example 2 from the book titled "Self-instructed mode choice modeling" (https://tfresource.github.io/modechoice/the-multinomial-logit-model.html). However, I don't fully understand it yet. I may need your help understanding this, @shankari
The same course (https://tfresource.github.io/modechoice/estimation-chapter.html) also lays out fundamental variables that should be included whilst creating a mode choice model. Excerpt: There are six work mode choice alternatives: The drive alone mode is available for a trip only if the trip-maker’s household has a vehicle available and if the trip-maker has a driver’s license. The shared-ride modes (with 2 people and with 3 or more people) are available for all trips. Transit availability is determined based on the residence and work zones of individuals. The bike mode is deemed available if the one-way home-to-work distance is less than 12 miles, while the walk mode is considered to be available if the one-way home to work distance is less than 4 miles (the distance thresholds to determine bike and walk availability are determined based on the maximum one-way distance of bike and walk-users, respectively).
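The availability rules in that excerpt could be sketched as a simple helper (hypothetical function, my own naming; the transit rule, which depends on residence/work zones, is omitted):

```python
def available_modes(has_vehicle, has_license, one_way_miles):
    """Mode availability per the tfresource example: drive-alone needs
    a household vehicle and a license; shared ride is always available;
    bike is available under 12 one-way miles; walk under 4."""
    modes = ["shared_ride_2", "shared_ride_3p"]
    if has_vehicle and has_license:
        modes.append("drive_alone")
    if one_way_miles < 12:
        modes.append("bike")
    if one_way_miles < 4:
        modes.append("walk")
    return modes
```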
What are my observations from this literature review?
My recommendations:
['Gas Car, drove alone', 'Gas Car, with others', 'Bikeshare', 'Scooter share',
'Regular Bike', 'Skate board', 'Train', 'Free Shuttle', 'Bus', 'Walk',
'Taxi/Uber/Lyft', 'E-bike', 'No Travel'],
['car', 's_car', 's_micro', 's_micro', 'p_micro', 'p_micro', 'transit', 'transit', 'transit',
'walk', 'ridehail', 'ebike', 'no_travel']
)
These are the replaced_modes before mapping:
Some of these (such as pilot_ebike or golf_cart) can easily be mapped to one of our predefined labels. However, what would instances like zip-line or time_spent on the clock at amazon be mapped to?
Also, I heard back from Bingrong - she is going to share the draft paper for replacement_mode prediction for my reference.
These are the cost factors that have already been implemented:
So we're basically doing cost[section] = cost_factors_init[section] + (cost_factors[section] * distance[section])
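That formula can be written out as a small sketch (the per-mode values below are placeholders, not the factors actually implemented in the dashboard):

```python
# Illustrative per-mode cost parameters (placeholder values).
cost_factors_init = {"car": 0.0, "transit": 2.5}   # fixed cost per trip
cost_factors = {"car": 0.12, "transit": 0.0}       # cost per unit distance

def section_cost(mode, distance):
    """cost[section] = cost_factors_init[mode]
                       + cost_factors[mode] * distance[section]"""
    return cost_factors_init[mode] + cost_factors[mode] * distance
```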
However, in the previous implementation, the cost factors were most likely derived from either the mode_confirm or replaced_mode. If I'm not mistaken, we're not supposed to use any of this information while creating our features. In that case, could we use section_modes and section_distances (remember what @shankari said above - if working at the trip level, take the maximum, or work at the section level)?
Creating this issue to document my observations, readings, and development efforts towards building a solution for predicting the replaced mode in the absence of inferred labels.