e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Predicting replaced mode for trips with no inferred labels #978

Open rahulkulhalli opened 1 year ago

rahulkulhalli commented 1 year ago

Creating this issue to document my observations, readings, and development efforts towards building a solution for predicting the replaced mode in the absence of inferred labels.

rahulkulhalli commented 1 year ago

Just cloned Zack's public-dashboard fork (replacement model branch). Surveying the notebooks right now.

(Edit) Link to repo: https://github.com/zackAemmer/em-public-dashboard/tree/replacement-model

rahulkulhalli commented 1 year ago

So I'm starting with the replaced_model_data_processing notebook. In it, the author loads two dictionaries. What are they? Following the trail leads to the mapping_dictionaries notebook. Printing the contents of re_dict gives:

{'drove_alone': 'Gas Car, drove alone', 'e_car_drove_alone': 'E-car, drove alone', 'work_vehicle': 'Gas Car, drove alone', 'bus': 'Bus', 'train': 'Train', 'free_shuttle': 'Free Shuttle', 'train,_bus and walk': 'Train', 'train_and pilot e-bike': 'Train', 'taxi': 'Taxi/Uber/Lyft', 'friend_picked me up': 'Gas Car, with others', 'carpool_w/ friend to work': 'Gas Car, with others', 'friend_carpool to work': 'Gas Car, with others', 'carpool_to work': 'Gas Car, with others', 'friend/co_worker carpool': 'Gas Car, with others', 'carpool_to lunch': 'Gas Car, with others', 'carpool': 'Gas Car, with others', 'carpool_for lunch': 'Gas Car, with others', 'carpool_lunch': 'Gas Car, with others', 'shared_ride': 'Gas Car, with others', 'e_car_shared_ride': 'E-car, with others', 'bikeshare': 'Bikeshare', 'scootershare': 'Scooter share', 'pilot_ebike': 'E-bike', 'e-bike': 'E-bike', 'walk': 'Walk', 'skateboard': 'Skate board', 'bike': 'Regular Bike', 'the_friend who drives us to work was running errands after the shift before dropping me off. not a trip of mine.': 'Not a Trip', 'not_a_trip': 'Not a Trip', 'no_travel': 'No Travel', 'same_mode': 'Same Mode'}

Seems like the keys are the replaced modes and the values are the cleaned modes. Straightforward enough for now.

rahulkulhalli commented 1 year ago

@shankari The notebook uses a file called "Can Do Colorado eBike Program". Seems to incorporate some socio-economic data in the model as well. It's used in the following way:

socio_data = pd.read_csv('./Can Do Colorado eBike Program - en.csv')
socio_data.rename(columns={
    'Unique User ID (auto-filled, do not edit)': 'user_id',
    'Please identify which category represents your total household income, before taxes, for last year.': 'HHINC',
    'How many motor vehicles are owned, leased, or available for regular use by the people who currently live in your household?': 'VEH',
    'In which year were you born?': 'AGE',
    'Including yourself, how many people live in your home?': 'HHSIZE',
    'How many children under age 18 live in your home?': 'CHILDREN',
    'What is your gender?': 'GENDER',
    'If you were unable to use your household vehicle(s), which of the following options would be available to you to get you from place to place?': 'available_modes',
    'Are you a student?': 'STUDENT',
    "Including yourself, how many people have a driver's license in your household?": 'DRIVERS',
}, inplace=True)

I can't see this file anywhere in the repo. Would you happen to have it?

shankari commented 1 year ago

The file is part of the dataset. As you can see from the columns, it contains user specific information and should not be checked in as part of the repo. It will also be different for different programs, which is another reason why it should not be checked in to the repo.

rahulkulhalli commented 1 year ago

I was inspecting this code block:


if "mode_confirm" in expanded_ct.columns:
    expanded_ct['Mode_confirm'] = expanded_ct['mode_confirm'].map(dic_re)
if study_type == 'program':
    # CASE 2 of https://github.com/e-mission/em-public-dashboard/issues/69#issuecomment-1256835867
    if 'replaced_mode' in expanded_ct.columns:
        expanded_ct['Replaced_mode'] = expanded_ct['replaced_mode'].map(dic_re)
    else:
        print("This is a program, but no replaced modes found. Likely cold start case. Ignoring replaced mode mapping")
else:
    print("This is a study, not expecting any replaced modes.")

@shankari, are all the non-null instances of 'replaced_mode' labeled by the users?

This is the count distribution for the existing replaced modes in the dataset:

Image

Is the "Unlabeled" category the one we intend to focus on?

shankari commented 1 year ago

@rahulkulhalli,

are all the non-null instances of 'replaced_mode' labeled by the users?

Is the "Unlabeled" category the one we intend to focus on?

Somewhat. Note that the "Unlabeled" category can also have inferred labels from the user models. You may have to change the data pre-processing to fix this (again, don't rely too much on this formulation; it is a starting point, not something that I have reviewed).

What we really, really want to focus on are trips with no inferred labels, or inferred replaced modes that have a low probability/confidence (e.g. under 0.25). The inferred replaced modes come from the user-specific models.

We can also consider trying to handle unlabeled trips that have a high confidence inferred replaced mode from the user-specific model, but use this more generic model as an ensemble with it. But that is a next-level enhancement.

However, we cannot use the current replaced mode labels as a feature since they are already used in the user-specific models. We cannot use any of the user labels (mode_confirm, purpose_confirm or replaced_mode) because they are the Y labels in the training set for the user-specific models.

rahulkulhalli commented 1 year ago

Understood.

It seems that the author is replacing all the NaNs in replaced_mode with "Unlabeled".


expanded_ct.replaced_mode.isna().sum() # prints 3268

# Replace the null instances with "Unlabeled"
expanded_ct['replaced_mode'] = expanded_ct['replaced_mode'].fillna('Unlabeled')
expanded_ct.loc[expanded_ct['replaced_mode'] == 'Unlabeled', 'Replaced_mode'] = "Unlabeled"

The value counts above also show 3268 values corresponding to Unlabeled.
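As a quick sanity check, the fill can be verified against the pre-fill NaN count (a sketch with a toy frame; `expanded_ct` here is a stand-in, not the real dataset):

```python
import pandas as pd

# Toy stand-in for expanded_ct; the real frame has 3268 NaNs.
expanded_ct = pd.DataFrame({"replaced_mode": ["walk", None, "bike", None]})

n_missing = expanded_ct["replaced_mode"].isna().sum()
expanded_ct["replaced_mode"] = expanded_ct["replaced_mode"].fillna("Unlabeled")

# Every previously-missing value should now be "Unlabeled".
assert (expanded_ct["replaced_mode"] == "Unlabeled").sum() == n_missing
```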

rahulkulhalli commented 1 year ago

@shankari Have we thought about incorporating weather data into our predictions? We could analyze the correlation between weather conditions and the replaced modes. For instance, if it were raining, I would opt for a Gas Car instead of walking to my destination.

shankari commented 1 year ago

We have discussed weather from time to time but it has not been incorporated yet, primarily due to the lack of a local archived dataset that we could find.

The notebook that you see is everything that we have done in this space. Feel free to experiment with weather and any other features that you can find a good historical dataset for.

rahulkulhalli commented 1 year ago

I know of a very reliable open-source weather API platform: https://open-meteo.com/en/docs/historical-weather-api

I will try to incorporate this info with our data and see what results we may get.

Follow-up question: OpenMeteo provides historical weather data for a particular lat-lng as well. Just confirming whether it is an acceptable parameter to pass to the API and doesn't violate our data privacy laws.

shankari commented 1 year ago

we should check their privacy policy and whether they log incoming requests. I am not sure that the lat/lon is a lot more accurate than just the city level weather, and the city-level weather also gives us a lot more opportunity for caching data and not abusing their API on production. So my vote is to start with city-level.
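A simple way to get that caching behavior is to key an on-disk cache by (city, date range). The sketch below is hypothetical: the `fetch` callable stands in for whatever Open-Meteo client wrapper we end up writing, which also keeps the cache logic testable without network access.

```python
import json
from pathlib import Path

def get_city_weather(city, start, end, fetch, cache_dir="weather_cache"):
    """Return hourly weather for a city/date-range, calling `fetch`
    (a callable (city, start, end) -> dict) at most once and caching
    the JSON response on disk for subsequent calls."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = cache / f"{city}_{start}_{end}.json"
    if key.exists():
        return json.loads(key.read_text())
    data = fetch(city, start, end)
    key.write_text(json.dumps(data))
    return data
```

With city-level granularity, one cache entry covers every trip in a program, so production would hit the API once per deployment rather than once per trip.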

rahulkulhalli commented 1 year ago

That's a great point - I also vote for starting with city-level weather info. I shall start writing a small notebook/script to join the weather data with our existing data.
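The join itself could be as simple as flooring each trip's start time to the hour and merging against the hourly weather table. A sketch (column names are illustrative, not the actual schema):

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2],
    "start_ts": pd.to_datetime(["2021-06-01 08:40", "2021-06-01 17:05"]),
})
weather = pd.DataFrame({
    "time": pd.to_datetime(["2021-06-01 08:00", "2021-06-01 17:00"]),
    "temperature_2m": [18.5, 27.0],
})

# Floor each trip start to the hour, then left-join the hourly weather row.
trips["hour"] = trips["start_ts"].dt.floor("h")
merged = trips.merge(weather, left_on="hour", right_on="time", how="left")
```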

rahulkulhalli commented 1 year ago

Started with forking Zack's code:

https://github.com/rahulkulhalli/em-public-dashboard

rahulkulhalli commented 1 year ago

I've narrowed down the following historical weather variables that may be viable for analysis:

Image

Reasoning behind the choices:

OpenMeteo also offers variables like solar radiation and AQI, so I think we can experiment with these additional variables as well.

rahulkulhalli commented 1 year ago

As expected, I get the following error when trying to retrieve data from the OpenMeteo API endpoint:

requests.exceptions.SSLError: HTTPSConnectionPool(host='archive-api.open-meteo.com', port=443): Max retries exceeded with url: /v1/archive?latitude=39.7392&longitude=-104.9847&start_date=08%2F15%2F2016&end_date=12%2F31%2F2022&hourly=temperature_2m%2Crelativehumidity_2m%2Cdewpoint_2m%2Cprecipitation%2Ccloudcover%2Cwindspeed_10m%2Cwindgusts_10m (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)')))

I'm looking into the feasibility of using this to pass SSL validation.

rahulkulhalli commented 1 year ago

Update: Solved. I didn't implement any SSL workarounds but directly downloaded the CSV file from OpenMeteo instead. 🧠

rahulkulhalli commented 1 year ago

Attempting to join the weather data with our dataset. Before merging, I decided to check how the start_loc was distributed:

Image

Most of the points are concentrated in North America, but they are spread all across the country. I've downloaded Denver-specific weather data. I can either:

  1. Concentrate solely on the Denver data for the POC and expand if the results look promising, or
  2. Account for the spread and download weather data for the observations in the other states/countries as well.

Image

rahulkulhalli commented 1 year ago

Clocking out for the day. I've decided to focus only on the Colorado data points.

I've finished merging the travel data and the weather data and I'm currently checking for visually discernible relationships between the new features and the target variable.

image

This plot shows us the distribution of the replaced modes w.r.t. temperature (in C). I will look at this plot in-depth tomorrow and post my observations here.

shankari commented 1 year ago

What is the current list of features?

rahulkulhalli commented 1 year ago

@shankari I'm still in the feature exploration phase, so I haven't finalized the list of features for primary modeling.

That being said, the current list of weather parameters are:

['temperature_2m (°C)', 'relativehumidity_2m (%)', 'dewpoint_2m (°C)', 
'precipitation (mm)', 'cloudcover (%)', 'windspeed_10m (km/h)', 
'windgusts_10m (km/h)']

Also, we may include some AQI parameters as well. OpenMeteo does support AQI; unfortunately, however, the support started in 2021. There is an open-source service called OpenAQ that provides historical AQIs given a generic location name.

rahulkulhalli commented 1 year ago


Commenting on the box-plot above:

rahulkulhalli commented 1 year ago

I also have some observations about the new features' inter-correlation:

image
rahulkulhalli commented 1 year ago

Some interesting density plots:

image
rahulkulhalli commented 1 year ago

Part deux:

image
rahulkulhalli commented 1 year ago

Some inferences that I draw from the distribution plots above:

shankari commented 1 year ago

Ideal traveling conditions seem to be during 10-30 km/h wind gusts

This is not very intuitive. I suspect it is a bad quality input.

rahulkulhalli commented 1 year ago

Agreed. Again, the weather data that I've collected is what OpenMeteo points to when I input "Denver" (lat: 39.7392, lng: -104.9847), so the granularity might not be the best. We could:

shankari commented 1 year ago

I would vote for just dropping it from the feature list at this point, or converting it into a categorical variable with 2-3 bins (e.g. 0-30, 30-60). Intuitively, big wind gusts should affect the use of active transportation and maybe even the willingness to travel, but that might also be captured in the related wind speed variable.
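The binning suggestion is a one-liner with pd.cut; the bin edges below are just the 0-30 / 30-60 values floated above, not tuned thresholds:

```python
import pandas as pd

gusts = pd.Series([5.0, 28.0, 45.0, 72.0])  # wind gusts in km/h, illustrative

# Three coarse categories; intervals are right-closed, so 30.0 falls in "0-30".
gust_cat = pd.cut(gusts, bins=[0, 30, 60, float("inf")],
                  labels=["0-30", "30-60", "60+"])
```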

rahulkulhalli commented 1 year ago

I also vote for dropping the feature for now and later trying a round of modeling with the feature either included or binned.

Additionally, I like my idea of computing the cluster center of the geospatial data and using that abstracted location as the source of weather information. It shouldn't skew the pertinent information by much, and it might be a better representation of the demographic data.

The plot below is the cluster centroid as a function of the lat-lng input pairs.

image
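Since there is a single cluster, the centroid is just the mean of the coordinate pairs. A sketch with made-up Denver-area points (over a metro-sized region, the flat-earth mean is a reasonable approximation):

```python
import numpy as np

# Illustrative trip-start coordinates as (lat, lon) pairs.
coords = np.array([
    [39.74, -104.99],
    [39.70, -105.08],
    [39.76, -105.02],
])

# Mean over the rows gives one representative point for the weather lookup.
centroid = coords.mean(axis=0)
```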
rahulkulhalli commented 1 year ago

Update: Changed the weather info source so that it now originates from the red-colored cross above (after reverse geocoding, that place is Conifer, CO 80433). The weather data has changed a bit, but not to the point where it makes a drastic difference.

Leaving for class now. I have the following things on my plate after getting back:

  • Replace the precipitation variable with its individual components - rainfall and snow.
  • Remove wind gusts
  • Build a v0.0.1 model with start lat-lng, weather parameters, and travel times
  • Discuss with Jack tomorrow about psychological considerations for factors relevant for replaced mode predictions
rahulkulhalli commented 1 year ago

However, we cannot use the current replaced mode labels as a feature since they are already used in the user-specific models. We cannot use any of the user labels (mode_confirm, purpose_confirm or replaced_mode) because they are the Y labels in the training set for the user-specific models

@shankari With reference to the comment made above, is it valid to use section_modes and section_distances in the feature set?

image
shankari commented 1 year ago

@rahulkulhalli yes, sensor-based inferences are based on the location traces and don't use prior labels at all, so they are fair game. Note that, as we know from the discussions around MobilityNet, they may not be super accurate.

Build a v0.0.1 model with start lat-lng, weather parameters, and travel times

As we discussed earlier, the standard econometric mode choice models use time and cost.

rahulkulhalli commented 1 year ago

Update 1: Had a very interesting talk with @JGreenlee and discussed some important factors that might be worth considering when choosing a travel mode replacement. My notes:

1. Purpose and destination may be important factors to consider (since they could be factors to indicate anticipation). I informed Jack that we aren't using the purpose attribute.
2. An aggregate gas price/public transport fee should be enough because they haven't fluctuated by a lot over the years
3. Time of day is also very important
4. Traffic conditions are also important - e.g., a user may want to skip using the car at peak rush hour and opt for the bus instead.
rahulkulhalli commented 1 year ago

Update 2: I also just read this paper from students at UIC on travel mode choice modeling using econometric data. It describes their mode choice modeling approach using gradient-boosted trees and a neural network. More important for me was seeing what data they used - as it turns out, they use a publicly available dataset from CMAP (the Chicago Metropolitan Agency for Planning) and draw demographic information from it. These are some of the features they use:

image

While reading Zack's original analysis, I also noticed that we have some demographic information that could be used to model the cost aspect. I don't have access to this data, but I managed to find something similar to it here (thank you, Natalie!)

@shankari Is this the right dataset? If not, may I know where I can procure it from?

rahulkulhalli commented 1 year ago

As for Colorado traffic data, that is available here. It is publicly released by DRCOG (Denver Regional Council of Governments) and has 24-hour traffic count data from 2010.

shankari commented 1 year ago

@rahulkulhalli as I said during our conversation on Teams, mode choice modeling has an extensive literature in the transportation world. While @JGreenlee has a background in psychology, I don't think he has worked on travel behavior modeling before.

Using demographics is standard - please see my note on Teams when I suggested this project earlier.

Predict the replaced mode for trips with no inferred labels by trying to build a mode choice model using demographics

rahulkulhalli commented 1 year ago

@shankari Thank you for the clarification! Is this where I can find the demographic data from?

https://www.nrel.gov/transportation/secure-transportation-data/tsdc-2020-can-do-colorado-e-bike-pilot-program.html

rahulkulhalli commented 1 year ago

I am also pondering how section_modes and section_distances can be incorporated appropriately. My initial thought is to split a single row with the combined section_modes and section_distances into multiple rows with one section each.
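For what it's worth, pandas can do exactly that split: DataFrame.explode accepts a list of columns (since pandas 1.3) and keeps the paired lists aligned. A sketch with made-up section data:

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2],
    "section_modes": [["walking", "bus"], ["bicycling"]],
    "section_distances": [[300.0, 4200.0], [1800.0]],
})

# One output row per section; modes and distances stay index-aligned.
sections = trips.explode(["section_modes", "section_distances"],
                         ignore_index=True)
```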

shankari commented 1 year ago

if you want to use sensed modes, you can:

shankari commented 1 year ago

BTW, I noticed that you had already asked about demographics:

I can't see this file anywhere in the repo. Would you happen to have it?

I answered the question that was asked (it will not be in the repo) but not the one that was unasked (where is it, then?!). My apologies!

rahulkulhalli commented 1 year ago

@shankari, while parsing the demographics CSV, I can see that there are duplicate records for some users. Zack did the following:

socio_data = socio_data.sort_values(by=['user_id', 'Timestamp'])
socio_data.drop_duplicates(subset=['user_id'], keep='last', inplace=True)

He sorted the records by user_id and timestamp and kept only each user's most recent entry. Should I also follow this methodology, or is there a different strategy you'd like me to use?

rahulkulhalli commented 1 year ago

@shankari Zack also mapped the discrete hour_of_day and month_of_year variables onto a cyclical sine and cosine function. I see no problem with the choice of embedding, but I do have a question - is the survey timestamp indicative of the time at which the survey was captured? If so, does it make sense to include the survey timing info in the feature set? I ask this because I'm not sure how the time of survey makes intuitive sense in predicting what the replaced mode could be.

shankari commented 1 year ago

@rahulkulhalli please check the code carefully. I would assume that Zack mapped the hour_of_day and month_of_year of the trip as a model parameter. As we discussed, the demographic survey is a one-time request while installing the app/onboarding.
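For reference, the cyclical encoding maps an hour h onto the unit circle via sin/cos of 2πh/24, so that hour 23 and hour 0 end up adjacent rather than 23 apart:

```python
import numpy as np

def cyclical_encode(value, period):
    """Encode a cyclic discrete value (hour 0-23, month 1-12, ...)
    as a (sin, cos) pair on the unit circle."""
    angle = 2 * np.pi * np.asarray(value) / period
    return np.sin(angle), np.cos(angle)

hour_sin, hour_cos = cyclical_encode(np.arange(24), 24)
```

Under this encoding, the Euclidean distance between hours 23 and 0 is small, matching their actual temporal proximity.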

rahulkulhalli commented 1 year ago

Oops, I missed that. Thank you for clarifying!

rahulkulhalli commented 1 year ago

data.Replaced_mode = data.Replaced_mode.replace(
    ['Gas Car, drove alone',  'Gas Car, with others', 'Bikeshare', 'Scooter share',
    'Regular Bike', 'Skate board', 'Train', 'Free Shuttle', 'Bus', 'Walk',
    'Taxi/Uber/Lyft', 'E-bike', 'No Travel'],
    ['car', 's_car', 's_micro', 's_micro', 'p_micro', 'p_micro', 'transit', 'transit', 'transit',
    'walk', 'ridehail', 'ebike', 'no_travel']
)

@shankari Should I retain this mapping for the first baseline model? I'd like to clarify that I will NOT be using this feature in the independent feature list while training.

shankari commented 1 year ago

@rahulkulhalli what is your recommendation based on your reading of the mode choice model literature and the desired use case? What are the pros and cons of the different approaches? It would be great if you could make recommendations (backed up with justifications) that I could provide feedback on.

rahulkulhalli commented 1 year ago

Since I couldn't work yesterday, I decided to do some reading. Here's an excerpt of data and modeling choices used by other authors:

The authors compare mode choice transport models using an ANN and a multinomial logit method. The data includes demographic and socioeconomic characteristics (age, gender, household, car ownership, driver license, income, education level, travel time, and distance to destination). In this study, a multinomial logit model is used to understand the commuter's mode choice of {car, bus, and vanpool}.

Independent variables: Gender, Age, Education, Employment, Income, HH Size, Vehicle Ownership, Purpose of Trip, Travel distance, Travel time, Parking cost, Parking availability, Car price, Fuel price, Toll cost, Comfort of car, Number of transfers, Reliability, Bus frequency, Overall quality of bus service, Coverage, Affective motives, Instrumental motives, Symbolic motives. Target (dependent variable): Mode {car, bus}


What are my observations from this literature review?

My recommendations:

rahulkulhalli commented 1 year ago

These are the replaced_modes before mapping:

image

Some of these (such as pilot_ebike or golf_cart ) can easily be mapped to one of our predefined labels. However, what would instances like zip-line or time_spent on the clock at amazon be mapped to?

Also, I heard back from Bingrong - she is going to share the draft paper for replacement_mode prediction for my reference.

rahulkulhalli commented 1 year ago

These are the cost factors that have already been implemented:

image

So we're basically doing cost[section] = cost_factors_init[section] + (cost_factors[section] * distance[section])

However, in the previous implementation, the cost factors were most likely derived from either mode_confirm or replaced_mode. If I'm not mistaken, we're not supposed to use any of this information while creating our features. In that case, could we use section_modes and section_distances instead (remember what @shankari said above - if working at the trip level, take the maximum, or work at the section level)?