WenZhang-Vivien / downstream_matrices_evaluations


Running Wen's notebook #1

Open rahulkulhalli opened 1 year ago

rahulkulhalli commented 1 year ago

Re-running Wen's notebook as a sanity check

rahulkulhalli commented 1 year ago

Observation 1: Wen's notebook uses plotly as an additional dependency. Observation 2: The diff_SD_plot method also uses kaleido, plotly's static image export engine.

Installed both. Maybe it would be better to document these in the README?

rahulkulhalli commented 1 year ago

So Wen's notebook runs without a hitch. Some additional setup instructions are definitely required in the README.

rahulkulhalli commented 1 year ago

Now to think aloud about a pertinent issue: the use of k-fold CV during inference.

I have personally never encountered k-fold CV being used during inference. If the intended use case is to estimate uncertainty in the predictions, this is generally achieved by keeping a dropout layer (with P(drop)=0.5) active at inference time and re-running the same instance through the model multiple times. This approach is called MC (Monte Carlo) Dropout.

However, I do not know how this approach would translate to a bagging (random forest) model.
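
For reference, a minimal sketch of the MC Dropout idea described above (not from this repo; the PyTorch architecture and all names are purely illustrative):

```python
import torch
import torch.nn as nn

# A toy model with a dropout layer; sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # the layer we keep active at inference time
    nn.Linear(64, 8),
)

def mc_dropout_predict(model, x, n_samples=50):
    # The key trick: model.train() keeps dropout active, so each forward
    # pass samples a different sub-network.
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # Mean across samples is the prediction; std is an uncertainty estimate.
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(1, 16)  # one hypothetical input instance
mean_pred, uncertainty = mc_dropout_predict(model, x)
```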

rahulkulhalli commented 1 year ago

```python
import pandas as pd

model_names = list(performance_eval.PREDICTORS.keys())

# Run k-fold CV for every registered algorithm (performance_eval and the
# data maps are defined earlier in the notebook).
cv_results = performance_eval.cv_for_all_algs(
    uuid_list=all_users,
    expanded_trip_df_map=expanded_labeled_trip_df_map,
    model_names=model_names,
    override_prior_runs=False,
    k=4,  # 4-fold
    raise_errors=False,
    random_state=42,
)

RFc_df = pd.DataFrame(cv_results['random forests (coordinates)'])

## We do some metric scaling here. Not important for this analysis.

# get validation_trips
validation_trips = RFc_df[RFc_df['dataset'] == 'validation_dataset']

# get test_trips
test_trips = RFc_df[RFc_df['dataset'] != 'validation_dataset']
```

rahulkulhalli commented 1 year ago

At this point, we're just splitting the inference data into two sets. What happens from here onwards?


```python
# Metadata addition and modification.
validation_trips = validation_trips.rename(columns={"mode_initial": "mode_confirm"})
# Infer the OS from the trip's segmentation source.
validation_trips['os'] = ['ios' if x == 'DwellSegmentationDistFilter' else 'android' for x in validation_trips['source']]

validation_trips['user_id'] = validation_trips['user_id'].astype(str)
# This is an important step. Note that the user_ids are all bson.binary.Binary objects saved as subtype 3. Why don't we use user_id.get_as_uuid() instead?
```
rahulkulhalli commented 1 year ago
Individual analysis ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/b0fab0bf-fab3-40c1-a84c-1c97f3aa3627)
shankari commented 1 year ago

> Observation 1: Wen's notebook uses plotly as an additional dependency. Observation 2: The diff_SD_plot method also uses kaleido, plotly's static image export engine. Installed both. Maybe it would be better to document these in the README?

We don't have random READMEs in which we tell people to install software. We should add the new packages with appropriate versions to the environment.yml file for this repo

rahulkulhalli commented 1 year ago

We create user-specific confusion matrices based on an attribute (distance/duration) using pd.crosstab (see the sketch below). Note to self: these confusion matrices are derived from the testing data.
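
Not the notebook's code, but a minimal sketch of such a crosstab-based confusion matrix; mode_true/mode_pred appear as column names later in the thread, while the distance column name is an assumption:

```python
import pandas as pd

# Trip-count confusion matrix: rows = true mode, columns = predicted mode.
cm_counts = pd.crosstab(test_trips['mode_true'], test_trips['mode_pred'])

# Attribute-weighted confusion matrix, e.g. summing trip distance instead
# of counting trips ('distance' is an assumed column name).
cm_distance = pd.crosstab(
    test_trips['mode_true'],
    test_trips['mode_pred'],
    values=test_trips['distance'],
    aggfunc='sum',
).fillna(0)
```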

rahulkulhalli commented 1 year ago

Ah, elt stands for Expanded Labeled Trips.

rahulkulhalli commented 1 year ago
```text
For every user in the validation split:
  For every trip that the user has taken:
    calculate the mean and variance of the carbon emission for the trip   # (A)
    calculate the mean and variance of the energy consumption for a single user-labeled trip   # (B)
    compute (A) - (B)
  compute sum(A) and sum(B)   # (C)
  compute the relative error using (C)
  compute sum(A) - sum(B)
  append all the trip-level info to a copy of the trip data frame
```
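
Not from the notebook: a rough pandas translation of the loop above, with hypothetical column names (the notebook's actual names will differ):

```python
import pandas as pd

# Hypothetical column names; (A) = confusion-matrix-derived ("expected")
# values, (B) = values implied by the user's own labels.
trips = validation_trips.copy()
user_summaries = []

for user_id, user_trips in trips.groupby('user_id'):
    diff = user_trips['expected_EC_mean'] - user_trips['user_labeled_EC_mean']  # (A) - (B)
    trips.loc[user_trips.index, 'trip_level_diff'] = diff  # trip-level info

    sum_a = user_trips['expected_EC_mean'].sum()      # sum(A)
    sum_b = user_trips['user_labeled_EC_mean'].sum()  # sum(B); together (C)
    user_summaries.append({
        'user_id': user_id,
        'abs_error': sum_a - sum_b,
        'relative_error': (sum_a - sum_b) / sum_b,  # relative error using (C)
    })

summary_df = pd.DataFrame(user_summaries)
```
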
shankari commented 1 year ago

> last thing that Wen was working on was 3 users with a full CM, high accuracy but ~10 variance difference. why?! why?!

This is what you should have answers for

rahulkulhalli commented 1 year ago
true modes distribution: ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/44314a09-3ad6-467a-8049-28adc8f62631)

predicted modes distribution: ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/5d0f5113-66b9-4daf-b37f-91a31b7141bc)

Let's investigate further...

shankari commented 1 year ago

so the e7b2... user doesn't have the same set of modes for training and test. And for the other two, the numbers don't seem that small. The second user (ending in 2227) seems like an easy one to focus on since they have only two modes 😄

rahulkulhalli commented 1 year ago
| user_id | dif_expected_user_laberd_mean | expected_mean | user_labeled_mean | all_mode_expected_SD_EC |
| --- | --- | --- | --- | --- |
| 405b221a-be9e-43bc-86a5-7ca7fccf2227 | 2.4891 | 333.592204 | 331.103104 | 153.282462 |

The expected mean value is very close to the actual labeled mean value. Investigating further...

rahulkulhalli commented 1 year ago
```python
PRIMARY_ID = "405b221a-be9e-43bc-86a5-7ca7fccf2227"
grouped_df_sorted.loc[grouped_df_sorted.user_id == PRIMARY_ID, ['confusion_var', 'user_var', 'confusion_sd', 'user_sd']]
```

| confusion_var | user_var | confusion_sd | user_sd |
| --- | --- | --- | --- |
| 6851.595876 | 6521.285375 | 159.467879 | 155.714236 |

rahulkulhalli commented 1 year ago
```python
grouped_df_sorted.loc[~grouped_df_sorted.user_id.isin([PRIMARY_ID]), ['confusion_var', 'user_var', 'confusion_sd', 'user_sd']].describe()
```

| | confusion_var | user_var | confusion_sd | user_sd |
| --- | --- | --- | --- | --- |
| count | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
| mean | 5378.488085 | 1750.903962 | 144.961449 | 79.474468 |
| std | 37009.279968 | 9068.731209 | 242.422639 | 149.453618 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 26.708845 | 4.996648 | 13.633379 | 5.267473 |
| 50% | 208.123117 | 67.134296 | 56.066512 | 25.076507 |
| 75% | 1285.588001 | 534.730039 | 167.227026 | 80.076458 |
| max | 458908.264690 | 104438.604075 | 1708.455132 | 1048.542778 |

rahulkulhalli commented 1 year ago

The readings seem to be in order. Sure, this user's user_var is higher than the mean across all other users, but the maximum value is 104438.6, which rules out the suspicion that this user's variance is an outlier. Investigating further...

shankari commented 1 year ago

I am not sure I understand what this is showing. We are getting the entries (aka trips?) for this user and computing stats over the set of trips. But I thought we were no longer looking at the var and sd of individual trips, per the method in "Count Every Trip" and/or multinomial trips. @allenmichael099 can you confirm?

rahulkulhalli commented 1 year ago

Clocking out for a bit in anticipation of bad network reception. Will resume the investigation upon reaching the hotel.

rahulkulhalli commented 1 year ago

Uploading some results from yesterday's analysis.

Label distribution for 405b221a-be9e-43bc-86a5-7ca7fccf2227 (mode_pred=predicted and mode_true=GT) ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/8fbe3019-30cf-4919-8eb6-fdeefe2725bb)
rahulkulhalli commented 1 year ago

Hmm, the distance and duration distributions across the val and test splits are a little off, but not by much. Some aspects of Wen's work are still unclear to me. Now checking some more feature distributions.

rahulkulhalli commented 1 year ago

Distance and duration histograms for all users other than the target user_id. Longitudinal features are aggregated by computing the mean of each feature.

image

image

rahulkulhalli commented 1 year ago

Distances and durations for 405b221a-be9e-43bc-86a5-7ca7fccf2227

image

image

rahulkulhalli commented 1 year ago

image

Number of trips made per user. The wider bar colored red is the target ID. Note that there are user_ids with fewer trips than the target.

rahulkulhalli commented 1 year ago
![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/f68b5e0b-b1c0-4d38-a3a7-5d54f265ca29) Target user's histogram for distance (validation split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/44fdbb4d-ac22-4b46-83b9-9d32e207c09e) Target user's histogram for distance (validation split) grouped by true mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/c9feb854-0100-4c6d-b6c9-3adf3ad6a5e7) Target user's histogram for duration (validation split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/ab7c0579-7a5f-4c28-8078-88fa78f0ed80) Target user's histogram for duration (validation split) grouped by true mode
rahulkulhalli commented 1 year ago
![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/cf5065a2-f44c-43df-8423-428de19c6e7d) Target user's histogram for distance (test split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/8b75ce0e-277f-4fe6-8ed4-4b7ec15b212a) Target user's histogram for distance (test split) grouped by true mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/1a547766-8f68-4438-8327-b2439951736e) Target user's histogram for duration (test split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/bc9e763f-11a9-436d-a8cb-5e5a8408205d) Target user's histogram for duration (test split) grouped by true mode

rahulkulhalli commented 1 year ago

Confusion matrix on testing data for 405b221a-be9e-43bc-86a5-7ca7fccf2227:

image

Confusion matrix on validation data for 405b221a-be9e-43bc-86a5-7ca7fccf2227:

image
allenmichael099 commented 1 year ago

Thanks for these @rahulkulhalli! Here are my thoughts:

image

allenmichael099 commented 1 year ago

Adding to my thoughts above: The difference in P(true | predicted) between test and validation will likely be larger for the other participants with more modes. With the user above, there are only 2 modes, so with the number of trips present we could maybe estimate the distributions well-ish (I don't think we estimated the "alone" column well, though).

With the other participants, the distribution estimates will be even less certain since there are more modes. Also, there could easily be cases where the proportion of total distance traveled in mode A in test differs from the proportion of total distance traveled in mode A in validation. A way to check this more easily than with two histograms would be to calculate and compare proportions: (true mode distance proportions in test) vs. (true mode distance proportions in validation), e.g. 0.1, 0.9 vs. 0.2, 0.8. A sketch of this check follows below.
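
Not from the thread: a minimal pandas sketch of this proportion check, assuming mode_true and distance columns on the test/validation trip dataframes:

```python
import pandas as pd

def mode_distance_proportions(df):
    # Share of total distance traveled in each true mode.
    totals = df.groupby('mode_true')['distance'].sum()
    return totals / totals.sum()

# Align the two series so a mode missing from one split shows up as 0.
comparison = pd.concat(
    [mode_distance_proportions(test_trips),
     mode_distance_proportions(validation_trips)],
    axis=1, keys=['test', 'validation'],
).fillna(0)
print(comparison)  # e.g. a row per mode: 0.1 vs 0.2
```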

shankari commented 1 year ago

> which immediately tells us we have a bad model of the underlying prediction making process (even though the predictions themselves work well in this context)

Note that this is not the sensed-mode predictions. This is (if Wen has implemented it correctly) label assist using a random forest model. Random forest has been well-studied in the literature for multiple years now. Both the training and the test datasets are drawn from the trips for the same user - you can't get much more representative than that.

I don't understand why we would have a bad model of the underlying prediction-making process in this case.

shankari commented 1 year ago

@rahulkulhalli I would also like to see interpretations from you, and exploratory data analysis driven by the results that you are seeing.

rahulkulhalli commented 1 year ago

@shankari Yes, I'm formulating my analyses. My initial thoughts:

image

The label counts indicate that the "Gas car, *" classes occupy a majority of mode instances in the dataset. After computing the share, I observe that these two modes account for 46.32% of the total mode counts (see the sketch below for the computation). In my opinion, since the user_id in question mostly has modes that occur with high frequency, we are overfitting to this user.
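
For transparency, roughly how such a share could be computed (a sketch; the dataframe and the exact label strings are assumptions):

```python
# Share of each confirmed mode among all labeled trips.
mode_share = validation_trips['mode_confirm'].value_counts(normalize=True)

# Combined share of the "Gas car, *" classes (~46.32% per the counts above).
gas_car_share = mode_share[mode_share.index.str.startswith('Gas car')].sum()
print(f"{gas_car_share:.2%}")
```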

rahulkulhalli commented 1 year ago

For all three user_ids in question, the labels all belong to the top 4 most frequently occurring modes. This could be why we obtain extremely confident results for these users.

rahulkulhalli commented 1 year ago

I can perform an analysis on users who have modes exclusively belonging to the lower half of the frequency table and verify this assumption.

rahulkulhalli commented 1 year ago

> which immediately tells us we have a bad model of the underlying prediction making process (even though the predictions themselves work well in this context)
>
> Note that this is not the sensed-mode predictions. This is (if Wen has implemented it correctly) label assist using a random forest model. Random forest has been well-studied in the literature for multiple years now. Both the training and the test datasets are drawn from the trips for the same user - you can't get much more representative than that.
>
> I don't understand why we would have a bad model of the underlying prediction-making process in this case.

I believe it would be prudent to investigate the training procedure. In my experience, class imbalance can be a major factor behind a model that exhibits high variance and low bias. One way to remediate it would be to add class weights and enable bootstrapping; the latter adds a slight regularizing effect. See the sketch below.
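
A sketch of that remediation with scikit-learn's RandomForestClassifier (hyperparameters are illustrative, and X_train/y_train stand in for whatever features/labels the notebook builds):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # reweight classes inversely to their frequency
    bootstrap=True,           # each tree samples trips with replacement (mild regularizer)
    random_state=42,
)
# clf.fit(X_train, y_train)
```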

rahulkulhalli commented 1 year ago

@allenmichael099 I'm attaching the true mode distance proportions across both the splits:

image

shankari commented 1 year ago

> The label counts indicate that the "Gas car, *" classes occupy a majority of mode instances in the dataset. After computing the share, I observe that these two modes account for 46.32% of the total mode counts. In my opinion, since the user_id in question mostly has modes that occur with high frequency, we are overfitting to this user.

Are you concerned about overfitting to this user or to overfitting the model more generally?

Note that we are not super picky here - we just want the difference between GT and computed to be within 1 variance. And the difference is small; it is just that the variance is even smaller.

This "slight difference" in distributions matters!

We did sensitivity analysis on the multinomial distributions in Grace's paper, so we should be able to identify the point at which the difference starts to matter. But IIRC, it was not that small.

> by just applying these estimates to enough miles traveled.

For Grace's paper, we converted the numbers to km since the multinomial method only works for integers. Has that been done here?

If that has already been done, note that Grace's paper uses the combined distances across users, and uses CMs from two different programs, which are likely to be much more different than the CMs from the same user.

I don't see the probability-based CM here; @rahulkulhalli should verify https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/issues/1#issuecomment-1682794218

And the difference was within one variance in that case.