WenZhang-Vivien / downstream_matrices_evaluations


Running Wen's notebook #1

Open rahulkulhalli opened 1 year ago

rahulkulhalli commented 1 year ago

Re-running Wen's notebook as a sanity check

rahulkulhalli commented 1 year ago

Observation 1: Wen's notebook uses plotly as an additional dependency. Observation 2: The diff_SD_plot method also uses kaleido, plotly's static image export engine.

Installed both. Maybe it would be better to document these in the README?

rahulkulhalli commented 1 year ago

So Wen's notebook runs without a hitch. Some additional setup instructions are definitely required in the README.

rahulkulhalli commented 1 year ago

Now to think aloud about a pertinent issue: the use of k-fold CV during inference.

I have personally never encountered k-fold CV being used during inference. If the intended use case is to estimate uncertainty in the predictions, this is generally achieved by keeping a dropout layer (with P(drop)=0.5) active at inference time and re-running the same instance through the model multiple times. This approach is called MC (Monte Carlo) Dropout.

However, I do not know how this approach would translate to a bagging (random forest) model.
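
For reference, a minimal sketch of the MC Dropout idea described above (not from this repo; the PyTorch architecture and all names are purely illustrative):

```python
import torch
import torch.nn as nn

# A toy model with a dropout layer; sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # the layer we keep active at inference time
    nn.Linear(64, 8),
)

def mc_dropout_predict(model, x, n_samples=50):
    # The key trick: model.train() keeps dropout active, so each forward
    # pass samples a different sub-network.
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # Mean across samples is the prediction; std is an uncertainty estimate.
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(1, 16)  # one hypothetical input instance
mean_pred, uncertainty = mc_dropout_predict(model, x)
```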

rahulkulhalli commented 1 year ago

```python
import pandas as pd

model_names = list(performance_eval.PREDICTORS.keys())

# Run k-fold CV for every registered algorithm (performance_eval and the
# data maps are defined earlier in the notebook).
cv_results = performance_eval.cv_for_all_algs(
    uuid_list=all_users,
    expanded_trip_df_map=expanded_labeled_trip_df_map,
    model_names=model_names,
    override_prior_runs=False,
    k=4,  # 4-fold
    raise_errors=False,
    random_state=42,
)

RFc_df = pd.DataFrame(cv_results['random forests (coordinates)'])

## We do some metric scaling here. Not important for this analysis.

# get validation_trips
validation_trips = RFc_df[RFc_df['dataset'] == 'validation_dataset']

# get test_trips
test_trips = RFc_df[RFc_df['dataset'] != 'validation_dataset']
```

rahulkulhalli commented 1 year ago

At this point, we're just splitting the inference data into two sets. What happens from here onwards?


```python
# Metadata addition and modification.
validation_trips = validation_trips.rename(columns={"mode_initial": "mode_confirm"})
# Infer the OS from the trip's segmentation source.
validation_trips['os'] = ['ios' if x == 'DwellSegmentationDistFilter' else 'android' for x in validation_trips['source']]

validation_trips['user_id'] = validation_trips['user_id'].astype(str)
# This is an important step. Note that the user_ids are all bson.binary.Binary objects saved as subtype 3. Why don't we use user_id.get_as_uuid() instead?
```
rahulkulhalli commented 1 year ago
Individual analysis ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/b0fab0bf-fab3-40c1-a84c-1c97f3aa3627)
shankari commented 1 year ago

> Observation 1: Wen's notebook uses plotly as an additional dependency. Observation 2: The diff_SD_plot method also uses kaleido, plotly's static image export engine. Installed both. Maybe it would be better to document these in the README?

We don't have random READMEs in which we tell people to install software. We should add the new packages with appropriate versions to the environment.yml file for this repo

rahulkulhalli commented 1 year ago

We create user-specific confusion matrices based on an attribute (distance/duration) using pd.crosstab (see the sketch below). Note to self: these confusion matrices are derived from the testing data.
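
Not the notebook's code, but a minimal sketch of such a crosstab-based confusion matrix; mode_true/mode_pred appear as column names later in the thread, while the distance column name is an assumption:

```python
import pandas as pd

# Trip-count confusion matrix: rows = true mode, columns = predicted mode.
cm_counts = pd.crosstab(test_trips['mode_true'], test_trips['mode_pred'])

# Attribute-weighted confusion matrix, e.g. summing trip distance instead
# of counting trips ('distance' is an assumed column name).
cm_distance = pd.crosstab(
    test_trips['mode_true'],
    test_trips['mode_pred'],
    values=test_trips['distance'],
    aggfunc='sum',
).fillna(0)
```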

rahulkulhalli commented 1 year ago

Ah, elt stands for Expanded Labeled Trips.

rahulkulhalli commented 1 year ago
```text
For every user in the validation split:
  For every trip that the user has taken:
    calculate the mean and variance of the carbon emission for the trip   # (A)
    calculate the mean and variance of the energy consumption for a single user-labeled trip   # (B)
    compute (A) - (B)
  compute sum(A) and sum(B)   # (C)
  compute the relative error using (C)
  compute sum(A) - sum(B)
  append all the trip-level info to a copy of the trip data frame
```
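
Not from the notebook: a rough pandas translation of the loop above, with hypothetical column names (the notebook's actual names will differ):

```python
import pandas as pd

# Hypothetical column names; (A) = confusion-matrix-derived ("expected")
# values, (B) = values implied by the user's own labels.
trips = validation_trips.copy()
user_summaries = []

for user_id, user_trips in trips.groupby('user_id'):
    diff = user_trips['expected_EC_mean'] - user_trips['user_labeled_EC_mean']  # (A) - (B)
    trips.loc[user_trips.index, 'trip_level_diff'] = diff  # trip-level info

    sum_a = user_trips['expected_EC_mean'].sum()      # sum(A)
    sum_b = user_trips['user_labeled_EC_mean'].sum()  # sum(B); together (C)
    user_summaries.append({
        'user_id': user_id,
        'abs_error': sum_a - sum_b,
        'relative_error': (sum_a - sum_b) / sum_b,  # relative error using (C)
    })

summary_df = pd.DataFrame(user_summaries)
```
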
shankari commented 1 year ago

> last thing that Wen was working on was 3 users with a full CM, high accuracy but ~10 variance difference. why?! why?!

This is what you should have answers for

rahulkulhalli commented 1 year ago
true modes distribution: ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/44314a09-3ad6-467a-8049-28adc8f62631)

predicted modes distribution: ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/5d0f5113-66b9-4daf-b37f-91a31b7141bc)

Let's investigate further...

shankari commented 1 year ago

so the e7b2... user doesn't have the same set of modes for training and test. And for the other two, the numbers don't seem that small. The second user (ending in 2227) seems like an easy one to focus on since they have only two modes 😄

rahulkulhalli commented 1 year ago
| user_id | dif_expected_user_laberd_mean | expected_mean | user_labeled_mean | all_mode_expected_SD_EC |
| --- | --- | --- | --- | --- |
| 405b221a-be9e-43bc-86a5-7ca7fccf2227 | 2.4891 | 333.592204 | 331.103104 | 153.282462 |

The expected mean value is very close to the actual labeled mean value. Investigating further...

rahulkulhalli commented 1 year ago
```python
PRIMARY_ID = "405b221a-be9e-43bc-86a5-7ca7fccf2227"
grouped_df_sorted.loc[grouped_df_sorted.user_id == PRIMARY_ID, ['confusion_var', 'user_var', 'confusion_sd', 'user_sd']]
```

| confusion_var | user_var | confusion_sd | user_sd |
| --- | --- | --- | --- |
| 6851.595876 | 6521.285375 | 159.467879 | 155.714236 |

rahulkulhalli commented 1 year ago
```python
grouped_df_sorted.loc[~grouped_df_sorted.user_id.isin([PRIMARY_ID]), ['confusion_var', 'user_var', 'confusion_sd', 'user_sd']].describe()
```

| | confusion_var | user_var | confusion_sd | user_sd |
| --- | --- | --- | --- | --- |
| count | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
| mean | 5378.488085 | 1750.903962 | 144.961449 | 79.474468 |
| std | 37009.279968 | 9068.731209 | 242.422639 | 149.453618 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 26.708845 | 4.996648 | 13.633379 | 5.267473 |
| 50% | 208.123117 | 67.134296 | 56.066512 | 25.076507 |
| 75% | 1285.588001 | 534.730039 | 167.227026 | 80.076458 |
| max | 458908.264690 | 104438.604075 | 1708.455132 | 1048.542778 |

rahulkulhalli commented 1 year ago

The readings seem to be in order. Sure, this user's user_var is higher than the mean across all other users, but the maximum value is 104438.6, which rules out the suspicion that this user's variance is an outlier. Investigating further...

shankari commented 1 year ago

I am not sure I understand what this is showing. We are getting the entries (aka trips?) for this user and computing stats over the set of trips. But I thought we were no longer looking at the var and sd of individual trips, per the method in "Count Every Trip" and/or multinomial trips. @allenmichael099 can you confirm?

rahulkulhalli commented 1 year ago

Clocking out for a bit in anticipation of bad network reception. Will resume the investigation upon reaching the hotel.

rahulkulhalli commented 1 year ago

Uploading some results from yesterday's analysis.

Label distribution for 405b221a-be9e-43bc-86a5-7ca7fccf2227 (mode_pred=predicted and mode_true=GT) ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/8fbe3019-30cf-4919-8eb6-fdeefe2725bb)
rahulkulhalli commented 1 year ago

Hmm, the distance and duration distributions across the val and test splits are a little off, but not by much. Some aspects of Wen's work are still unclear to me. Now checking some more feature distributions.

rahulkulhalli commented 1 year ago

Distance and duration histograms for all users other than the target user_id. Longitudinal features are aggregated by computing the mean of each feature.

image

image

rahulkulhalli commented 1 year ago

Distances and durations for 405b221a-be9e-43bc-86a5-7ca7fccf2227

image

image

rahulkulhalli commented 1 year ago

image

Number of trips made per user. The wider bar colored red is the target ID. Note that there are user_ids with fewer trips than the target.

rahulkulhalli commented 1 year ago
![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/f68b5e0b-b1c0-4d38-a3a7-5d54f265ca29) Target user's histogram for distance (validation split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/44fdbb4d-ac22-4b46-83b9-9d32e207c09e) Target user's histogram for distance (validation split) grouped by true mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/c9feb854-0100-4c6d-b6c9-3adf3ad6a5e7) Target user's histogram for duration (validation split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/ab7c0579-7a5f-4c28-8078-88fa78f0ed80) Target user's histogram for duration (validation split) grouped by true mode
rahulkulhalli commented 1 year ago
![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/cf5065a2-f44c-43df-8423-428de19c6e7d) Target user's histogram for distance (test split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/8b75ce0e-277f-4fe6-8ed4-4b7ec15b212a) Target user's histogram for distance (test split) grouped by true mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/1a547766-8f68-4438-8327-b2439951736e) Target user's histogram for duration (test split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/bc9e763f-11a9-436d-a8cb-5e5a8408205d) Target user's histogram for duration (test split) grouped by true mode

rahulkulhalli commented 1 year ago

Confusion matrix on testing data for 405b221a-be9e-43bc-86a5-7ca7fccf2227:

image

Confusion matrix on validation data for 405b221a-be9e-43bc-86a5-7ca7fccf2227:

image
allenmichael099 commented 1 year ago

Thanks for these @rahulkulhalli! Here are my thoughts:

image

allenmichael099 commented 1 year ago

Adding to my thoughts above: The difference in P(true | predicted) between test and validation will likely be larger for the other participants with more modes. With the user above, there are only 2 modes, so with the number of trips present we could maybe estimate the distributions well-ish (I don't think we estimated the "alone" column well, though).

With the other participants, the distribution estimates will be even less certain since there are more modes. Also, there could easily be cases where the proportion of total distance traveled in mode A in test differs from the proportion of total distance traveled in mode A in validation. A way to check this more easily than with two histograms would be to calculate and compare proportions: (true mode distance proportions in test) vs. (true mode distance proportions in validation), e.g. 0.1, 0.9 vs. 0.2, 0.8. A sketch of this check follows below.
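
Not from the thread: a minimal pandas sketch of this proportion check, assuming mode_true and distance columns on the test/validation trip dataframes:

```python
import pandas as pd

def mode_distance_proportions(df):
    # Share of total distance traveled in each true mode.
    totals = df.groupby('mode_true')['distance'].sum()
    return totals / totals.sum()

# Align the two series so a mode missing from one split shows up as 0.
comparison = pd.concat(
    [mode_distance_proportions(test_trips),
     mode_distance_proportions(validation_trips)],
    axis=1, keys=['test', 'validation'],
).fillna(0)
print(comparison)  # e.g. a row per mode: 0.1 vs 0.2
```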

shankari commented 1 year ago

> which immediately tells us we have a bad model of the underlying prediction making process (even though the predictions themselves work well in this context)

Note that this is not the sensed-mode predictions. This is (if Wen has implemented it correctly) label assist using a random forest model. Random forest has been well-studied in the literature for multiple years now. Both the training and the test datasets are drawn from the trips for the same user - you can't get much more representative than that.

I don't understand why we would have a bad model of the underlying prediction-making process in this case.

shankari commented 1 year ago

@rahulkulhalli I would also like to see interpretations from you, and exploratory data analysis driven by the results that you are seeing.

rahulkulhalli commented 1 year ago

@shankari Yes, I'm formulating my analyses. My initial thoughts:

image

The label counts indicate that the "Gas car, *" classes occupy a majority of mode instances in the dataset. After computing the share, I observe that these two modes account for 46.32% of the total mode counts (see the sketch below for the computation). In my opinion, since the user_id in question mostly has modes that occur with high frequency, we are overfitting to this user.
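
For transparency, roughly how such a share could be computed (a sketch; the dataframe and the exact label strings are assumptions):

```python
# Share of each confirmed mode among all labeled trips.
mode_share = validation_trips['mode_confirm'].value_counts(normalize=True)

# Combined share of the "Gas car, *" classes (~46.32% per the counts above).
gas_car_share = mode_share[mode_share.index.str.startswith('Gas car')].sum()
print(f"{gas_car_share:.2%}")
```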

rahulkulhalli commented 1 year ago

For all three user_ids in question, the labels all belong to the top 4 most frequently occurring modes. This could be why we obtain extremely confident results for these users.

rahulkulhalli commented 1 year ago

I can perform an analysis on users who have modes exclusively belonging to the lower half of the frequency table and verify this assumption.

rahulkulhalli commented 1 year ago

> which immediately tells us we have a bad model of the underlying prediction making process (even though the predictions themselves work well in this context)
>
> Note that this is not the sensed-mode predictions. This is (if Wen has implemented it correctly) label assist using a random forest model. Random forest has been well-studied in the literature for multiple years now. Both the training and the test datasets are drawn from the trips for the same user - you can't get much more representative than that.
>
> I don't understand why we would have a bad model of the underlying prediction-making process in this case.

I believe it would be prudent to investigate the training procedure. In my experience, class imbalance can be a major factor behind a model that exhibits high variance and low bias. One way to remediate it would be to add class weights and enable bootstrapping; the latter adds a slight regularizing effect. See the sketch below.
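
A sketch of that remediation with scikit-learn's RandomForestClassifier (hyperparameters are illustrative, and X_train/y_train stand in for whatever features/labels the notebook builds):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # reweight classes inversely to their frequency
    bootstrap=True,           # each tree samples trips with replacement (mild regularizer)
    random_state=42,
)
# clf.fit(X_train, y_train)
```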

rahulkulhalli commented 1 year ago

@allenmichael099 I'm attaching the true mode distance proportions across both the splits:

image

shankari commented 1 year ago

> The label counts indicate that the "Gas car, *" classes occupy a majority of mode instances in the dataset. After computing the share, I observe that these two modes account for 46.32% of the total mode counts. In my opinion, since the user_id in question mostly has modes that occur with high frequency, we are overfitting to this user.

Are you concerned about overfitting to this user or to overfitting the model more generally?

Note that we are not super picky here - we just want the difference between GT and computed to be within 1 variance. And the difference is small; it is just that the variance is even smaller.

This "slight difference" in distributions matters!

We did sensitivity analysis on the multinomial distributions in Grace's paper, so we should be able to identify the point at which the difference starts to matter. But IIRC, it was not that small.

> by just applying these estimates to enough miles traveled.

For Grace's paper, we converted the numbers to km since the multinomial method only works for integers. Has that been done here?

If that has already been done, note that Grace's paper uses the combined distances across users, and uses CMs from two different programs, which are likely to be much more different than the CMs from the same user.

I don't see the probability-based CM here; @rahulkulhalli should verify https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/issues/1#issuecomment-1682794218

And the difference was within one variance in that case.