Investigating the high variance counts for certain users and modes in label assist

rahulkulhalli commented 1 year ago

A continuation of the conversation regarding the evaluation of Wen's notebook.

Since I do not have write access in Wen's repository, I cannot move the issue here directly.

Until I can find a way to move the previous conversations here, the previous conversation may be found here: https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/issues/1

shankari commented 1 year ago

@rahulkulhalli I think you have to copy and paste the comments manually. To move, you need write access to both repos. The only way to accomplish that would be for Wen to give me write access to her repo.

rahulkulhalli commented 1 year ago

Shall I upload screenshots? That should help expedite the process.

shankari commented 1 year ago

Copying and pasting the comments should not take a lot longer than screenshots (edit -> select all -> copy -> paste) and will retain searchability

rahulkulhalli commented 1 year ago

All the previously made comments are copied after this message:

rahulkulhalli commented 1 year ago

Rahul: Re-running Wen's notebook as a sanity check Observation 1: Wen's notebook uses plotly as an additional dependency. Observation 2: The diff_SD_plot method also uses a plotly backend called kaleido.

Installed both. Maybe having this in the README would be better?

So Wen's notebook runs without a hitch. Some additional instructions are definitely required in the readme.

Now to think aloud about a pertinent issue - the usage of k-fold CV during inference.

I have personally never encountered the usage of k-fold CV during inference. If the intended use-case is to introduce uncertainty in the predictions, this is generally achieved using a dropout layer with P(drop)=0.5 and re-run the same instance multiple times through the model. This approach is called MCMC Dropout.

However, I do not know how this approach can be translated to a bagging (random forest) model.

model_names = list(performance_eval.PREDICTORS.keys())

cv_results = performance_eval.cv_for_all_algs(
    uuid_list=all_users,
    expanded_trip_df_map=expanded_labeled_trip_df_map,
    model_names=model_names,
    override_prior_runs=False,
    k=4, # 4-fold 
    raise_errors=False,
    random_state=42,
)

RFc_df = pd.DataFrame(cv_results['random forests (coordinates)'])

## We do some metric scaling here. Not important for this analysis.

# get validation_trips
validation_trips = RFc_df[RFc_df['dataset'] == 'validation_dataset']

# get test_trips
test_trips = RFc_df[RFc_df['dataset'] != 'validation_dataset']

At this point, we're just splitting the inference data into two sets. What happens here onwards?

# Metadata addition and modification.
validation_trips = validation_trips.rename(columns={"mode_initial": "mode_confirm"})
validation_trips['os'] = ['ios' if x == 'DwellSegmentationDistFilter' else 'android' for x in validation_trips['source']]

validation_trips['user_id'] = validation_trips['user_id'].astype(str)
# This is an important step. Note that the user_ids are all bson.binary.Binary files saved as type 3. Why don't we use user_id.get_as_uuid() instead?

Individual analysis

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/b0fab0bf-fab3-40c1-a84c-1c97f3aa3627)

Shankari:

We don't have random READMEs in which we tell people to install software. We should add the new packages with appropriate versions to the environment.yml file for this repo

Rahul:

We create user-specific confusion matrices based on an attribute (distance/duration) by pd.crosstab. Note to self: these confusion matrices have been derived using the testing data.

Ah, elt stands for Expanded Labeled Trips.

For every user in the validation split,
  For every trip that the user has taken,
    calculate the mean and variance of the carbon emission for the trip.  # (A)
    calculate the mean energy consumption and variance for a single user labeled trip. # (B)
    compute (A) - (B)
  compute sum(A) and sum(B)  # (C)
  compute the relative error using (C)
  compute sum(A) - sum(B)
  append all the trip-level info to a copy of the trip data frame

Shankari:

last thing that Wen was working on was 3 users with a full CM, high accuracy but ~ 10 variance difference. why?! why?! This is what you should have answers for

Rahul:

true modes distribution:

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/44314a09-3ad6-467a-8049-28adc8f62631)

predicted modes distribution:

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/5d0f5113-66b9-4daf-b37f-91a31b7141bc)

Let's investigate further...

Shankari:

so the e7b2... user doesn't have the same set of modes for training and test. And for the other two, the numbers don't seem that small. The second user (ending in 2227) seems like an easy one to focus on since they have only two modes 😄

Rahul:

user_id	dif_expected_user_laberd_mean	expected_mean	user_labeled_mean	all_mode_expected_SD_EC
405b221a-be9e-43bc-86a5-7ca7fccf2227	2.4891	333.592204	331.103104	153.282462

The expected mean value is very close to the actual labeled mean value. Investigating further...

PRIMARY_ID = "405b221a-be9e-43bc-86a5-7ca7fccf2227"
grouped_df_sorted.loc[grouped_df_sorted.user_id == PRIMARY_ID, ['confusion_var', 'user_var', 'confusion_sd', 'user_sd']]

confusion_var	user_var	confusion_sd	user_sd
6851.595876	6521.285375	159.467879	155.714236

grouped_df_sorted.loc[~grouped_df_sorted.user_id.isin([PRIMARY_ID]), ['confusion_var', 'user_var', 'confusion_sd', 'user_sd']].describe()

	confusion_var	user_var	confusion_sd	user_sd
count	173.000000	173.000000	173.000000	173.000000
mean	5378.488085	1750.903962	144.961449	79.474468
std	37009.279968	9068.731209	242.422639	149.453618
min	0.000000	0.000000	0.000000	0.000000
25%	26.708845	4.996648	13.633379	5.267473
50%	208.123117	67.134296	56.066512	25.076507
75%	1285.588001	534.730039	167.227026	80.076458
max	458908.264690	104438.604075	1708.455132	1048.542778

The readings seem to be in order. Sure, the user_var is higher than the mean of all other observations, but the maximum value is 104438.6, which rules out this suspicion. Investigating further...

Shankari:

I am not sure I understand what this is showing. we are getting the entries? (aka trips?) for this user and computing stats over the set of trips. But I thought that we are not looking at var and sd of individual trips any more, per the method in "Count Every Trip" and/or multinomial trips. @allenmichael099 can you confirm?

Rahul:

Clocking out for a bit in anticipation of bad network reception. Will resume the investigation upon reaching the hotel.

Uploading some results from yesterday's analysis.

Label distribution for 405b221a-be9e-43bc-86a5-7ca7fccf2227 (mode_pred=predicted and mode_true=GT)

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/8fbe3019-30cf-4919-8eb6-fdeefe2725bb)

Hmm, distance and duration distributions across the val and test splits are a little off, but not by much. Some things about Wen's work are still unclear to me. Now checking some more feature distributions.

Distance and duration histograms for all other users other than the target user_id. Longitudinal features are aggregated by computing the mean of each feature.

Distances and durations for 405b221a-be9e-43bc-86a5-7ca7fccf2227

Number of trips made per user. The wider bar colored red is the target ID. Note that there are user_ids with lower number of. trips as compared to the target.

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/f68b5e0b-b1c0-4d38-a3a7-5d54f265ca29)

Target user's histogram for distance (validation split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/44fdbb4d-ac22-4b46-83b9-9d32e207c09e)

Target user's histogram for distance (validation split) grouped by true mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/c9feb854-0100-4c6d-b6c9-3adf3ad6a5e7)

Target user's histogram for duration (validation split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/ab7c0579-7a5f-4c28-8078-88fa78f0ed80)

Target user's histogram for duration (validation split) grouped by true mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/cf5065a2-f44c-43df-8423-428de19c6e7d)

Target user's histogram for distance (test split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/8b75ce0e-277f-4fe6-8ed4-4b7ec15b212a)

Target user's histogram for distance (test split) grouped by true mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/1a547766-8f68-4438-8327-b2439951736e)

Target user's histogram for duration (test split) grouped by predicted mode

![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/bc9e763f-11a9-436d-a8cb-5e5a8408205d)

Target user's histogram for duration (test split) grouped by true mode

Confusion matrix on testing data for 405b221a-be9e-43bc-86a5-7ca7fccf2227:

Confusion matrix on validation data for 405b221a-be9e-43bc-86a5-7ca7fccf2227:

Michael: Thanks for these @rahulkulhalli! Here are my thoughts: - Looking at "Target user's histogram for distance (test split) grouped by true mode" and "Target user's histogram for distance (validation split) grouped by true mode": - The participant traveled different spreads of distances in the test and validation sets for drove alone and for shared ride. - - Shorter car trips are more common in the test set than the validation set. - Looking at the confusion matrices: - since there is never a misprediction of (gt with others, pred alone), we're way too confident in drove alone predictions. P(gt alone | pred alone) = 1, variance = 0. - in the test set, we have: P( gt alone | pred with others) = 0.007834, P(gt w others | pred with others) = 0.9922 - in the validation set we have: P( gt alone | pred with others) = 0.01765, P(gt w others | pred with others) = 0.9823 - This "slight difference" in distributions matters! you can get an error that is an arbitrary "number of variances" by just applying these estimates to enough miles traveled. In our case, the distance is large enough that the error is bigger than one variance. (still not the measure you want to use anyway. The sd is 3 orders of magnitude smaller than the error, which immediately tells us we have a bad model of the underlying prediction making process (even though the predictions themselves work well in this context))

Adding to my thoughts above: The difference in P(true | predicted) between test and validation will likely be larger for the other participants with more modes. With the user above, there are only 2 modes, so with the number of trips present we could maybe estimate distributions well-ish ( I don't think we estimated the "alone" column well though). With the other participants, the distribution estimates will be even less certain since there are more modes. Also, there could easily be cases where proportion of total distance traveled in mode A in test is different than proportion of total distance traveled in mode A in validation. A way to check this more easily than 2 histograms would be to calculate and compare proportions: (true mode distance proportions in test) vs (true mode distance proportions in validation) (eg 0.1, 0.9 vs 0.2, 0.8)

Shankari: > which immediately tells us we have a bad model of the underlying prediction making process (even though the predictions themselves work well in this context)) Note that this is not the sensed-mode predictions. This is (if Wen has implemented it correctly) label assist using a random forest model. Random forest has been well-studied in the literature for multiple years now. Both the training and the test datasets are drawn from the trips for the same user - you can't get much more representative than that. I don't understand why would we have a bad model of the underlying prediction making process in this case @rahulkulhalli I would also like to see interpretations from you, and exploratory data analysis driven by the results that you are seeing.

Rahul: @shankari Yes, I'm formulating my analyses. My initial thoughts:

The label counts indicate that the "Gas car, *" classes occupy a majority of mode instances in the dataset. After computing the share, I observe that the two modes account for 46.32% of the total mode counts. In my opinion, since the user_id in question contains modes that are high in frequency, we are overfitting to this user. For all the three user_ids in question, their labels all belong to the top 4 most frequently occurring modes. This could be why we obtain extremely confident results for these users. I can perform an analysis on users who have modes exclusively belonging to the lower half of the frequency table and verify this assumption. > > which immediately tells us we have a bad model of the underlying prediction making process (even though the predictions themselves work well in this context)) > Note that this is not the sensed-mode predictions. This is (if Wen has implemented it correctly) label assist using a random forest model. Random forest has been well-studied in the literature for multiple years now. Both the training and the test datasets are drawn from the trips for the same user - you can't get much more representative than that. > > I don't understand why would we have a bad model of the underlying prediction making process in this case I believe it would be prudent to investigate the training procedure once. In my experience, class imbalance may be a major factor for a model that exhibits high variance and low bias. A way to remediate it would be to add class weights and enable bootstrapping. The latter adds a slight regularizing effect. @allenmichael099 I'm attaching the true mode distance proportions across both the splits: ![image](https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/assets/17728123/6d977c1b-82e3-477c-8b62-8f2d2545e43d)

Shankari: > The label counts indicate that the "Gas car, *" classes occupy a majority of mode instances in the dataset. After computing the share, I observe that the two modes account for 46.32% of the total mode counts. In my opinion, since the user_id in question contains modes that are high in frequency, we are overfitting to this user. Are you concerned about overfitting to this user or to overfitting the model more generally? - if overfitting to this user, that is by design. The label-assist models are user-specific models. Both the training and testing dataset are from the data generated by a single user - if overfitting the model, I don't see that as the problem here. In general, my understanding is that if we overfit the model to the training set, we will get low accuracy when we evaluate on a separate test dataset. But in this case, we are getting excellent accuracy on the test dataset. Arguably we also get excellent accuracy on the validation dataset (you can verify this). Note that we are not super picky here - we just want the difference between GT and computed to be within 1 variance. And the difference is small; it is just that the variance is even smaller. > This "slight difference" in distributions matters! We did sensitivity analysis on the multinomial distributions in Grace's paper, so we should be able to identify what point that is. But IIRC, it was not that small. > by just applying these estimates to enough miles traveled. For Grace's paper, we converted the numbers to km since the multinomial method only works for integers. Has that been done here? If that has already been done, note that Grace's paper uses the combined distances across users, and uses CMs from two different programs, which are likely to be much more different than the CMs from the same user. I don't see the probability-based CM here, @rahulkulhalli should verify https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/issues/1#issuecomment-1682794218 And the difference was within one variance in that case.

rahulkulhalli commented 1 year ago

New conversations will continue from here.

rahulkulhalli commented 1 year ago

The per-split proportion of distance traveled per mode. (Grouped by user, grouped by mode_true)

rahulkulhalli commented 1 year ago

CM data for the three users (L: test set, R: val set). As mentioned by @shankari, the previous validation CM is incorrectly scaled.

rahulkulhalli commented 1 year ago

Q to self: Is the underlying model user-specific or is it an aggregate model?

To answer this, I need to delve into the cv_for_all_algs method.


model_names = list(performance_eval.PREDICTORS.keys())

cv_results = performance_eval.cv_for_all_algs(
    uuid_list=all_users,
    expanded_trip_df_map=expanded_labeled_trip_df_map,
    model_names=model_names,
    override_prior_runs=False,
    k=4, # 4-fold 
    raise_errors=False,
    random_state=42,
)

# in the cv_for_all_algs method:

for model_name in model_names:
        csv_path = f'first_trial_results/cv results {model_name}.csv'
        if not override_prior_runs and os.path.exists(csv_path):
            print('loading prior cross validation data for model:', model_name)
            cv_df = pd.read_csv(csv_path,
                                keep_default_na=False,
                                na_values=[''])            
            cv_df = cv_df.drop(['Unnamed: 0'], axis=1)

Okay, so we simply read off of a cached CSV file. That's intuitive, but where is the training? Ah, it's in the else condition.

It seems that the model hyperparameters are retrieved from a global dict and passed to the cv_for_all_users method. Let's follow that invocation chain...

# in the cv_for_all_users method:

for user in uuid_list:
        try:

            # Lot of comments here. Removed for succintness.

            results = k_fold_cross_val_predict(model,
                                        model_params,
                                        user_df=expanded_trip_df_map[user],
                                        k=k,
                                        random_state=random_state,
                                        min_samples=min_samples)

Okay, so my initial hunch is that the models seem to have been trained on a per-user basis. Following the k_fold_cross_val_predict method now:


kfolds = StratifiedKFold(n_splits=k, random_state=random_state, shuffle=True)

if model_params is not None:
        model_.set_params(model_params)

# Some sort of cleaning and preprocessing happens here.
user_df = model_._clean_data(user_df)
user_df = model_.preprocess_data(user_df)

# Hmm, why is a hard-coded seed passed here even if the method has a random state argument? Not a deal-breaker.
temp_df, validation_data = train_test_split(modified_df, test_size=0.2, random_state=49, stratify = modified_df.mode_true)

# This could be done better by using random.choice() just once with the 'n' parameter. Not a deal-breaker.
for _, row in single_count_df.iterrows():
    random_set = np.random.choice(["temp", "validation"], p=[0.8, 0.2])
    if random_set == "temp":
        temp_df = temp_df.append(row)
    else:
        validation_data = validation_data.append(row)

rahulkulhalli commented 1 year ago

Delved DEEP into the training code because I couldn't see the dependent variable being decoupled from the independent variable set. Finally found the decoupling in the fit method in models.py.

rahulkulhalli commented 1 year ago

Just an observation: The terminologies used during training are a little confusing. The train/test split was termed temp and validation data. Furthermore, when this 'temp' data is used to create the k-folds, the sub-splits are now called the train and test data.

rahulkulhalli commented 1 year ago


# Using temp_df to fit the model for predicting validation dataset
model_.fit(temp_df)

This does not make sense to me. We've already trained on this data by creating k folds of train-val data. Why are we fitting it to the model again? This would over-write the learned model parameters. Is this intentional?

We are also using the SAME model instance in every fold! This is not ideal; we should ideally be initializing the model INSIDE the k-fold loop. By not doing this, we keep updating the same model's parameters!

To illustrate what is happening:


temp, val = do_preprocessing()

model = Model()

for (train_ix, test_ix) in k_fold.split(temp, temp['mode_true']):
    train_split = data.iloc[train_ix]
    test_split = data.iloc[test_ix]

    # notice what is happening here - the same model's parameters keep getting overwritten.
    model.fit(train_split)

    # perform inference using the test_split

# Even after the iterative training, we call .fit() on the entire temp dataset here. Why?
model.fit(temp)

rahulkulhalli commented 1 year ago

We now know that the modeling was intended at the user level. I noticed three things:

No model storage: Once the model is trained, it is used for inference (results are captured) and disposed.
The model's API is coupled with data preprocessing. These should ideally be different objects.
The models may not have been trained properly (see my comment above). If this is true, the re-training process must be completed prior to performing any downstream analyses.

shankari commented 1 year ago

No model storage: Once the model is trained, it is used for inference (results are captured) and disposed.

Correct. That is fine for exploratory analysis, since we will not be applying new, incoming data to the model for prediction. The associated data collection effort ended in Dec 2022, so we are no longer getting data from these users. Since we are training user-specific models, we are developing and evaluating a methodology here, not a model.

The model's API is coupled with data preprocessing. These should ideally be different objects.

Can you point to where this is happening? I believe that Hannah's work did a pretty decent job wrt handling pre-processing and standard model interfaces (@humbleOldSage), but I have not reviewed Wen's code at all.

EDIT: I see the do_preprocessing. I am going to take a look at Wen's code tonight or over the weekend

The models may not have been trained properly (see my comment above). If this is true, the re-training process must be completed prior to performing any downstream analyses.

Agree that this seems super weird. That alone might explain the inconsistent results. I assume you are fixing this and retraining to see if that fixes everything...

rahulkulhalli commented 1 year ago

Yes, I am currently re-training the models. I need to find a way to decouple the preprocessing stage, but I can revisit that as a TODO optimization for now.

rahulkulhalli commented 1 year ago

Current update: Refactored the code and decoupled preprocessing. Modeling-specific enhancements:

Changed the place where the model was invoked from outside the k-fold iteration to INSIDE the loop
Added class weights to the model to handle imbalanced class distributions (can be turned off; just experimenting for now)
Added F1 score evaluation for the k-fold splits

rahulkulhalli commented 1 year ago

I have a strong opinion on the current training process:

If we're using fixed hyper-parameters, why are we using k-fold CV? k-fold CV is used for optimal hyper-parameter selection; once the optimal configuration is found, we train on ONE 80-20 train-test split and report the results.
My opinion is: let's perform k-fold CV (for each user) with a variety of hyper-parameters and select the best HP settings for the specific user; then, we use that model to train on an 80-20 split.

shankari commented 1 year ago

I think that part of the reason we went with k-fold cross-validation was that either for this paper or a prior iteration of this paper, we got feedback saying "you are not doing this properly, have you have considered doing k-fold cross validation" when we were doing k-fold cross validation already.

So I think we interpreted that as "reviewer wants to see k-fold crossvalidation even for the evaluation"

Did we use the same method for Hannah's paper or did we do 1 train + 1 test + 1 validation for that? I think that that least in the journal version, we did 5 fold CV, computed the accuracy of each fold and then took mean or something like that. I'm find with changing to use CV purely for hyperparameter tuning + final evaluation round as long as we can find sufficient related work and a justification for why it is better
- hypothesis: re-training on optimal parameters is better because then we can use a larger training set and test set so instead 60:20, we can get to 80:20
  - validate this hypothesis
- we should potentially dig deeper into does this vary depending on dataset size per user
  - maybe one approach works better for users with small number of labeled trips
  - another works better for larger number of labeled trips
  - note that the mean/median of the splits approach could allow us to do some form of ensemble on the values we get from the splits which we should compare with the others.

validate the hypothesis around weights
- I expect it to work better but we should show that it does and quantify how much better it is
re-running with the modification
- k-fold cross-validation: 80:20 split, mean and median of each fold and plot across users in two graphs (one for median and one for mean, x axis is user labeled trip count so we can see that correlation between F-score and method and dataset size)
- k-fold cross-validation: 80:20 split for tuning, run 80:20 split on the full dataset with optimal parameters (per user)
from a computer perspective, we can do this for all users but I'm not sure the result will be readable. If we can try this with something like seaborn which continuous linegraph with error bars. If it is useful, we can keep it, or we can down to interesting users. box plots across F-scores for each split for a small subset of "interesting" users.

Compare with what we have already done in Hannah's paper, which I think is "mean F-scores across splits" using the sklearn method.

rahulkulhalli commented 1 year ago

Continuing my investigation. So far, this is what I've observed:

The model API has been written in a very hacky way. It is very hard for me to add extra keyword arguments to the classifier through the current API
The bootstrap parameter was set to False. Was this intentional? If random forests are allowed to bootstrap, they create trees on different sub-samples of data. This allows the model to generalize much better as compared to fitting a tree on the entire dataset. By disabling the bootstrap feature, we're basically creating n_estimator trees, and each tree is fit on the entire dataset.

My strategy is as follows:


train, test.= create_train_test_split(test_size=0.2, stratify=data.mode_true, ...)

# create parameter grid

for parameter_config in parameter_grid:
    for (train_ix, test_ix) in kfolds.split(train, train['mode_true']):
        model = Model()
        model.set_params(parameter_config)
        model.fit(train[train_ix])

       # record split-level validation performance
       model.predict(train[test_ix])

# this will give us the best-performing configuration over the 80% training split. 

# fit to the entire train dataset
final_model.fit(train)

# Now, we run inference on the test set ONLY once. 
final_model.predict(test)

# Going back and changing the model to increase the test score is adding manual bias.

This strategy is also sometimes called the nested cross-validation strategy, because this helps us choose the best hyperparameters for a given model over k-folds of the data.

rahulkulhalli commented 1 year ago

I am also attaching resources that strengthen my hypothesis:

rahulkulhalli commented 1 year ago

Q to self and @shankari: For each hyper parameter configuration, is averaging the model performance across the k-folds better or is argmax better?

I'd argue that averaging is better because it would capture the SNR better. Argmax would also be an over-optimistic model performance indicator in cases where a certain split was 'better' than the other splits.

However, we could also plot per-fold performance graphs. I am choosing the weighted F1 metric as our performance indicator.

rahulkulhalli commented 1 year ago

Training is in progress. The current values for the hyperparameter search are:


    PARAM_GRID = {
        'n_estimators': [100, 150, 200, 250, 300],
        # max_depth adds a regularizing effect.
        'max_depth': [None] + [10, 50, 100],
        'min_samples_split': [2, 3],
        'min_samples_leaf': [1],
        'max_features': ['log2', 'sqrt', None],
        'bootstrap': [True],
        'class_weight': ['balanced', 'balanced_subsample']
    }

Now, what search strategy should we use? We have two options:

Grid search: Exhaustive search over every possible combination of the hyper-parameters. In theory, this is a better approach as compared to a random walk. However, in practice, this is a ridiculously expensive operation.

For instance, for the hyper-parameter settings above, we have 5*4*2*1*3*1*2 = 240 possible combinations. Now, for each of these 240 combinations, we run 5-fold CV. That brings the total number of iterations to 240*5 = 1200. We do this for three models - the mode predictor, the purpose predictor, and the replaced predictor. So 1200 * 3 = 3600 iterations for a SINGLE user.

Random search: As mentioned above, an exhaustive brute-force search is very computationally expensive and time-intensive. Random search will sample from the given ranges (uniformly) and try a combination of these samples for a maximum number of times. In our case, I've restricted the number of random samples to 20. Now, the number of iterations become 20*5*3 = 300 per user, which is much more feasible. The obvious pitfall of this approach is that we may not get the optimal combination of hyper-parameter settings, but if the possible ranges for each HP are kept to a minimum, we MAY happen to attain said combination.

Therefore, I have chosen the random walk with n=20 sampling iterations per user. We may choose to increase/decrease this number or even augment the range of each HP down the line.

rahulkulhalli commented 1 year ago

A sample log output for a user after random search:

INFO:root:Best HP for user <redacted>: {'n_estimators': 300, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'bootstrap': True, 'class_weight': 'balanced_subsample'}
INFO:root:Mean Val F1 for best HP settings: 0.7946092857251922

Initial metrics:

~5 minutes per user if said user has significant trips, ~3 minutes per user if said user has low-moderate number of trips. ~1.6 minutes/user on average
I have enabled logging into a local file to view training metrics on the go. This helps me modify the training process

shankari commented 1 year ago

The bootstrap parameter was set to False. Was this intentional? If random forests are allowed to bootstrap, they create trees on different sub-samples of data. This allows the model to generalize much better as compared to fitting a tree on the entire dataset. By disabling the bootstrap feature, we're basically creating n_estimator trees, and each tree is fit on the entire dataset.

Agreed on using a random seed instead.

I'd argue that averaging is better because it would capture the SNR better. Argmax would also be an over-optimistic model performance indicator in cases where a certain split was 'better' than the other splits.

I agree that if we had to choose now, argmax is sub-optimal. If we can plot them since we will get all the folds anyway, that might be nice just to make sure that everything looks good before we build on it further.

Therefore, I have chosen the random walk with n=20 sampling iterations per user. We may choose to increase/decrease this number or even augment the range of each HP down the line.

Random walk is fine for now. We can try grid search on HPC later if time permits.

~5 minutes per user if said user has significant trips, ~3 minutes per user if said user has low-moderate number of trips. ~1.6 minutes/user on average I have enabled logging into a local file to view training metrics on the go. This helps me modify the training process

Sounds good. My only comment on logging to file is that when I ran similar comparisons for the original paper, I had issues with the notebook hanging but I basically moved the modeling code to a python file (regenerate_classification_performance_results.py) so I could redirect stdout/stderr as well and not worry about browser hanging. Hannah was already saving the result to a file, so only the visualization happens in the notebook.

shankari commented 1 year ago

intuitively divide into 3 groups randomly pick 3 within each group 3 plots with 3 subplots each, or 9x9 subplots to visualize confusion matrices

rahulkulhalli commented 1 year ago

User subsampling is complete and initial results are ready. The following is the process I used to determine the strata of the user space:

Note: The allCEO data has 284 users in total.

Computing the percentage of labeled trips

user_df = n_trips_df.groupby('user_id').sum()[['all_trips', 'labeled_trips']].reset_index(drop=False, inplace=False)
user_df['labeled_ratio'] = (user_df['labeled_trips']/user_df['all_trips'])*100.

Selecting users with a labeling ratio of 0.6 or higher. I found this number to be suitable after plotting the histogram of the number of trips each user had taken.

user_df = user_df.loc[user_df.labeled_ratio >= 60, :]

Dividing the 1st level filter into three representative groups. The first group is where each user has taken <= 1000 trips, the second group is where each user has taken between 1001 and 1800 trips, and the last group is where each user has taken more than 1800 trips. These boundaries were determined after some experimentation.

group1 = user_df.loc[user_df.labeled_trips <= 1000, :]
group2 = user_df.loc[(user_df.labeled_trips > 1000) & (user_df.labeled_trips <= 1800), :]
group3 = user_df.loc[user_df.labeled_trips > 1800, :]

This gives me the following numbers:

group1 has 59 users, group2 has 14 users, group3 has 7 users
group1 has 18657 trips, group2 has 19319 trips, group3 has 17326 trips

Which is roughly equal in number of trips per category.

rahulkulhalli commented 1 year ago

Now, I know chose n=3 as the number of representative samples from each group. To do so, I uniformly sample 3 elements from each group (without replacement) and form my final cohort. Now, I'm ready to run an extensive hyper-parameter search and fitting process across all the 9 users.

rahulkulhalli commented 1 year ago

Albeit we can't use an exhaustive grid search (YET), I have bumped up the number of grid samples from 20 to 25. Now, the number of iterations are: 25 (HP settings) * 5 (folds) * 9 (users) = 1125. The training is ongoing since the past 43 minutes, and I will report here once I have the results.

shankari commented 1 year ago

@rahulkulhalli where are the initial results?

rahulkulhalli commented 1 year ago

@shankari Enclosing the initial results.

The user pool for initial training: ['355e25bd-fc24-4c5e-85d3-58e39432bd44', '00db212b-c8d0-44cd-8392-41ab4065e603', '1da31f30-f183-4fc5-bca5-a1bee71072bb', 'cba570ae-38f3-41fa-a625-7342727377b7', 'bf776197-ee89-4183-8a0a-04c7fa7228e2', 'ece8b0a5-0953-4e98-a0d3-69f25de4a206', 'c7ce889c-796f-4e2a-8859-fa2d7d5068fe', 'bf774cbe-6c30-40b0-a022-278d36a23f19', '0b3e78fa-91d8-4aa6-a320-3440143c8c16']

Observations:

Train-test performance is consistent across all users
Every user now has varying hyper-parameters

Results:

INFO:root:  --------------------------------------------------
INFO:root:  Best HP for user 355e25bd-fc24-4c5e-85d3-58e39432bd44: {'n_estimators': 150, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'log2', 'bootstrap': True, 'class_weight': 'balanced_subsample'}
INFO:root:  Mean Val F1 for best HP settings: 0.7206162477186813
INFO:root:  Test F1 for user 355e25bd-fc24-4c5e-85d3-58e39432bd44: 0.7055555555555556
INFO:root:  Modeling done for 355e25bd-fc24-4c5e-85d3-58e39432bd44
INFO:root:  Training for 355e25bd-fc24-4c5e-85d3-58e39432bd44 took 165.83971786499023 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user 00db212b-c8d0-44cd-8392-41ab4065e603: {'n_estimators': 300, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'bootstrap': True, 'class_weight': 'balanced'}
INFO:root:  Mean Val F1 for best HP settings: 0.6687525118477938
INFO:root:  Test F1 for user 00db212b-c8d0-44cd-8392-41ab4065e603: 0.7022667082195214
INFO:root:  Modeling done for 00db212b-c8d0-44cd-8392-41ab4065e603
INFO:root:  Training for 00db212b-c8d0-44cd-8392-41ab4065e603 took 300.12543296813965 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user 1da31f30-f183-4fc5-bca5-a1bee71072bb: {'n_estimators': 250, 'max_depth': 10, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'bootstrap': True, 'class_weight': 'balanced'}
INFO:root:  Mean Val F1 for best HP settings: 0.8152247885133826
INFO:root:  Test F1 for user 1da31f30-f183-4fc5-bca5-a1bee71072bb: 0.851013160210812
INFO:root:  Modeling done for 1da31f30-f183-4fc5-bca5-a1bee71072bb
INFO:root:  Training for 1da31f30-f183-4fc5-bca5-a1bee71072bb took 187.1464807987213 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user cba570ae-38f3-41fa-a625-7342727377b7: {'n_estimators': 300, 'max_depth': 10, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'bootstrap': True, 'class_weight': 'balanced_subsample'}
INFO:root:  Mean Val F1 for best HP settings: 0.8248276274591119
INFO:root:  Test F1 for user cba570ae-38f3-41fa-a625-7342727377b7: 0.8265114379825381
INFO:root:  Modeling done for cba570ae-38f3-41fa-a625-7342727377b7
INFO:root:  Training for cba570ae-38f3-41fa-a625-7342727377b7 took 344.9400601387024 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user bf776197-ee89-4183-8a0a-04c7fa7228e2: {'n_estimators': 300, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'bootstrap': True, 'class_weight': 'balanced_subsample'}
INFO:root:  Mean Val F1 for best HP settings: 0.7507172726979194
INFO:root:  Test F1 for user bf776197-ee89-4183-8a0a-04c7fa7228e2: 0.7647097796289921
INFO:root:  Modeling done for bf776197-ee89-4183-8a0a-04c7fa7228e2
INFO:root:  Training for bf776197-ee89-4183-8a0a-04c7fa7228e2 took 506.9399151802063 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user ece8b0a5-0953-4e98-a0d3-69f25de4a206: {'n_estimators': 150, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': None, 'bootstrap': True, 'class_weight': 'balanced_subsample'}
INFO:root:  Mean Val F1 for best HP settings: 0.7766333349063153
INFO:root:  Test F1 for user ece8b0a5-0953-4e98-a0d3-69f25de4a206: 0.8013531586934782
INFO:root:  Modeling done for ece8b0a5-0953-4e98-a0d3-69f25de4a206
INFO:root:  Training for ece8b0a5-0953-4e98-a0d3-69f25de4a206 took 575.6503212451935 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user c7ce889c-796f-4e2a-8859-fa2d7d5068fe: {'n_estimators': 250, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': None, 'bootstrap': True, 'class_weight': 'balanced'}
INFO:root:  Mean Val F1 for best HP settings: 0.731732063686907
INFO:root:  Test F1 for user c7ce889c-796f-4e2a-8859-fa2d7d5068fe: 0.7504759128439011
INFO:root:  Modeling done for c7ce889c-796f-4e2a-8859-fa2d7d5068fe
INFO:root:  Training for c7ce889c-796f-4e2a-8859-fa2d7d5068fe took 958.5499167442322 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user bf774cbe-6c30-40b0-a022-278d36a23f19: {'n_estimators': 300, 'max_depth': 100, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'log2', 'bootstrap': True, 'class_weight': 'balanced'}
INFO:root:  Mean Val F1 for best HP settings: 0.7314780718686599
INFO:root:  Test F1 for user bf774cbe-6c30-40b0-a022-278d36a23f19: 0.7246074849090828
INFO:root:  Modeling done for bf774cbe-6c30-40b0-a022-278d36a23f19
INFO:root:  Training for bf774cbe-6c30-40b0-a022-278d36a23f19 took 795.4598858356476 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO:root:  Best HP for user 0b3e78fa-91d8-4aa6-a320-3440143c8c16: {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'bootstrap': True, 'class_weight': 'balanced'}
INFO:root:  Mean Val F1 for best HP settings: 0.8918008591269659
INFO:root:  Test F1 for user 0b3e78fa-91d8-4aa6-a320-3440143c8c16: 0.8801300307364435
INFO:root:  Modeling done for 0b3e78fa-91d8-4aa6-a320-3440143c8c16
INFO:root:  Training for 0b3e78fa-91d8-4aa6-a320-3440143c8c16 took 771.0192790031433 seconds.
INFO:root:++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Now, the next step is to obtain the F1 scores for the same users using the previous method and draw comparisons

shankari commented 1 year ago

@rahulkulhalli these are the results after splitting the users into three buckets, right?

What about the comment from https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1699519034

Training is in progress. The current values for the hyperparameter search are:
which buckets are these users from?

rahulkulhalli commented 1 year ago

Yep, these are results after splitting the users into three buckets.

The first approach seems to be infeasible on my Mac. I kept the notebook running 12+ hours oddly enough, it had only processed ~120/284 users till then. This also happened a second time when I kept the training running in the background for ~6 hours. To make the most efficient use of time, I decided to start with the 9 user IDs first, report their results, and then go back to training on the entire dataset.

Ah, I forgot to mention which group each user was from. My apologies. I will add that info right now.

rahulkulhalli commented 1 year ago

Group 1 (Labeling ratio >= 60%, users with <= 1000 trips): '355e25bd-fc24-4c5e-85d3-58e39432bd44', '00db212b-c8d0-44cd-8392-41ab4065e603', '1da31f30-f183-4fc5-bca5-a1bee71072bb'

Group 2 (Labeling ratio >= 60%, users with > 1000 and <= 1800 trips): 'cba570ae-38f3-41fa-a625-7342727377b7', 'bf776197-ee89-4183-8a0a-04c7fa7228e2', 'ece8b0a5-0953-4e98-a0d3-69f25de4a206'

Group 3 (Labeling ratio >= 60%, users with > 1800 trips): 'c7ce889c-796f-4e2a-8859-fa2d7d5068fe', 'bf774cbe-6c30-40b0-a022-278d36a23f19', '0b3e78fa-91d8-4aa6-a320-3440143c8c16'

shankari commented 1 year ago

The first approach seems to be infeasible on my Mac. I kept the notebook running 12+ hours oddly enough, it had only processed ~120/284 users till then.

Please see my comment from https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1699603281 around moving the computation to a python script. Did you do that already?

rahulkulhalli commented 1 year ago

@shankari Yes, I did. I tried the second attempt using the script.

shankari commented 1 year ago

@rahulkulhalli so you wrote a script that corresponded to the code in the notebook and you ran it and...what happened?

How many users did it process?
Was there ongoing progress represented in the logs?

rahulkulhalli commented 1 year ago

Yes, I re-used the previous script with the logging enabled to a file. The training went on without a hitch for the first 80 users or so, but the performance started throttling soon after. It took about 8 hours for the first 80 users to be trained.

I kept the training on to finish overnight, but it seems the laptop idled, because there was no progress in the logs. In order to make the most efficient use of time, I decided to focus on the 9 users first and report their statistics.

rahulkulhalli commented 1 year ago

The following are the metrics using the previous approach:

User: 355e25bd-fc24-4c5e-85d3-58e39432bd44 val F1: 0.7258958145137007, test F1: 0.6644800925047741
User: 00db212b-c8d0-44cd-8392-41ab4065e603 val F1: 0.6939150692909879, test F1: 0.6812006905853845
User: 1da31f30-f183-4fc5-bca5-a1bee71072bb val F1: 0.7231059631059632, test F1: 0.8251474406996795
User: cba570ae-38f3-41fa-a625-7342727377b7 val F1: 0.8419228700201363, test F1: 0.8211590903889668
User: bf776197-ee89-4183-8a0a-04c7fa7228e2 val F1: 0.7601193067887864, test F1: 0.7436983856873711
User: ece8b0a5-0953-4e98-a0d3-69f25de4a206 val F1: 0.789176999760713, test F1: 0.7838428797094621
User: c7ce889c-796f-4e2a-8859-fa2d7d5068fe val F1: 0.7294965209611153, test F1: 0.7363946947711447
User: bf774cbe-6c30-40b0-a022-278d36a23f19 val F1: 0.7546109131300109, test F1: 0.7285851801762273
User: 0b3e78fa-91d8-4aa6-a320-3440143c8c16 val F1: 0.8871276443290376, test F1: 0.9031386565024732

Observations:

355e25bd-fc24-4c5e-85d3-58e39432bd44: Much better performance in the new method
00db212b-c8d0-44cd-8392-41ab4065e603: Lower validation f1, but better test f1 as compared to the older method
1da31f30-f183-4fc5-bca5-a1bee71072bb: Much better performance in the new method
cba570ae-38f3-41fa-a625-7342727377b7: Better performance in previous method, but newer method seems to have generalized better due to the difference between the validation and test F1
bf776197-ee89-4183-8a0a-04c7fa7228e2: Better performance in the new method
ece8b0a5-0953-4e98-a0d3-69f25de4a206: Better test F1 in the newer method, but seems to have generalized better in the previous method
c7ce889c-796f-4e2a-8859-fa2d7d5068fe: Same as the previous observation
bf774cbe-6c30-40b0-a022-278d36a23f19: Same test F1 across both, but seems to have generalized better in the new method
0b3e78fa-91d8-4aa6-a320-3440143c8c16: Lower test F1 than previous method, but indicates better generalization

rahulkulhalli commented 1 year ago

Opinion: Upon initial comparison, I think the newer approach is better than the previous one. Again, this is just a representative sample of the entire dataset. First, the validation F1 obtained from the new method is an average, so it's only an indicator on the actual validation performance. Second, these hyper-parameters may STILL not be the most optimal due to the sampling limit we've set. It may very well be that the grid search will yield better, more robust models.

Running this strategy across the entire dataset is a much better idea. However, we would ideally like to have access to the HPC.

rahulkulhalli commented 1 year ago

Next steps: Check the variance for the representative set. If there is a significant difference in the variance count, go back and re-train all the users. Otherwise, let's focus on the next stages after training.

rahulkulhalli commented 1 year ago

The next immediate step would be to compare the confusion matrices obtained using both the methods, run Wen's notebook, and see if there is a significant difference in the variance count. Ensure that you log your experiments in this issue as you're going!

shankari commented 1 year ago

The following are the metrics using the previous approach:

Can we have a side-by-side comparison of the previous and current approach?

rahulkulhalli commented 1 year ago

Q. to @shankari: Are there any other metrics that we'd like to use to compare the results obtained using both the methods?

rahulkulhalli commented 1 year ago

In my opinion, simply using the F1 score is not the right way to determine whether one model is doing better than the other. Instead, we should look at:

Count confusion matrix (which would help us to determine the precision and recall for each models)
Probability distribution of predicted labels v/s the probability distribution of the GT labels (this would help us understand whether the underlying model fit properly to the GT distribution)
A scatter plot of F1 scores w.r.t. the number of trips taken

rahulkulhalli commented 1 year ago

Now, why is the training process slowing down after 80 users or so?

Memory leaks?
Unoptimized script?

As @shankari mentioned, we must be doing something that causes performance throttling after a period of time. We should definitely pinpoint the issue before just migrating to an HPC environment and running the same script(s).

rahulkulhalli commented 1 year ago

Okay, now I'm diving head-first into the side-by-side comparison between the two methods. Here's what.I have in mind:

Retrieve/generate the predictions using the older and the newer method
Check the count-based confusion matrices side-by-side: do you see any difference? Has the precision/recall changed significantly?
Compute the cross-tabulation for both the methods and make a side-by-side comparison. Are there any significant changes?
Run Wen's code to compute the variance counts using both the predictions and see if there's a major difference in the numbers.

rahulkulhalli commented 1 year ago

Uploading the confusion matrices obtained using the NEW method.

![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/c46924fd-2dff-4eaf-9f39-023660f37fc8) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/80ea3cf6-e7bd-4300-a3ea-ec17ecacc339) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/1ef9c7cf-ed4b-4937-b90c-7c0e62826437) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/4e998b22-345a-4c00-b2ac-db1f585c9191) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/bb746e1e-e032-4ef9-b8ce-21896cf89259) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/59a8d557-629a-4ed8-8522-b879377f7d3f) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/e97a8a89-0119-4d2c-8434-a70c8bf71a76) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/57eecd75-a8bb-4a5f-bf60-cf5c0cc88768) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/e21fcded-a625-426b-852f-88fe6834e422)

New method confusion matrices

rahulkulhalli commented 1 year ago

Uploading the confusion matrices obtained from the PREVIOUS method:

Previous method confusion matrices

![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/fe657d66-141c-4861-8622-c178baad8737) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/649398e7-b118-47a1-aaa0-49a37f54323f) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/f40aea38-fac5-43bf-ace2-4440b7464740) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/2af4aa86-c2d9-450d-8c44-3f238d46ab3f) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/1091b8b5-c0e0-4989-87b5-c2aec88a5bbe) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/65b6ff0a-0092-412e-baae-a5227180c3ff) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/d7be3c54-a8c7-47e9-9e8a-f74989014843) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/997022f8-1e6a-43ed-a591-b03c11c13ad2) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/f8f4ce5b-ad17-4cfc-8720-8172213020a1)

shankari commented 1 year ago

Can we see these side by side in a table? Also, skimming through the outputs, it looks like the numbers are off by quite a bit - should be able to validate if they are side by side.

rahulkulhalli commented 1 year ago

Agreed. I'm figuring out how we could do a side-by-side better,

e-mission / e-mission-docs

Investigating the high variance counts for certain users and modes in label assist #951