rahulkulhalli opened 10 months ago
Also uploading the side-by-side F1 comparison for a better perspective:
@rahulkulhalli can you clarify what "old test" is here?
Success! I was able to (somewhat) consolidate all the CMs together
@shankari "old" is the previously-used modeling strategy (before fixing the hyper-parameter search and per-user model). I agree that it's a misnomer
I can immediately see that the sums of the values in "old test" (previously-used modeling strategy) and "new test" (altered modeling strategy) are wildly different. Why is that? Let me investigate!
By "previously-used modeling strategy (before fixing the hyper-parameter search and per-user model)", you mean that it is without the fix to stop using the same model in every fold? https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1692154598
Depending on how you are computing the CM, and where the values that are represented in the CM come from, that might account for the difference...
Yes, that would be correct. I am comparing the confusion matrices between the two methods.
"old" -> the previous modeling strategy where each fold was fit to the same model instance and THEN the entire training data was also fit to the same model instance
"new" -> the current strategy where we use hyper-parameter search and user-specific models
Ah! There's a shape mismatch between the previous test data and the current test data. I'll have to investigate the performance_eval.py script and see where the discrepancy lies.
New val: (145, 59), Old val: (145, 59), New test: (14475, 59), Old test: (579, 59)
New val: (464, 59), Old val: (464, 59), New test: (46275, 59), Old test: (1851, 59)
New val: (84, 59), Old val: (85, 59), New test: (8400, 59), Old test: (335, 59)
New val: (45, 59), Old val: (45, 59), New test: (4425, 59), Old test: (177, 59)
New val: (628, 59), Old val: (629, 59), New test: (62725, 59), Old test: (2508, 59)
New val: (285, 59), Old val: (283, 59), New test: (28225, 59), Old test: (1131, 59)
New val: (649, 59), Old val: (649, 59), New test: (64925, 59), Old test: (2597, 59)
New val: (239, 59), Old val: (239, 59), New test: (23925, 59), Old test: (957, 59)
New val: (322, 59), Old val: (323, 59), New test: (32200, 59), Old test: (1287, 59)
Are you sure that the mismatch is not related to the model reuse? https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1707367042
The mismatch is definitely due to an inconsistency in the performance_eval.py script.
It is happening because even the test data has been used incorrectly in the previous implementation. To clarify, I will attempt to explain the process through the code itself. I am typing that out in the next comment.
# 1: The model is initialized here
model_ = model()
# 2: the model hyper-params are set here
if model_params is not None:
    model_.set_params(model_params)
# 3: As I mentioned earlier, the previous approach did not have a decoupled preprocessing API. My code fixes this.
user_df = model_._clean_data(user_df)
user_df = model_.preprocess_data(user_df)
# 4: The data is split here. Note that here, the terminology should be: train_df, test_df.
temp_df, validation_data = train_test_split(modified_df, test_size=0.2, random_state=49, stratify = modified_df.mode_true)
# Append some data to the temp and validation data.
# 5: Here is where the k-fold training starts:
for _, (train_idx, test_idx) in enumerate(kfolds.split(temp_df, temp_df['mode_true'])):
    train_trips = temp_df.iloc[train_idx]
    test_trips = temp_df.iloc[test_idx]
    # 6: The outer model instance is fit to the train sub-split.
    model_.fit(train_trips)
    # 7: The validation sub-split is used to generate predictions for test_trips! This is not right!
    test_trips.reset_index(drop=True, inplace=True)
    pred_df = model_.predict(test_trips)
    next_mode_pred = pred_df['mode_pred']
    next_purpose_pred = pred_df['purpose_pred']
    next_replaced_pred = pred_df['replaced_pred']
    # 8: Intermediate values are appended to global lists here.
    mode_pred += list(next_mode_pred)
    purpose_pred += list(next_purpose_pred)
    replaced_pred += list(next_replaced_pred)
    distance += list(test_trips.distance)
    duration += list(test_trips.duration)
    # The fold number is recorded here.
    fold_number_List += [fold_number] * len(test_trips)
    fold_number += 1
# 9: The SAME outer model instance is used to fit on the ENTIRE training data!
model_.fit(temp_df)
# 10: Now, the validation data is inferred.
validation_data.reset_index(drop=True, inplace=True)
pred_df_vaild = model_.predict(validation_data)
validation_data['mode_pred'] = pred_df_vaild['mode_pred']
# returns a dict of info here
Basically, the mode_pred column is derived from the validation data sub-split. Ideally, every split should just be discarded as soon as the validation metrics are computed. However, the previous method treats the samples from these splits as testing data. This is why I was worried about data leakage: the model has seen the test data because we re-fit the same model on the entire temp_df (comment 9).
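To contrast with the flow above, a leakage-free evaluation would fit a fresh model instance inside every fold, discard it once the fold's metrics are recorded, and never re-fit a fold model on the full temp_df. Here is a minimal sketch with a dummy stand-in model (all names are illustrative, not the script's):

```python
# Sketch of leakage-free k-fold evaluation. DummyModel stands in for the
# real classifier; the point is that a NEW instance is built in every fold.
from dataclasses import dataclass, field

@dataclass
class DummyModel:
    """Stand-in for the real model; just memorizes its training labels."""
    seen: list = field(default_factory=list)

    def fit(self, rows):
        self.seen = list(rows)

    def predict(self, rows):
        # Predict the majority label observed during fit.
        majority = max(set(self.seen), key=self.seen.count)
        return [majority] * len(rows)

def kfold_indices(n, k):
    """Contiguous k-fold index split over range(n) (no shuffling, for brevity)."""
    fold_size = n // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n) if j not in test]
        yield train, test

def evaluate(labels, k=3):
    """Fit a FRESH model inside every fold and discard it afterwards."""
    preds = []
    for train_idx, test_idx in kfold_indices(len(labels), k):
        model = DummyModel()  # new instance per fold: nothing carries over
        model.fit([labels[i] for i in train_idx])
        preds += model.predict([labels[i] for i in test_idx])
        # the fold model goes out of scope here; it is never re-fit on temp_df
    return preds

print(evaluate(["walk", "walk", "walk", "walk", "bike", "walk"], k=3))
# → ['walk', 'walk', 'walk', 'walk', 'walk', 'walk']
```

The key point is in the loop body: the model is constructed inside the fold, so nothing it learns in one fold can leak into any later evaluation step.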
@shankari I plan to resume investigating this first thing tomorrow. It's 7:10 p.m. here, so I'm clocking out for the day.
Just to clarify, I do not think that there is any model leakage at the high-level conceptual stage. However, I agree that this implementation is flawed.
temp_df, validation_data = train_test_split(modified_df, test_size=0.2, random_state=49, stratify = modified_df.mode_true)
Just curious: what is modified_df here? I don't see that in your code snippet. It doesn't really matter for your comments, but I would like to know what it was before we modified it :)
# 7: The validation sub-split is used to generate predictions for test_trips! This is not right!
I don't think your interpretation is correct here. We are predicting for test_trips, which we get from test_trips = temp_df.iloc[test_idx], and temp_df is the non-validation (aka train + test) set. We are not touching the validation_data here.
# 4: The data is split here. Note that here, the terminology should be: train_df, test_df.
But she is using this to split into validation and non-validation. So the terminology should be train_test_df, validation_df. Note also that the validation here is not really for validation, but for evaluation of the estimated value and variance. So it should really be train_test_df, evaluation_df.
# 9: The SAME outer model instance is used to fit on the ENTIRE training data! # 6: The outer model instance is fit to the train sub-split.
These two are the key problems IMHO. For 9, it is not just entire training data, it is entire training + test data.
wrt https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1707224696
I can see the case for "A scatter plot of F1 scores w.r.t. the number of trips taken" but note that our goal here is not to have a high-performing model. Our goal is to quantify the uncertainty of the metrics (counts/distances) given a model and its error characteristics.
So while it would be cool from a purely intellectual perspective to understand the performance of the model, I think it is much more important to move on to applying the multinomial distribution method to the new model results given the known errors with the old method.
I would encourage you NOT to treat the old method as a reference/baseline - it was written by Wen in ~2 months and has not really been reviewed by anybody.
wrt https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1707252597
I would suggest prioritizing the following tasks:
We already know that the counts differ, but do the normalized values also differ significantly?
Run Wen's code to compute the variance counts using both the predictions and see if there's a major difference in the numbers.
Yes, we want to get a quick intuition for this before expanding to the full dataset and/or working on accuracy and/or sensitivity analyses
Just curious: what is modified_df here? I don't see that in your code snippet. it doesn't really matter for your comments, but I would like to know what it was before we modified it :)
It has remained unchanged, Dr. Shankari. I haven't changed the naming convention at all. I was also very confused when I first read the code. My assumption is that the author named this modified_df to signal that it was data that had some values stripped from it here:
# Remove the single classes from the original DataFrame
modified_df = user_df[~user_df['mode_true'].isin(single_count)]
We then extract a series of data points that have a single count and randomly append them to either the temp_df or validation_df based on a p=[0.8, 0.2] likelihood.
I don't think your interpretation is correct here. We are predicting for test_trips, which we get from test_trips = temp_df.iloc[test_idx], and temp_df is the non-validation (aka train + test) set. We are not touching the validation_data here.
That is correct, but I'm a little perplexed w.r.t. the terminology used here. Stating the obvious here, but validation data usually refers to the hold-out fold that we're allowed to peek at while fine-tuning our model. Test data, however, is the part that we should only use once for inference. We cannot look at the test data and tune our HPs to optimize the performance, for that would be 'cheating'. Here, validation data refers to the test data and the test data refers to the validation data, which is why I feared data leakage. I think it would be easier for me to just view this as a train-test split and internally map train:temp_df and test:validation_df.
But she is using this to split into validation and non-validation. So the terminology should be train_test_df, validation_df. Note also that the validation here is not really for validation, but for evaluation of the estimated value and variance. So it should really be train_test_df, evaluation_df
Yes, agreed. This terminology makes much more sense to me as well :)
I would suggest prioritizing the following tasks:
Check the count-based confusion matrices side-by-side: do you see any difference? Has the precision/recall changed significantly? We already know that the counts differ, but do the normalized values also differ significantly?
Run Wen's code to compute the variance counts using both the predictions and see if there's a major difference in the numbers. Yes, we want to get a quick intuition for this before expanding to the full dataset and/or working on accuracy and/or sensitivity analyses
Noted. I shall focus on comparing the confusion matrices (normalized by true predictions) and see if the values change significantly. As recommended, I shall also keep my focus on the 9 candidates and cross-verify the results first before suggesting any more modeling and/or pipeline changes.
Ah, I overlooked this! The reason why the current method has WAY more observations in the test data as compared to the previous implementation is because we're running a nested loop in the current implementation! We only run the k-fold loop in the previous implementation, whereas now we're running the k-fold loop for every hyper-parameter configuration!
14475/579 = 25; 46275/1851 = 25. We sampled n=25 hyper-parameters, so this confirms why we're getting such a large number of observations!
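A quick arithmetic check of the reported shapes bears this out; the small deviations from exactly 25x line up with the off-by-one differences already visible in the val splits:

```python
# Sanity check on the reported shapes: if every one of the n=25 sampled
# hyper-parameter configurations re-runs the same k-fold loop, the new test
# sets should hold roughly 25x as many rows as the old ones.
old_test_rows = [579, 1851, 335, 177, 2508, 1131, 2597, 957, 1287]
new_test_rows = [14475, 46275, 8400, 4425, 62725, 28225, 64925, 23925, 32200]

ratios = [round(new / old, 2) for new, old in zip(new_test_rows, old_test_rows)]
print(ratios)  # every ratio is within a fraction of a percent of 25
```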
Hmm, so what should the ideal strategy be here? Here's what I'm thinking:
Just to be extremely clear here:
but validation data refers to the fold of hold-out data that we're allowed to peek and fine-tune our model.
Don't use the variable names to map to your prior work with ML. The validation_data is not being used to fine-tune the model. Instead, it is being used to compute the expected value and the variance for that set of trips, using the confusion matrix from the train/test sets, so that we can compare it to the ground truth and see if the ground truth falls within the uncertainty.
Think about our actual problem (which is not very common in ML and may require you to think more deeply than applying existing techniques), and then map it to the code. Feel free to change the variable names in your implementation.
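If I understand the intent, one way to sketch the expected-value/variance computation is to treat each predicted trip as a draw from the confusion-matrix row for its predicted label. This is only my reconstruction under a multinomial assumption, with hypothetical names; it is not Wen's actual code:

```python
# Hedged sketch (not the actual notebook code): estimate the expected count
# and variance of each true mode from predicted counts plus a confusion
# matrix normalized over the PREDICTED label, i.e. cm[p][t] ~= P(true=t | pred=p).

def expected_counts_and_variance(pred_counts, cm):
    """pred_counts: {pred_label: n}; cm: {pred_label: {true_label: P(true|pred)}}.

    Each predicted trip is treated as an independent draw from the confusion
    matrix row for its predicted label (a multinomial assumption)."""
    true_labels = sorted({t for row in cm.values() for t in row})
    expected, variance = {}, {}
    for t in true_labels:
        e = v = 0.0
        for p, n in pred_counts.items():
            prob = cm[p].get(t, 0.0)
            e += n * prob                 # expected contribution from pred label p
            v += n * prob * (1.0 - prob)  # per-draw binomial variance
        expected[t], variance[t] = e, v
    return expected, variance

# Tiny worked example: 10 trips predicted "walk", 5 predicted "bike".
cm = {"walk": {"walk": 0.9, "bike": 0.1},
      "bike": {"walk": 0.2, "bike": 0.8}}
exp, var = expected_counts_and_variance({"walk": 10, "bike": 5}, cm)
print(exp)  # expected true counts per mode
print(var)  # variances, comparable against ground truth counts
```

Under this sketch, the ground truth "falls within the uncertainty" when the true count lies within some number of standard deviations (sqrt of the variance) of the expected count.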
We then extract a series of data points that have a single count and randomly append them to either the temp_df or validation_df based on a p=[0.8,0.2] likelihood.
What is single_count and why are we stripping out entries that have it?
Just as we chose the best hyper-parameters using argmax, we use argmax to obtain the test results for those settings and ONLY return those test results.
For quick and dirty checking, I would go with this. Note again that our goal is not to build the best model. It is to estimate the uncertainty given a model and the error characteristics of the model. The model should be correct, obviously, with no leakage. But it doesn't have to be super good.
single_count is a data frame of all the entries that only have a single class entry. The authors obtain this using value_counts() on the mode_true column and filtering the row indices with value = 1. Once we split the modified_df into the temp and validation splits, we randomly assign each entry in the single_count dataframe to either the temp or validation data with a (0.8, 0.2) probability respectively.
As for the reason why these values are stripped, I do not understand completely. It would make sense to completely remove these entries, but the author seems to have chosen to salvage them.
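For concreteness, here is my reconstruction of the stripping step (illustrative, not the exact script). One plausible motivation: sklearn's train_test_split raises an error when stratify is given a class with only a single member, so singleton labels must be removed before the split and re-assigned afterwards:

```python
# Reconstruction of the single-count stripping step; variable names mirror
# the script but the data and logic here are a simplified illustration.
import pandas as pd
import numpy as np

user_df = pd.DataFrame({"mode_true": ["walk", "walk", "bike", "bike", "kayak"]})

# Labels that appear exactly once (e.g. a manually entered custom mode).
counts = user_df["mode_true"].value_counts()
single_count = counts[counts == 1].index.tolist()

# Remove them before the stratified split (stratification needs >= 2 per class)...
modified_df = user_df[~user_df["mode_true"].isin(single_count)]

# ...then re-assign each stripped row to temp or validation with p = [0.8, 0.2].
rng = np.random.default_rng(49)
single_count_df = user_df[user_df["mode_true"].isin(single_count)]
for _, row in single_count_df.iterrows():
    dest = rng.choice(["temp", "validation"], p=[0.8, 0.2])
    print(row["mode_true"], "->", dest)

print(sorted(modified_df["mode_true"].unique()))
```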
single_count is a data frame of all the entries that only have a single class entry. As for the reason why these values are stripped, I do not understand completely.
I think that this is to strip out custom entries (e.g. the user entered kayak manually). I don't know that it needs to be stripped, but I guess it doesn't hurt. You should check against the code in TRB_label_assist to see if this was copied over from there.
It would make sense to completely remove these entries, but the author seems to have chosen to salvage them.
The author is completely removing them: user_df[~user_df['mode_true'].isin(single_count)]. But in step 2 above, we are still working with user_df, so I am not sure where modified_df is being pre-processed, for example.
I just checked the original TRB_label_assist code here, but interestingly, I don't see the k_fold_cross_val_predict method there. Wen might have added that in, then. Interestingly, the code in TRB_label_assist actually DOES initialize a new model inside the k-fold iteration. Also, there is no train-test split used before the k-fold loop.
The author is completely removing them user_df[~user_df['mode_true'].isin(single_count)]
Yes, but we do append the discarded rows back into either validation or temp here:
for _, row in single_count_df.iterrows():
    random_set = np.random.choice(["temp", "validation"], p=[0.8, 0.2])
    if random_set == "temp":
        temp_df = temp_df.append(row)
    else:
        validation_data = validation_data.append(row)
user_df = preprocess(user_df)
# find single counts and remove them from user_df to create modified_df
# temp-val split on modified_df
# append discarded single counts to either temp or val df
If it would be easier, I would be supportive of just abandoning Wen's code and starting with Hannah's codebase to re-implement the code to compare estimated values + uncertainty against true values from scratch.
I don't think that the code was very good quality, and we might be spending more time than it is worth looking through and trying to understand her code. I do think that Hannah's code is correct and is much easier to work with as well.
@humbleOldSage is also working on that codebase, so you can have a partner who can read through the code with you as well.
Sure. I am almost done with the current script. I figured out a better way to retrieve the test scores corresponding to the best hyper-parameter settings. I agree, however - the codebase is a little finicky to work with. I can coordinate with @humbleOldSage and go through Hannah's code instead.
Status update: Got everything to work as expected. Currently re-training the models; 5/9 users done, 29 minutes elapsed.
Edit: Training is complete. I was reading Hannah's TRB_label_assist
while the operation completed.
The test F1 has also been updated:
I know we're not supposed to compare with Wen's code, but I just wished to demonstrate the change:
wait, how are you consistently getting a test F1 of close to one? That seems a bit suspicious, almost like overfitting.
Edit: Hmm, this is very suspicious. The test F1s should NOT be this high. There's got to be an issue somewhere. I'm double-checking for errors.
@shankari Yeah, I just typed that out too. Definitely shows signs of overfitting.
OH, I know what must've happened! My test prediction strategy is flawed. I need to correct that.
Here's what I was doing:
hp_list = list()
for every hp in hp_config:
    test_ix = list()
    for every (train_idx, test_idx) in kfold:
        test_ix += [test_idx]
        # compute f1
    # hp_list.append({'hp': hp, 'test_index': test_ix})
# pick best hp and test indices
The inherent issue here is that the test data is being saved for later. When we create the final model and fit to temp_df, this test data is already a part of temp! So we're DEFINITELY overfitting. 😢
Fixed. Should work properly now.
Fix: Instead of saving the test idx, make the prediction and save the prediction dataframe. That way, even when the model is discarded, we still have the test predictions.
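The corrected flow can be sketched roughly as follows (all names are illustrative, not the script's): predict inside each fold and keep the prediction frame, then argmax over the mean fold score so only the winning configuration's saved predictions are returned:

```python
# Sketch of the fix: store per-fold PREDICTIONS per hyper-parameter config,
# not test indices, so the winning config's test predictions survive even
# after the per-fold models are discarded.

def tune_and_collect(hp_configs, folds, fit_predict_score):
    """fit_predict_score(hp, train, test) -> (score, predictions).

    Returns the best hyper-parameter configuration and the predictions that
    were recorded for it; the final model is never re-fit on the full data."""
    results = []
    for hp in hp_configs:
        fold_scores, fold_preds = [], []
        for train, test in folds:
            score, preds = fit_predict_score(hp, train, test)  # fresh model inside
            fold_scores.append(score)
            fold_preds += preds  # save the predictions, NOT the test indices
        results.append({"hp": hp,
                        "mean_score": sum(fold_scores) / len(fold_scores),
                        "preds": fold_preds})
    best = max(results, key=lambda r: r["mean_score"])  # argmax over mean score
    return best["hp"], best["preds"]

# Toy check: hp "b" always scores higher, so only its predictions come back.
folds = [([1, 2], [3]), ([2, 3], [1])]
def toy(hp, train, test):
    return (0.9 if hp == "b" else 0.5), [f"{hp}-pred"] * len(test)
print(tune_and_collect(["a", "b"], folds, toy))  # → ('b', ['b-pred', 'b-pred'])
```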
@shankari I have kept the training active whilst I'm leaving for classes. I will be at the University till 7:30 p.m. EST, and I shall pick up where I left off when I return.
Uploading the new confusion matrices. The results are definitely in line now.
Also uploading the count-normalized confusion matrices. Since I'm using the inbuilt sklearn method to display confusion matrices, I don't have a lot of control over how the numbers in the CM are formatted. So instead of a 2x2 grid of CMs, I will be uploading a 4x1 grid instead, each having its own collapsible view.
Will continue further investigations tomorrow morning
At first glance, I don't see large differences between the normalized CMs. Unless there are errors in computing the variances, I anticipate that you will get results that are fairly similar to Wen's. I do want to see if those results vary by input size, though
Thank you for the feedback, Dr. Shankari. I am currently running Wen's notebook with the current results and seeing if the variance count changes drastically - I agree with your assumption that the results won't change drastically.
I do want to see if those results vary by input size, though
@shankari Would you please expand on this? I'm unsure as to what that means.
I do want to see if those results vary by input size, though
"Is the variance count needed dependent on the number of trips?"
We have users in three "buckets" wrt number of trips. I spot checked one user in each bucket yesterday and it looks like the users with more trips have more consistency between the test and evaluation CMs. So I would expect that their variance count will be lower. That would then indicate that this method only works for users with sufficient trips.
We need to discuss whether this is "sufficient labeled trips" (bad) or "sufficient inferred trips" (good). I think it is the latter, but need somebody to think more deeply about it and potentially run some experiments :)
Computed the variance counts for the 9 users for both the previous and current implementations:
Previous implementation:
Current implementation:
Observations:
@shankari Anything else that you observe here?
@rahulkulhalli 1. Given the issues with this implementation, I would want to first make sure that the plots are being computed correctly.
Agreed that the variance count spread is much better controlled with the new implementation. But it is still 2x variance, while 2x SD would be a better goal for a well characterized uncertainty.
Understood. I will work with Michael to double-check the validity of the plots. That, or I could push my code to my own branch so that you may also peruse through the implementation once.
I would also like to see number of trips vs. variance count
Yes, I am currently implementing those plots, Dr. Shankari.
Are any of the values with a variance (not variance count) < 1?
Yes, there are some values (in both the implementations) with variance < 1. I am attaching the results below. Some cells may be NaN because the specific users don't have those labels.
Previous implementation:
Current implementation:
Each user has a spread of values - are the outliers the ones where we have few trips for that mode?
Yes, I just cross-checked. However, it isn't necessarily the mode with the FEWEST number of trips.
Adding on to the last question - most of the outlier modes are the same across both the implementations except for two users. For user cba570ae-38f3-41fa-a625-7342727377b7, the outlier mode for the previous implementation is that of tow_truck, with a single trip instance. However, for the current implementation, the mode now becomes Walk, with 8 trip instances.
For user c7ce889c-796f-4e2a-8859-fa2d7d5068fe, the previous implementation's outlier mode is paddle_board, with a single trip instance. In the new implementation, the outlier mode is i_dont_know, with 4 trip instances.
I will work with Michael to double-check the validity of the plots.
I don't think Michael was involved in creating those plots, but sure, you could ask him to review them.
I am attaching the results below. Some cells may be NaN because the specific users don't have those labels.
What do those results show? What is the number in each cell?
For user cba570ae-38f3-41fa-a625-7342727377b7, the outlier mode for the previous implementation is that of tow_truck, with a single trip instance. However, for the current implementation, the mode now becomes Walk, with 8 trip instances.
I don't think that comparisons to the previous implementation are super relevant any more. What we really need to do is to understand the characteristics of the current implementation. For example, for walk with 8 trip instances, what is the actual variance, what is the SD, what is the total number of trips...
This is where the exploratory analysis starts - I would like you to explore this and understand the reasons for the difference being within a certain number of variance counts and bring them back to me.
A continuation of the conversation regarding the evaluation of Wen's notebook.
Since I do not have write access in Wen's repository, I cannot move the issue here directly.
Until I can find a way to move it, the previous conversation may be found here: https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/issues/1