e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Investigating the high variance counts for certain users and modes in label assist #951

Open rahulkulhalli opened 10 months ago

rahulkulhalli commented 10 months ago

A continuation of the conversation regarding the evaluation of Wen's notebook.

Since I do not have write access to Wen's repository, I cannot move the issue here directly.

Until I find a way to move the previous conversation here, it can be found at: https://github.com/WenZhang-Vivien/downstream_matrices_evaluations/issues/1

rahulkulhalli commented 10 months ago

Also uploading the side-by-side F1 comparison for a better perspective:

Image

shankari commented 10 months ago

@rahulkulhalli can you clarify what "old test" is here?

rahulkulhalli commented 10 months ago

Success! I was able to (somewhat) consolidate all the CMs together

CM comparison ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/b5924ea7-96e8-4d86-b614-05d4971e1c0f) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/2982a9a3-e712-4161-a126-45578099dfbe) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/873d783e-5bb2-4022-a702-0e282a3ded37) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/d506a140-1939-41aa-8e3a-cc2e2862e1bf) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/171939a4-b5bd-4023-9d53-ace66e81c0d7) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/bc215699-a5dc-4dd7-8dfd-a8d3671206f6) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/629e51a3-3e51-47f1-a5d4-596193fcf01b) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/6109994d-732b-402d-80ce-45eaf8ce69dd) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/6700f229-148d-4014-b5c8-a8e336dc44e8)
rahulkulhalli commented 10 months ago

@shankari "old" is the previously-used modeling strategy (before fixing the hyper-parameter search and per-user model). I agree that it's a misnomer

rahulkulhalli commented 10 months ago

I can immediately see that the sums of the values in "old test" (previously-used modeling strategy) and "new test" (altered modeling strategy) are wildly different. Why is that? Let me investigate!

shankari commented 10 months ago

By "previously-used modeling strategy (before fixing the hyper-parameter search and per-user model)", you mean that it is without the fix to stop using the same model in every fold? https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1692154598

Depending on how you are computing the CM, and where the values that are represented in the CM come from, that might account for the difference...

rahulkulhalli commented 10 months ago

Yes, that would be correct. I am comparing the confusion matrices between the two methods.

"old" -> the previous modeling strategy where each fold was fit to the same model instance and THEN the entire training data was also fit to the same model instance

"new" -> the current strategy where we use hyper-parameter search and user-specific models

rahulkulhalli commented 10 months ago

Ah! There's a shape mismatch between the previous test data and the current test data. I'll have to investigate the performance_eval.py script and see where the discrepancy lies.


New val: (145, 59), Old val: (145, 59), New test: (14475, 59), Old test: (579, 59)
New val: (464, 59), Old val: (464, 59), New test: (46275, 59), Old test: (1851, 59)
New val: (84, 59), Old val: (85, 59), New test: (8400, 59), Old test: (335, 59)
New val: (45, 59), Old val: (45, 59), New test: (4425, 59), Old test: (177, 59)
New val: (628, 59), Old val: (629, 59), New test: (62725, 59), Old test: (2508, 59)
New val: (285, 59), Old val: (283, 59), New test: (28225, 59), Old test: (1131, 59)
New val: (649, 59), Old val: (649, 59), New test: (64925, 59), Old test: (2597, 59)
New val: (239, 59), Old val: (239, 59), New test: (23925, 59), Old test: (957, 59)
New val: (322, 59), Old val: (323, 59), New test: (32200, 59), Old test: (1287, 59)
shankari commented 10 months ago

Are you sure that the mismatch is not related to the model reuse? https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1707367042

rahulkulhalli commented 10 months ago

The mismatch is definitely due to an inconsistency in the performance_eval.py script.

This is because the test data was also used incorrectly in the previous implementation. To clarify, I will walk through the process using the code itself; I am typing that out in the next comment.

rahulkulhalli commented 10 months ago

    # 1: The model is initialized here
    model_ = model()

    # 2: the model hyper-params are set here
    if model_params is not None:
        model_.set_params(model_params)

    # 3: As I mentioned earlier, the previous approach did not have a decoupled preprocessing API. My code fixes this.
    user_df = model_._clean_data(user_df)
    user_df = model_.preprocess_data(user_df)

    # 4: The data is split here. Note that here, the terminology should be: train_df, test_df.
    temp_df, validation_data = train_test_split(modified_df, test_size=0.2, random_state=49, stratify = modified_df.mode_true)

    # Append some data to the temp and validation data.

    # 5: Here is where the k-fold training starts:
    for _, (train_idx, test_idx) in enumerate(kfolds.split(temp_df, temp_df['mode_true'])):
        train_trips = temp_df.iloc[train_idx]
        test_trips = temp_df.iloc[test_idx]

        # 6: The outer model instance is fit to the train sub-split. 
        model_.fit(train_trips)

        # 7: The validation sub-split is used to generate predictions for test_trips! This is not right!
        test_trips.reset_index(drop=True, inplace=True)
        pred_df = model_.predict(test_trips)
        next_mode_pred = pred_df['mode_pred']
        next_purpose_pred = pred_df['purpose_pred']
        next_replaced_pred = pred_df['replaced_pred']

        # 8: Intermediate values are appended to global lists here.
        mode_pred += list(next_mode_pred)
        purpose_pred += list(next_purpose_pred)
        replaced_pred += list(next_replaced_pred)
        distance += list(test_trips.distance)
        duration += list(test_trips.duration)

        # The fold number is recorded here.
        fold_number_List += [fold_number] * len(test_trips) 
        fold_number += 1

    # 9: The SAME outer model instance is used to fit on the ENTIRE training data!
    model_.fit(temp_df)

    # 10: Now, the validation data is inferred.
    validation_data.reset_index(drop=True, inplace=True)
    pred_df_vaild = model_.predict(validation_data)
    validation_data['mode_pred'] = pred_df_vaild['mode_pred']

    # returns a dict of info here
rahulkulhalli commented 10 months ago

Basically, the mode_pred column is derived from the validation data sub-split. Ideally, every split should just be discarded as soon as the validation metrics are computed. However, the previous method treats the samples from these splits as testing data. This is why I was worried about data leakage - the model has seen the test data because we re-fit the same model on the entire temp_df (comment 9).

rahulkulhalli commented 10 months ago

@shankari I plan to resume investigating this first thing tomorrow. It's 7:10 p.m. here, so I'm clocking out for the day.

shankari commented 10 months ago

Just to clarify, I do not think that there is any model leakage at the high-level conceptual stage. However, I agree that this implementation is flawed.

temp_df, validation_data = train_test_split(modified_df, test_size=0.2, random_state=49, stratify = modified_df.mode_true)

Just curious: what is modified_df here? I don't see that in your code snippet. It doesn't really matter for your comments, but I would like to know what it was before we modified it :)

    # 7: The validation sub-split is used to generate predictions for test_trips! This is not right!

I don't think your interpretation is correct here. We are predicting for test_trips, which we get from test_trips = temp_df.iloc[test_idx], and temp_df is the non-validation (aka train + test) set. We are not touching the validation_data here.

# 4: The data is split here. Note that here, the terminology should be: train_df, test_df.

But she is using this to split into validation and non-validation. So the terminology should be train_test_df, validation_df. Note also that the validation here is not really for validation, but for evaluation of the estimated value and variance. So it should really be train_test_df, evaluation_df

# 9: The SAME outer model instance is used to fit on the ENTIRE training data!
    # 6: The outer model instance is fit to the train sub-split. 

These two are the key problems IMHO. For 9, it is not just entire training data, it is entire training + test data.
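
For concreteness, a minimal sketch (not the repo's actual fix) of how steps 6 and 9 could be restructured so that no model instance is shared across folds or carried over into the final evaluation fit. The names (model, model_params, kfolds, temp_df, validation_data) are reused from the snippet above and are assumptions here:

    fold_pred_dfs = []
    for train_idx, test_idx in kfolds.split(temp_df, temp_df['mode_true']):
        fold_model = model()                       # fresh instance per fold
        if model_params is not None:
            fold_model.set_params(model_params)
        fold_model.fit(temp_df.iloc[train_idx])
        test_trips = temp_df.iloc[test_idx].reset_index(drop=True)
        fold_pred_dfs.append(fold_model.predict(test_trips))

    # A separate, fresh instance is fit on the full train+test set and is only used to
    # generate predictions for the held-out evaluation (a.k.a. "validation") split.
    final_model = model()
    if model_params is not None:
        final_model.set_params(model_params)
    final_model.fit(temp_df)
    eval_pred_df = final_model.predict(validation_data.reset_index(drop=True))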

shankari commented 10 months ago

wrt https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1707224696

I can see the case for "A scatter plot of F1 scores w.r.t. the number of trips taken" but note that our goal here is not to have a high-performing model. Our goal is to quantify the uncertainty of the metrics (counts/distances) given a model and its error characteristics.

So while it would be cool from a purely intellectual perspective to understand the performance of the model, I think it is much more important to move on to applying the multinomial distribution method to the new model results given the known errors with the old method.

I would encourage you to NOT treat the old method as a reference/baseline - it was written by Wen in ~2 months and has not really been reviewed by anybody.

wrt https://github.com/e-mission/e-mission-docs/issues/951#issuecomment-1707252597

I would suggest prioritizing the following tasks:

  1. Check the count-based confusion matrices side-by-side: do you see any difference? Has the precision/recall changed significantly? We already know that the counts differ, but do the normalized values also differ significantly?

  2. Run Wen's code to compute the variance counts using both sets of predictions and see if there's a major difference in the numbers. Yes, we want to get a quick intuition for this before expanding to the full dataset and/or working on accuracy and/or sensitivity analyses.

rahulkulhalli commented 10 months ago

Just curious: what is modified_df here? I don't see that in your code snippet. it doesn't really matter for your comments, but I would like to know what it was before we modified it :)

It has remained unchanged, Dr. Shankari; I haven't changed the naming convention at all. I was also quite confused when I first read the code. My assumption is that the author named it modified_df to indicate that it is the data with some values stripped out, here:

# Remove the single classes from the original DataFrame
modified_df = user_df[~user_df['mode_true'].isin(single_count)]

We then extract the data points whose mode has a single count and randomly append each of them to either temp_df or validation_df with p=[0.8, 0.2] probabilities.

rahulkulhalli commented 10 months ago

I don't think your interpretation is correct here. We are predicting for test_trips, which we get from test_trips = temp_df.iloc[test_idx], and temp_df is the non-validation (aka train + test) set. We are not touching the validation_data here.

That is correct, but I'm a little perplexed w.r.t. the terminology used here. Stating the obvious, but validation data usually refers to the hold-out fold that we're allowed to peek at in order to fine-tune our model. Test data, however, is the part that we should only use once, for inference; we cannot look at the test data and tune our HPs to optimize performance, for that would be 'cheating'.

Here, the validation data plays the role of test data and the test data plays the role of validation data, which is why I feared data leakage. I think it would be easier for me to just view this as a train-test split and internally map train:temp_df and test:validation_df.

rahulkulhalli commented 10 months ago

But she is using this to split into validation and non-validation. So the terminology should be train_test_df, validation_df. Note also that the validation here is not really for validation, but for evaluation of the estimated value and variance. So it should really be train_test_df, evaluation_df

Yes, agreed. This terminology makes much more sense to me as well :)

rahulkulhalli commented 10 months ago

I would suggest prioritizing the following tasks:

Check the count-based confusion matrices side-by-side: do you see any difference? Has the precision/recall changed significantly? We already know that the counts differ, but do the normalized values also differ significantly?

Run Wen's code to compute the variance counts using both the predictions and see if there's a major difference in the numbers. Yes, we want to get a quick intuition for this before expanding to the full dataset and/or working on accuracy and/or sensitivity analyses

Noted. I shall focus on comparing the confusion matrices (normalized over the true labels) and see if the values change significantly. As recommended, I shall also keep my focus on the 9 candidates and cross-verify the results first before suggesting any more modeling and/or pipeline changes.

rahulkulhalli commented 10 months ago

Ah, I overlooked this! The reason the current method has WAY more observations in the test data than the previous implementation is that we're running a nested loop in the current implementation! We only ran the k-fold loop in the previous implementation, whereas now we run the k-fold loop for every hyper-parameter configuration!

14475/579 = 25; 46275/1851 = 25. We sampled n=25 hyper-parameters, so this confirms why we're getting such a large number of observations!

rahulkulhalli commented 10 months ago

Hmm, so what should the ideal strategy be here? Here's what I'm thinking:

  1. For every setting of the hyper-parameters, save the corresponding test predictions.
  2. Just as we chose the best hyper-parameters using argmax, we use argmax to obtain the test results for those settings and ONLY return those test results.
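
A rough sketch of what step 2 could look like, assuming the per-setting results are collected in a hypothetical list of dicts (results_per_hp) that holds each setting's mean fold F1 and the test predictions saved in step 1:

    import numpy as np

    # Hypothetical structure: results_per_hp = [{'hp': ..., 'mean_f1': ..., 'test_preds': ...}, ...]
    best_idx = int(np.argmax([r['mean_f1'] for r in results_per_hp]))
    best_hp = results_per_hp[best_idx]['hp']
    best_test_preds = results_per_hp[best_idx]['test_preds']   # ONLY these test results are returned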
shankari commented 10 months ago

Just to be extremely clear here:

but validation data refers to the fold of hold-out data that we're allowed to peek and fine-tune our model.

Don't use the variable names to map to your prior work with ML. The validation_data is not being used to fine-tune the model. Instead, it is being used to compute the expected value and the variance for that set of trips, using the confusion matrix from the train/test sets so that we can compare it to the ground truth and see if the ground truth falls within the uncertainty.

Think about our actual problem (which is not very common in ML and may require you to think more deeply than applying existing techniques), and then map it to the code. Feel free to change the variable names in your implementation.
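
For concreteness, here is one hedged sketch of how an expected value and variance could be derived from a confusion matrix for a set of predicted trips. The exact normalization and variance formula used in Wen's notebook are not shown in this thread, so treat the details below as assumptions rather than her implementation:

    import numpy as np

    def expected_count_and_variance(cm, pred_counts):
        """cm[i, j]: trips with true mode i that were predicted as mode j (from the train/test folds).
        pred_counts[j]: number of evaluation trips predicted as mode j.
        Assumes every predicted mode appears at least once in cm (zero columns would need handling)."""
        # P(true mode = i | predicted mode = j): normalize each column by its total
        p_true_given_pred = cm / cm.sum(axis=0, keepdims=True)
        # Expected number of evaluation trips whose true mode is i
        expected = p_true_given_pred @ pred_counts
        # Treating each trip's true mode as an independent draw, per-trip Bernoulli variances add up
        variance = (p_true_given_pred * (1 - p_true_given_pred)) @ pred_counts
        return expected, variance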

We then extract a series of data points that have a single count and randomly append them to either the temp_df or validation_df based on a p=[0.8,0.2] likelihood.

What is single_count and why are we stripping out entries that have it?

shankari commented 10 months ago

Just as we chose the best hyper-parameters using argmax, we use argmax to obtain the test results for those settings and ONLY return those test results.

For quick and dirty checking, I would go with this. Note again that our goal is not to build the best model. It is to estimate the uncertainty given a model and the error characteristics of the model. The model should be correct, obviously, with no leakage. But it doesn't have to be super good.

rahulkulhalli commented 10 months ago

single_count is a data frame of all the entries that only have a single class entry. The authors obtain this using value_counts() on the mode_true column and filtering the row indices with value = 1. Once we split the modified_df into the temp and validation splits, we randomly assign each entry in the single_count dataframe to either the temp or validation data with a (0.8, 0.2) probability respectively.
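
A small sketch of that filtering step, using the variable names from the earlier snippets (the exact code lives in Wen's script and is not shown here):

    mode_counts = user_df['mode_true'].value_counts()
    single_count = mode_counts[mode_counts == 1].index                    # modes that occur exactly once
    single_count_df = user_df[user_df['mode_true'].isin(single_count)]    # the rows for those modes
    modified_df = user_df[~user_df['mode_true'].isin(single_count)]       # everything else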

As for the reason why these values are stripped, I do not understand completely. It would make sense to completely remove these entries, but the author seems to have chosen to salvage them.

shankari commented 10 months ago

single_count is a data frame of all the entries that only have a single class entry. As for the reason why these values are stripped, I do not understand completely.

I think that this is to strip out custom entries (e.g. the user entered kayak manually). I don't know that it needs to be stripped, but I guess it doesn't hurt. You should check against the code in TRB_label_assist to see if this was copied over from there.

It would make sense to completely remove these entries, but the author seems to have chosen to salvage them.

The author is completely removing them user_df[~user_df['mode_true'].isin(single_count)]. But in step 2 above, we are still working with user_df, so I am not sure where modified_df is being pre-processed, for example.

rahulkulhalli commented 10 months ago

I just checked the original TRB_label_assist code here, but interestingly, I don't see the k_fold_cross_val_predict method; Wen might have added that in, then. Interestingly, the code in TRB_label_assist actually DOES initialize a new model inside the k-fold iteration. Also, there is no train-test split used before the k-fold loop.

The author is completely removing them user_df[~user_df['mode_true'].isin(single_count)]

Yes, but we do append the discarded rows back into either validation or temp here:

    for _, row in single_count_df.iterrows():
        random_set = np.random.choice(["temp", "validation"], p=[0.8, 0.2])
        if random_set == "temp":
            temp_df = temp_df.append(row)
        else:
            validation_data = validation_data.append(row)
In summary, the overall flow is:

    user_df = preprocess(user_df)
    # find single counts and remove them from user_df to create modified_df
    # temp-val split on modified_df
    # append discarded single counts to either temp or val df
shankari commented 10 months ago

If it would be easier, I would be supportive of just abandoning Wen's code and starting with Hannah's codebase to re-implement the code to compare estimated values + uncertainty against true values from scratch.

I don't think that the code was very good quality, and we might be spending more time than it is worth looking through and trying to understand her code. I do think that Hannah's code is correct and is much easier to work with as well.

@humbleOldSage is also working on that codebase, so you can have a partner who can read through the code with you as well.

rahulkulhalli commented 10 months ago

Sure. I am almost done with the current script. I figured out a better way to retrieve the test scores corresponding to the best hyper-parameter settings. I agree, however - the codebase is a little finicky to work with. I can coordinate with @humbleOldSage and go through Hannah's code instead.

rahulkulhalli commented 10 months ago

Status update: Got everything to work as expected. Currently re-training the models; 5/9 users done, 29 minutes elapsed.

Edit: Training is complete. I was reading Hannah's TRB_label_assist while the operation completed.

rahulkulhalli commented 10 months ago
Updated raw count confusion matrices ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/db348869-019f-43bc-884c-3a5501023db9) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/52daf0aa-b71e-48b2-9a9c-4c05ec78728f) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/30cd606e-caab-440a-b500-3c2bc6092655) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/91cf7d48-6875-43cc-ab77-3d99884948af) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/e06b8d88-3a08-4086-a9ee-83996791f059) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/1db0403f-774c-465a-9ecc-6dd1bb1dd937) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/881c592c-2bda-4bab-a533-4f33369d7336) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/a361e430-64cf-458c-bbca-511eec60ce92) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/b380626a-1bf9-4525-b346-4ae1b671d964)
rahulkulhalli commented 10 months ago

The test F1 has also been updated:

I know we're not supposed to compare with Wen's code, but I just wished to demonstrate the change:

Image

shankari commented 10 months ago

wait, how are you consistently getting a test F1 of close to one? That seems a bit suspicious, almost like overfitting.

rahulkulhalli commented 10 months ago

Edit: Hmm, this is very suspicious. The test F1s should NOT be this high. There's got to be an issue somewhere. I'm double-checking for errors.

rahulkulhalli commented 10 months ago

@shankari Yeah, I just typed that out too. Definitely shows signs of overfitting.

OH, I know what must've happened! My test prediction strategy is flawed. I need to correct that.

Here's what I was doing:


hp_list = list()
for hp in hp_config:                          # for every sampled hyper-parameter setting
    test_ix = list()
    for train_idx, test_idx in kfold:
        test_ix += [test_idx]
        # compute the fold F1 here
    hp_list.append({'hp': hp, 'test_index': test_ix})

# pick the best hp and its saved test indices

The inherent issue here is that the test data is being saved for later. When we create the final model and fit to temp_df, this test data is already a part of temp! So we're DEFINITELY overfitting. 😢

rahulkulhalli commented 10 months ago

Fixed. Should work properly now.

Fix: Instead of saving the test idx, make the prediction and save the prediction dataframe. That way, even when the model is discarded, we still have the test predictions.
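
Roughly speaking, the fixed loop presumably looks something like the sketch below (an assumed structure, not the committed code; hp_configs and compute_f1 are hypothetical names):

    import numpy as np

    hp_results = []
    for hp in hp_configs:                                    # hypothetical list of sampled settings
        fold_preds, fold_f1s = [], []
        for train_idx, test_idx in kfolds.split(temp_df, temp_df['mode_true']):
            fold_model = model()
            fold_model.set_params(hp)
            fold_model.fit(temp_df.iloc[train_idx])
            pred_df = fold_model.predict(temp_df.iloc[test_idx].reset_index(drop=True))
            fold_preds.append(pred_df)                       # predictions survive after the model is discarded
            fold_f1s.append(compute_f1(pred_df))             # hypothetical scoring helper
        hp_results.append({'hp': hp, 'mean_f1': np.mean(fold_f1s), 'test_preds': fold_preds})

    # the best setting's predictions are then selected with the argmax step sketched earlier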

@shankari I have left the training running while I head to classes. I will be at the University till 7:30 p.m. EST, and I shall pick up where I left off when I return.

rahulkulhalli commented 10 months ago

Uploading the new confusion matrices. The results are definitely more in line with expectations now.

Image

rahulkulhalli commented 10 months ago
Updated confusion matrices ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/f0de2603-d417-4ddd-ab08-d844b959e7d2) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/781128a0-94e2-451e-a8df-32e5c9e887bc) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/d2f91626-2873-4e59-9d8d-391b1f1fd3fc) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/8dbb2733-891c-45d2-ad05-4c023b5083ef) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/d77996b0-4b82-4bb9-9810-14bb1644cb37) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/ac4ebf22-dcbc-4db7-97de-a4a81c7194e1) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/39955b61-7fb5-402a-b015-647c26dd4e05) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/c1b97a5a-0cad-463e-b220-07e5ab08d55d) ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/73443c02-96e2-4e2a-a15d-0dccf5ade7e1)
rahulkulhalli commented 10 months ago

Also uploading the count-normalized confusion matrices. Since I'm using the inbuilt sklearn method to display confusion matrices, I don't have a lot of control over how the numbers in the CM are formatted. So instead of a 2x2 grid of CMs, I will upload 4x1 grids, each having its own collapsible view.
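
For reference, the built-in sklearn display being referred to is presumably something along these lines (the exact call in the notebook is not shown, so this is only illustrative):

    from sklearn.metrics import ConfusionMatrixDisplay
    import matplotlib.pyplot as plt

    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred,              # hypothetical arrays of true and predicted mode labels
        normalize='true',            # count-normalize over the true labels
        xticks_rotation='vertical',
    )
    plt.show()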

rahulkulhalli commented 10 months ago
0b3e78fa-91d8-4aa6-a320-3440143c8c16 ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/78f05984-5838-4d5f-b43c-491d4fe74602)

00db212b-c8d0-44cd-8392-41ab4065e603 ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/c16efb0b-1966-4033-ad25-eb9505eb1167)

1da31f30-f183-4fc5-bca5-a1bee71072bb ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/0a07defa-d8f7-4f4b-adc2-e666c27d543b)

355e25bd-fc24-4c5e-85d3-58e39432bd44 ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/465a9700-3e99-47eb-b7a4-a9eda9b76ccb)

rahulkulhalli commented 10 months ago
bf774cbe-6c30-40b0-a022-278d36a23f19 ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/423f72ae-d56c-4d69-bcc1-c2dd39ca4770)

bf776197-ee89-4183-8a0a-04c7fa7228e2 ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/ab835a97-26be-476a-9ba3-2561e199fc61)

c7ce889c-796f-4e2a-8859-fa2d7d5068fe ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/e5435207-f3a2-4cf6-a04d-a95ff56a5081)

cba570ae-38f3-41fa-a625-7342727377b7 ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/d9ede575-6492-41cb-b01d-4b1490e51362)

ece8b0a5-0953-4e98-a0d3-69f25de4a206 ![Image](https://github.com/e-mission/e-mission-docs/assets/17728123/c71f5479-970f-431f-a6fe-2df4810e0a0b)
rahulkulhalli commented 10 months ago

Will continue further investigations tomorrow morning

shankari commented 10 months ago

At first glance, I don't see large differences between the normalized CMs. Unless there are errors in computing the variances, I anticipate that you will get results that are fairly similar to Wen's. I do want to see if those results vary by input size, though

rahulkulhalli commented 10 months ago

Thank you for the feedback, Dr. Shankari. I am currently running Wen's notebook with the current results to see if the variance counts change drastically - I agree with your expectation that they won't.

I do want to see if those results vary by input size, though

@shankari Would you please expand on this? I'm unsure as to what that means.

shankari commented 10 months ago

I do want to see if those results vary by input size, though

"Is the variance count needed dependent on the number of trips?"

We have users in three "buckets" wrt number of trips. I spot checked one user in each bucket yesterday and it looks like the users with more trips have more consistency between the test and evaluation CMs. So I would expect that their variance count will be lower. That would then indicate that this method only works for users with sufficient trips.

We need to discuss whether this is "sufficient labeled trips" (bad) or "sufficient inferred trips" (good). I think it is the latter, but need somebody to think more deeply about it and potentially run some experiments :)

rahulkulhalli commented 10 months ago

Computed the variance counts for the 9 users for both the previous and current implementations:

Previous implementation:

Image

Current implementation:

Image

rahulkulhalli commented 10 months ago

Observations:

  1. There are no data points with a variance count > 2.5 in the current method, whereas the previous method has one data point with a variance count of 9.66.
  2. The x-axis shows that the accuracy spread of the current implementation is much better controlled than that of the previous implementation.

@shankari Do you have any other observations here?

shankari commented 10 months ago

@rahulkulhalli

  1. Given the issues with this implementation, I would want to first make sure that the plots are being computed correctly.
  2. As I said earlier, I would also like to see number of trips vs. variance count.

Agreed that the variance count spread is much better controlled with the new implementation. But it is still 2x variance, while 2x SD would be a better goal for a well-characterized uncertainty.

  1. Are any of the values with a variance (not variance count) < 1? In that case, SD > variance, so the variance count >> SD count (see the tiny example below).
  2. Each user has a spread of values - are the outliers the ones where we have few trips for that mode?
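
A tiny numeric illustration of the variance < 1 point, assuming "variance count" means the error divided by the variance and "SD count" means the error divided by the SD:

    error, variance = 1.0, 0.25
    sd = variance ** 0.5                   # 0.5
    print(error / variance, error / sd)    # 4.0 variance counts vs 2.0 SD counts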
rahulkulhalli commented 10 months ago

Understood. I will work with Michael to double-check the validity of the plots. That, or I could push my code to my own branch so that you may also look through the implementation.

I would also like to see number of trips vs. variance count

Yes, I am currently implementing those plots, Dr. Shankari.

Are any of the values with a variance (not variance count) < 1?

Yes, there are some values (in both the implementations) with variance < 1. I am attaching the results below. Some cells may be NaN because the specific users don't have those labels.

Previous implementation:

Image

Current implementation:

Image

Each user has a spread of values - are the outliers the ones where we have few trips for that mode?

Yes, I just cross-checked. However, it isn't necessarily the mode with the FEWEST number of trips.

rahulkulhalli commented 10 months ago

Adding on to the last question - most of the outlier modes are the same across both the implementations except for two users. For user cba570ae-38f3-41fa-a625-7342727377b7, the outlier mode for the previous implementation is that of tow_truck, with a single trip instance. However, for the current implementation, the mode now becomes Walk, with 8 trip instances. For user c7ce889c-796f-4e2a-8859-fa2d7d5068fe, the previous implementation's outlier mode is paddle_board, with a single trip instance. In the new implementation, the outlier mode is i_dont_know, with 4 trip instances.

shankari commented 10 months ago

I will work with Michael to double-check the validity of the plots.

I don't think Michael was involved in creating those plots, but sure, you could ask him to review them.

I am attaching the results below. Some cells may be NaN because the specific users don't have those labels.

What do those results show? What is the number in each cell?

For user cba570ae-38f3-41fa-a625-7342727377b7, the outlier mode for the previous implementation is that of tow_truck, with a single trip instance. However, for the current implementation, the mode now becomes Walk, with 8 trip instances.

I don't think that comparisons to the previous implementation are super relevant any more. What we really need to do is to understand the characteristics of the current implementation. For example, for walk with 8 trip instances, what is the actual variance, what is the SD, what is the total number of trips...

This is where the exploratory analysis starts - I would like you to explore this, understand the reasons why the differences fall within a certain number of variance counts, and bring your findings back to me.