e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

⚗️inferred sections may not be linked to their corresponding trips #927

Open shankari opened 1 year ago

shankari commented 1 year ago

While testing the deployment of https://github.com/e-mission/e-mission-server/pull/917 to stage, I found that the inferred section summaries were filled in but the cleaned section summaries were not.

'cleaned_trip': ObjectId('6498f494a9956a43810c2caf'), 'inferred_section_summary': {'distance': {}, 'duration': {}, 'count': {}}, 'cleaned_section_summary': {'distance': {'ON_FOOT': 407.50089074389757}, 'duration': {'ON_FOOT': 714.4223742485046}, 'count': {'ON_FOOT': 1}}

That seemed weird, since there did not appear to be any errors while generating the mode inference.

I then looked for the matching sections and there is indeed one cleaned section and no inferred sections.

>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/inferred_section", "data.trip_id": boi.ObjectId("6498f494a9956a43810c2caf")})
0
>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/cleaned_section", "data.trip_id": boi.ObjectId("6498f494a9956a43810c2caf")})
1

However, if I try to find one inferred section, and

>>> edb.get_analysis_timeseries_db().find_one({"metadata.key": "analysis/inferred_section"})
{'_id': ObjectId('645cc2b2e73ed3debd231027'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df025ea199f1d0473bb4b'),

and find the set of sections for that trip, I find

>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/inferred_section", "data.trip_id": boi.ObjectId("644df025ea199f1d0473bb4b")})
807

which looks wrong.

We may be messing up the inferred section to trip mappings. Need to investigate further.
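
One way to look for broken section-to-trip mappings is to count inferred sections per trip and flag trips with implausibly many. The sketch below is stdlib-only and works on toy dictionaries shaped like the queried documents (`metadata.key`, `data.trip_id`). The function names and the `max_sections` threshold are hypothetical and not part of the e-mission codebase.

```python
from collections import Counter

def count_sections_per_trip(entries, key="analysis/inferred_section"):
    """Count entries with the given metadata key for each data.trip_id."""
    counts = Counter()
    for entry in entries:
        if entry["metadata"]["key"] == key:
            counts[entry["data"]["trip_id"]] += 1
    return counts

def suspicious_trips(counts, max_sections=20):
    """Return trips whose section count exceeds an assumed, arbitrary threshold."""
    return {trip_id: n for trip_id, n in counts.items() if n > max_sections}

# Toy entries standing in for documents from the analysis timeseries.
entries = (
    [{"metadata": {"key": "analysis/inferred_section"}, "data": {"trip_id": "trip_a"}}] * 807
    + [{"metadata": {"key": "analysis/inferred_section"}, "data": {"trip_id": "trip_b"}}]
    + [{"metadata": {"key": "analysis/cleaned_section"}, "data": {"trip_id": "trip_c"}}]
)
counts = count_sections_per_trip(entries)
flagged = suspicious_trips(counts)
print(flagged)
```

Against a real database, the `entries` list would presumably come from a `find` on the analysis timeseries, as in the queries above.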

shankari commented 1 year ago

Ah ok, so we deployed from master instead of GIS. Still doesn't help us figure out if the previous entry was incorrect or not, but let's recreate for now

2023-06-26 02:14:49,934:DEBUG:140505714460480:orig_ts_db_matches = 0, analysis_ts_db_matches = 5
2023-06-26 02:14:49,959:DEBUG:140505714460480:Returning entry with length 5 result
2023-06-26 02:14:49,960:ERROR:140505714460480:Error while inferring modes, timestamp is unchanged
Traceback (most recent call last):
File "/usr/src/app/emission/analysis/classification/inference/mode/pipeline.py", line 41, in predict_mode
mip.runPredictionPipeline(user_id, time_query)
File "/usr/src/app/emission/analysis/classification/inference/mode/pipeline.py", line 139, in runPredictionPipeline
self.loadModelStage()
File "/usr/src/app/emission/analysis/classification/inference/mode/pipeline.py", line 156, in loadModelStage
self.model = seedp.ModeInferencePipelineMovesFormat.loadModel()
File "/usr/src/app/emission/analysis/classification/inference/mode/seed/pipeline.py", line 93, in loadModel
fd = open(SAVED_MODEL_FILENAME, "r")
FileNotFoundError: [Errno 2] No such file or directory: 'seed_model.json'
shankari commented 1 year ago

I still don't understand the "807" matches for that one trip, but let's investigate that later

shankari commented 1 year ago

This also brings up another issue. In the pipeline in general, we don't create objects in later stages if an earlier stage fails. The confirmed objects are an exception, because we try to use the inferred sections but then fall back to cleaned sections if they exist. But then once the pipeline is fixed, and the inferred sections are in fact filled in, we will not (with the current implementation) go back and fill in the confirmed trip.

Need to figure out how to fix that.
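
As a starting point, the confirmed trips that took the fallback path could be identified by comparing the two summaries. This is a minimal sketch, assuming the summaries have the `distance`/`duration`/`count` shape shown earlier in this issue. `needs_inferred_backfill` is a hypothetical helper, not an existing function.

```python
SUMMARY_FIELDS = ("distance", "duration", "count")

def needs_inferred_backfill(confirmed_trip_data):
    """True if the inferred summary is empty but the cleaned summary is filled in."""
    inferred = confirmed_trip_data.get("inferred_section_summary", {})
    cleaned = confirmed_trip_data.get("cleaned_section_summary", {})
    inferred_empty = not any(inferred.get(field) for field in SUMMARY_FIELDS)
    cleaned_filled = any(cleaned.get(field) for field in SUMMARY_FIELDS)
    return inferred_empty and cleaned_filled

# Values copied from the trip reported at the top of this issue.
fallback_trip = {
    "inferred_section_summary": {"distance": {}, "duration": {}, "count": {}},
    "cleaned_section_summary": {
        "distance": {"ON_FOOT": 407.50089074389757},
        "duration": {"ON_FOOT": 714.4223742485046},
        "count": {"ON_FOOT": 1},
    },
}
print(needs_inferred_backfill(fallback_trip))
```

Trips flagged this way would be candidates for re-running the confirmed trip creation once the inferred sections exist. How that re-run should be triggered is still the open question.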

shankari commented 1 year ago

For now, reset the pipeline for this single user. After redeploying from the GIS branch, everything works

'inferred_section_summary': {'distance': {'WALKING': 403.4288117669756}, 'duration': {'WALKING': 641.4911608695984}, 'count': {'WALKING': 1}},
'cleaned_section_summary': {'distance': {'ON_FOOT': 403.4288117669756}, 'duration': {'ON_FOOT': 641.4911608695984}, 'count': {'ON_FOOT': 1}}

Launching the migration script and calling it a day for now

shankari commented 1 year ago

I have stashed changes and moved the following files

    bin/debug/delete_composite_objects_and_state.py
    emission/storage/json_wrappers.py
shankari commented 1 year ago

Launching the migration script and calling it a day for now

Ran into several errors during the migration

Found error ObjectId('62d16908048ceaa2b4ee5769') while processing pipeline for user 0951d756-b17a-4a33-aa90-a44830928a9e, check log files for details
Found error ObjectId('634854f22c03517c520aaff0') while processing pipeline for user af046be2-3593-42b3-b2e5-ea4ba8287bbf, check log files for details
Found error ObjectId('62e8876c9f3797eb9d1c0b31') while processing pipeline for user b74def1e-a134-4c81-bffc-616c8a72639d, check log files for details

The error is apparently that there is no matching composite trip for a particular confirmed trip.

ERROR:root:Found error ObjectId('62daf6770fd38d61e7e9a884') while processing pipeline for user fdcb2d34-c8e8-4d5a-8d98-e0e2d397e490, skipping
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/nrel-db-connect/bin/historical/migrations/add_sections_and_summaries_to_trips.py", line 27, in add_sections_to_trips
    add_sections_to_trips_for_user(uuid)
  File "/Users/kshankar/e-mission/nrel-db-connect/bin/historical/migrations/add_sections_and_summaries_to_trips.py", line 45, in add_sections_to_trips_for_user
    matching_composite_trip = composite_trips_map[t["_id"]]
KeyError: ObjectId('62daf6770fd38d61e7e9a884')

Need to see where we broke those links, likely during an earlier migration.
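
To keep one missing link from aborting a whole user, the lookup could tolerate absent composite trips and report them instead. This is only a sketch: it assumes `composite_trips_map` is a plain dict keyed by the confirmed trip `_id`, as the traceback suggests. `split_by_composite_match` is a hypothetical helper and not part of the migration script.

```python
import logging

def split_by_composite_match(confirmed_trips, composite_trips_map):
    """Pair each confirmed trip with its composite trip, collecting ids with no match."""
    matched, missing = [], []
    for trip in confirmed_trips:
        composite = composite_trips_map.get(trip["_id"])
        if composite is None:
            logging.warning("No composite trip for confirmed trip %s, skipping", trip["_id"])
            missing.append(trip["_id"])
        else:
            matched.append((trip, composite))
    return matched, missing

# Toy data: the second confirmed trip has no composite counterpart.
confirmed = [{"_id": "confirmed_1"}, {"_id": "confirmed_2"}]
composite_map = {"confirmed_1": {"_id": "composite_1"}}
matched, missing = split_by_composite_match(confirmed, composite_map)
print(missing)
```

The `missing` list then doubles as a record of which links are broken, which may help when tracing them back to an earlier migration.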

shankari commented 1 year ago

Double checking this before I start the giant back-to-back meetings. All these errors are because there were confirmed trips but no composite trips

DEBUG:root:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID('af046be2-3593-42b3-b2e5-ea4ba8287bbf'), '$or': [{'metadata.key': 'analysis/cleaned_trip'}]}, sort_key = metadata.write_ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/cleaned_trip']
DEBUG:root:finished querying values for ['analysis/cleaned_trip'], count = 2
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 2
DEBUG:root:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID('af046be2-3593-42b3-b2e5-ea4ba8287bbf'), '$or': [{'metadata.key': 'analysis/confirmed_trip'}]}, sort_key = metadata.write_ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/confirmed_trip']
DEBUG:root:finished querying values for ['analysis/confirmed_trip'], count = 2
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 2
DEBUG:root:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID('af046be2-3593-42b3-b2e5-ea4ba8287bbf'), '$or': [{'metadata.key': 'analysis/composite_trip'}]}, sort_key = metadata.write_ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/composite_trip']
DEBUG:root:finished querying values for ['analysis/composite_trip'], count = 0
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 0

ERROR:root:Found error ObjectId('634854f22c03517c520aaff0') while processing pipeline for user af046be2-3593-42b3-b2e5-ea4ba8287bbf, skipping
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/nrel-db-connect/bin/historical/migrations/add_sections_and_summaries_to_trips.py", line 27, in add_sections_to_trips
    add_sections_to_trips_for_user(uuid)
  File "/Users/kshankar/e-mission/nrel-db-connect/bin/historical/migrations/add_sections_and_summaries_to_trips.py", line 45, in add_sections_to_trips_for_user
    matching_composite_trip = composite_trips_map[t["_id"]]
KeyError: ObjectId('634854f22c03517c520aaff0')
  1. This is unlikely to be perceived as broken, since the composite trips would never have been visible anyway
  2. Let's briefly look at the logs to figure out why this is happening
shankari commented 1 year ago

Ok so this is happening because of an error with the backwards compat migration code in the composite trips. I noticed this in the error even before the migration. I'm pulling the stage data so we can investigate and fix this as part of pipeline fixes

2023-06-26 17:10:31,050:ERROR:140368867940160:Error while creating composite objects, timestamp is unchanged
Traceback (most recent call last):
File "/usr/src/app/emission/analysis/plotting/composite_trip_creation.py", line 138, in create_composite_objects
last_done_ts = create_composite_trip(ts, t)
File "/usr/src/app/emission/analysis/plotting/composite_trip_creation.py", line 62, in create_composite_trip
assert next_trip is not None and next_trip["metadata"]["key"] == "analysis/confirmed_trip" \
AssertionError: for 634854f22c03517c520aaff0 found existing_end_confirmed_place={'_id': ObjectId('641ba62b34dda48794b6be65'), 'user_id': UUID('af046be2-3593-42b3-b2e5-ea4ba8287bbf'), 'metadata': {'key': 'analysis/confirmed_place', 'platform': 'server', 'write_ts': 1665684710.845848, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2022, 'month': 10, 'day': 13, 'hour': 11, 'minute': 11, 'second': 50, 'weekday': 3, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2022-10-13T11:11:50.845848-07:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'enter_ts': 1665173762.129, 'enter_local_dt': {'year': 2022, 'month': 10, 'day': 7, 'hour': 13, 'minute': 16, 'second': 2, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'enter_fmt_time': '2022-10-07T13:16:02.129000-07:00', 'location': {'type': 'Point', 'coordinates': [-104.9618674, 39.6717046]}, 'raw_places': [ObjectId('634854df2c03517c520aafad'), ObjectId('634854df2c03517c520aafad')], 'ending_trip': ObjectId('634854e62c03517c520aafb9'), 'starting_trip': ObjectId('634854e62c03517c520aafcb'), 'exit_ts': 1665173767.129, 'exit_fmt_time': '2022-10-07T13:16:02.129000-07:00', 'exit_local_dt': {'year': 2022, 'month': 10, 'day': 7, 'hour': 13, 'minute': 16, 'second': 2, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'duration': 5.0, 'cleaned_place': ObjectId('634854e62c03517c520aafdd'), 'user_input': {}, 'additions': []}} but next_trip=None
shankari commented 11 months ago

While experimenting with trip time calculation, ran into this again, and it again had exactly 807 matches. https://github.com/e-mission/op-admin-dashboard/pull/61#issuecomment-1784145680

So this means that either I did not reset the pipeline for this user on staging, or there is something super weird about this user that ends up with 807 matches. https://github.com/e-mission/e-mission-docs/issues/927#issuecomment-1606674206 It seems likely to be an overpass error, because the odds that it would fail again with exactly 807 errors are incredibly small.

>>> all_807_matches = list(edb.get_analysis_timeseries_db().find({'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644df8edea199f1d0473e301')}))
>>> all_807_matches_df = pd.json_normalize(all_807_matches)
>>> len(all_807_matches_df)
807
>>> all_807_matches_df["data.cleaned_section"].unique()
array([ObjectId('644df8edea199f1d0473e301')], dtype=object)
>>> all_807_matches_df["data.sensed_mode"].unique()
array([2])

When were these entries created?

>>> all_807_matches_df["metadata.write_fmt_time"].head()
0    2023-04-29T22:13:17.690103-07:00
1    2023-04-29T22:13:17.690103-07:00
2    2023-04-29T22:13:17.690103-07:00
3    2023-04-29T22:13:17.690103-07:00
4    2023-04-29T22:13:17.690103-07:00
...
>>> all_807_matches_df["metadata.write_fmt_time"].tail()
802    2023-04-29T22:13:17.690103-07:00
803    2023-04-29T22:13:17.690103-07:00
804    2023-04-29T22:13:17.690103-07:00
805    2023-04-29T22:13:17.690103-07:00
806    2023-04-29T22:13:17.690103-07:00

At exactly the same time (with the same exact timestamp), way before this was supposedly reset (on Jun 25)

Resetting and re-running the pipeline for this user in the copied staging. We may also want to replace that assertion with a check that all the entries are the same, e.g. seeing if unique has only one set of values, at least on production.

>>> all_807_matches_df["metadata.write_ts"].unique()
array([1.6828316e+09])
>>> all_807_matches[0]["metadata"]["write_ts"]
1682831597.6901028
>>> all_807_matches[-1]["metadata"]["write_ts"]
1682831597.6901028
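
The relaxed check suggested above (accept multiple matches as long as they are identical) could look roughly like this. It is a minimal sketch on toy dictionaries. `all_entries_identical` is a hypothetical helper, and ignoring only `_id` is an assumption about which fields are allowed to differ.

```python
def all_entries_identical(entries, ignore_keys=("_id",)):
    """Return True if all entries match once the ignored top-level keys are dropped."""
    def strip(entry):
        return {k: v for k, v in entry.items() if k not in ignore_keys}

    if not entries:
        return True
    first = strip(entries[0])
    return all(strip(entry) == first for entry in entries[1:])

def make_entry(entry_id, sensed_mode=2):
    """Build a toy inferred section; values mirror the REPL output in this comment."""
    return {
        "_id": entry_id,
        "metadata": {"key": "analysis/inferred_section", "write_ts": 1682831597.6901028},
        "data": {"cleaned_section": "644df8edea199f1d0473e301", "sensed_mode": sensed_mode},
    }

duplicates = [make_entry(i) for i in range(807)]
mixed = duplicates + [make_entry(807, sensed_mode=5)]
print(all_entries_identical(duplicates), all_entries_identical(mixed))
```

If the entries are identical, the caller could safely use the first one. If they differ, failing loudly still seems like the right behavior.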
shankari commented 11 months ago

Ran into this for another user

>>> all_4_entries = pd.json_normalize(list(edb.get_analysis_timeseries_db().find({'metadata.key': 'analysis/inferred_section', 'user_id': UUID('fdcb2d34-c8e8-4d5a-8d98-e0e2d397e490'), 'data.cleaned_section': ObjectId('6313b50f0b4e353e781a95a4')})))
>>> all_4_entries["data.cleaned_section"]
0    6313b50f0b4e353e781a95a4
1    6313b50f0b4e353e781a95a4
2    6313b50f0b4e353e781a95a4
3    6313b50f0b4e353e781a95a4
Name: data.cleaned_section, dtype: object
>>> all_4_entries["data.cleaned_section"].unique()
array([ObjectId('6313b50f0b4e353e781a95a4')], dtype=object)
>>> all_4_entries["metadata.write_fmt_time"]
0    2022-09-03T13:11:59.935477-07:00
1    2022-09-03T13:11:59.935477-07:00
2    2022-09-03T13:11:59.935477-07:00
3    2022-09-03T13:11:59.935477-07:00
Name: metadata.write_fmt_time, dtype: object
>>> all_4_entries["data.start_fmt_time"]
0    2022-09-03T10:33:00.996521-07:00
1    2022-09-03T10:33:00.996521-07:00
2    2022-09-03T10:33:00.996521-07:00
3    2022-09-03T10:33:00.996521-07:00
Name: data.start_fmt_time, dtype: object
>>> all_4_entries["data.sensed_mode"]
0    5
1    5
2    5
3    5
Name: data.sensed_mode, dtype: int64
>>>
achasmita commented 11 months ago

While experimenting with trip time calculation, ran into this again, and it again had exactly 807 matches. e-mission/op-admin-dashboard#61 (comment)

Resetting and re-running the pipeline for this user in the copied staging. We may also want to replace that assertion with a check that all the entries are the same, e.g. seeing if unique has only one set of values, at least on production.

>>> all_807_matches_df["metadata.write_ts"].unique()
array([1.6828316e+09])
>>> all_807_matches[0]["metadata"]["write_ts"]
1682831597.6901028
>>> all_807_matches[-1]["metadata"]["write_ts"]
1682831597.6901028
There are a lot of 807 matches for this user:

```
Data.cleaned_section 644dec45ea199f1d0473a2aa, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dec55ea199f1d0473a31d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dec65ea199f1d0473a33a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dec73ea199f1d0473a357, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dec82ea199f1d0473a38b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dec91ea199f1d0473a3d5, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deca0ea199f1d0473a3fc, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decacea199f1d0473a40b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decb4ea199f1d0473a455, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decbcea199f1d0473a4bd, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decc5ea199f1d0473a4d8, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decd0ea199f1d0473a51a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decdfea199f1d0473a553, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dece8ea199f1d0473a566, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decefea199f1d0473a583, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644decf8ea199f1d0473a636, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded04ea199f1d0473a6c3, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded11ea199f1d0473a71c, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded19ea199f1d0473a728, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded21ea199f1d0473a72b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded29ea199f1d0473a753, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded34ea199f1d0473a756, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded44ea199f1d0473a788, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded53ea199f1d0473a7f6, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded5fea199f1d0473a821, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded6aea199f1d0473a84a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded78ea199f1d0473a8b5, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded87ea199f1d0473a8dd, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded93ea199f1d0473a8e3, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644ded9aea199f1d0473a9b4, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deda2ea199f1d0473a9c4, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deda9ea199f1d0473a9f6, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dedb3ea199f1d0473aa09, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dedbeea199f1d0473aa29, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dedc9ea199f1d0473aa3b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dedd9ea199f1d0473aa84, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dede9ea199f1d0473aae0, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dedf6ea199f1d0473aafb, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dedfeea199f1d0473ab10, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee09ea199f1d0473ab14, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee14ea199f1d0473ab21, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee20ea199f1d0473ab3b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee30ea199f1d0473ab44, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee3bea199f1d0473ab84, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee43ea199f1d0473abb0, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee4dea199f1d0473abd0, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee58ea199f1d0473abdc, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee5fea199f1d0473abff, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee6aea199f1d0473ac1a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee79ea199f1d0473ac42, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee85ea199f1d0473ac65, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee90ea199f1d0473ac83, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644dee9eea199f1d0473ac8c, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deea7ea199f1d0473acbe, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deeaeea199f1d0473acd2, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deeb6ea199f1d0473acec, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deec0ea199f1d0473ad09, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deecbea199f1d0473adf3, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deed7ea199f1d0473ae17, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deee0ea199f1d0473ae2d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deee8ea199f1d0473af74, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deef3ea199f1d0473af7d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def04ea199f1d0473b005, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def0cea199f1d0473b16a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def13ea199f1d0473b179, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def1cea199f1d0473b18f, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def26ea199f1d0473b23a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def2dea199f1d0473b260, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def35ea199f1d0473b271, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def3fea199f1d0473b27b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def4bea199f1d0473b2aa, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def55ea199f1d0473b438, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def5fea199f1d0473b476, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def69ea199f1d0473b4c7, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def75ea199f1d0473b4d6, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def80ea199f1d0473b5a2, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def8eea199f1d0473b671, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644def9aea199f1d0473b926, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644defa4ea199f1d0473b946, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644defaeea199f1d0473b94d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644defbbea199f1d0473b95d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644defcaea199f1d0473b974, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644defd8ea199f1d0473ba5e, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644defe4ea199f1d0473ba6f, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644deff0ea199f1d0473ba96, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644defffea199f1d0473bad6, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df00bea199f1d0473bb06, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df017ea199f1d0473bb38, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df024ea199f1d0473bb41, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df034ea199f1d0473bb4d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df042ea199f1d0473bceb, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df050ea199f1d0473bd04, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df058ea199f1d0473bd97, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df062ea199f1d0473bdd5, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df07fea199f1d0473bde0, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df08bea199f1d0473be32, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df092ea199f1d0473bedb, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df09bea199f1d0473bede, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0a3ea199f1d0473bfb7, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0acea199f1d0473c085, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0c8ea199f1d0473c301, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0d4ea199f1d0473c35b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0e0ea199f1d0473c380, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0ebea199f1d0473c393, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0f6ea199f1d0473c3af, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df0fdea199f1d0473c3c7, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df105ea199f1d0473c3e4, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df10fea199f1d0473c40c, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df11aea199f1d0473c423, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df124ea199f1d0473c443, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df12fea199f1d0473c446, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df13dea199f1d0473c459, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df148ea199f1d0473c4a8, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df153ea199f1d0473c4ba, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df161ea199f1d0473c4c5, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df16dea199f1d0473c4d6, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df179ea199f1d0473c500, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df187ea199f1d0473c519, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df195ea199f1d0473c52d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df1a3ea199f1d0473c54e, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df1b0ea199f1d0473c574, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df1beea199f1d0473c58e, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df1c9ea199f1d0473c5a3, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df1d4ea199f1d0473c5bf, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df1e2ea199f1d0473c5ca, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df1f2ea199f1d0473c5ec, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df200ea199f1d0473c616, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df20bea199f1d0473c63d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df216ea199f1d0473c668, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df220ea199f1d0473c671, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df228ea199f1d0473c67b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df235ea199f1d0473c6a2, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df243ea199f1d0473c6ab, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df251ea199f1d0473c798, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df260ea199f1d0473c7b8, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df26eea199f1d0473c7d7, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df278ea199f1d0473c812, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df27fea199f1d0473c819, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df287ea199f1d0473c823, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df28fea199f1d0473c837, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df296ea199f1d0473c85e, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2a1ea199f1d0473c876, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2baea199f1d0473c8c6, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2c2ea199f1d0473c8ed, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2caea199f1d0473c900, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2d1ea199f1d0473c985, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2dcea199f1d0473c988, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2e7ea199f1d0473c991, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2f2ea199f1d0473c9c0, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df2fdea199f1d0473c9d5, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df308ea199f1d0473ca1d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df317ea199f1d0473ca3b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df329ea199f1d0473caac, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df337ea199f1d0473cb6b, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df345ea199f1d0473cbdd, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df352ea199f1d0473cc2a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df35fea199f1d0473cd7d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df367ea199f1d0473cd93, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df372ea199f1d0473cda3, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df380ea199f1d0473ce21, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df390ea199f1d0473ce2e, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df39eea199f1d0473cea4, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df3adea199f1d0473cec6, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df3bbea199f1d0473cf3d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df3c6ea199f1d0473cf56, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df3d1ea199f1d0473cf74, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df3ddea199f1d0473cf7a, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df3e8ea199f1d0473cfa3, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df3f9ea199f1d0473cff4, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df408ea199f1d0473d01d, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df418ea199f1d0473d0a9, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df428ea199f1d0473d0c9, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df437ea199f1d0473d0ec, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df446ea199f1d0473d105, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df455ea199f1d0473d1f3, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df461ea199f1d0473d213, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df46dea199f1d0473d242, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df479ea199f1d0473d251, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df481ea199f1d0473d26f, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df48aea199f1d0473d274, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df493ea199f1d0473d27c, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df49aea199f1d0473d28f, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df4a4ea199f1d0473d2b1, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df4b0ea199f1d0473d2be, user_id d83a43a1-df6b-42ed-986f-f5b5f6150221: Count807
Data.cleaned_section 644df4bcea199f1
```
shankari commented 11 months ago

There are a lot of 807 matches for this user:

I am 99% sure this is not correct. When I poked through the data, the 807 entries were definitely an outlier. Please double-check, and document the result along with the script used for checking it.

And regardless, there is a workaround for getting rid of the 807 entries, so I am not sure why this is relevant.

achasmita commented 11 months ago

I was using this code:

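>>> import pandas as pd
>>> from uuid import UUID
>>> from bson.objectid import ObjectId
>>> import emission.core.get_database as edb
>>> # Load every inferred section into a flat DataFrame
>>> df = pd.json_normalize(list(edb.get_analysis_timeseries_db().find({'metadata.key': 'analysis/inferred_section'})))
>>> # Count inferred sections per (cleaned section, user) and keep the pairs with more than one
>>> counts = df.groupby(['data.cleaned_section', 'user_id']).size().reset_index(name='count')
>>> dup_counts = counts[counts['count'] > 1]
>>> print('\n'.join([f"Data.cleaned_section {row['data.cleaned_section']},  user_id {row['user_id']}: Count{row['count']}" for _, row in dup_counts.iterrows()]))
>>> # Pull all inferred sections for one of the duplicated cleaned sections
>>> all_807_entries = pd.json_normalize(list(edb.get_analysis_timeseries_db().find({'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644dec55ea199f1d0473a31d')})))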
>>> len(all_807_entries)
807
>>> all_807_entries["data.cleaned_section"].unique()
array([ObjectId('644dec55ea199f1d0473a31d')], dtype=object)
>>> all_807_entries["metadata.write_fmt_time"].head()
0    2023-04-29T21:19:33.405336-07:00
1    2023-04-29T21:19:33.405336-07:00
2    2023-04-29T21:19:33.405336-07:00
3    2023-04-29T21:19:33.405336-07:00
4    2023-04-29T21:19:33.405336-07:00
Name: metadata.write_fmt_time, dtype: object
>>> all_807_entries["metadata.write_fmt_time"].tail()
802    2023-04-29T21:19:33.405336-07:00
803    2023-04-29T21:19:33.405336-07:00
804    2023-04-29T21:19:33.405336-07:00
805    2023-04-29T21:19:33.405336-07:00
806    2023-04-29T21:19:33.405336-07:00
Name: metadata.write_fmt_time, dtype: object 
>>> all_807_entries = pd.json_normalize(list(edb.get_analysis_timeseries_db().find({'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644dec45ea199f1d0473a2aa')})))
>>> all_807_entries["data.cleaned_section"].unique()
array([ObjectId('644dec45ea199f1d0473a2aa')], dtype=object)
>>> all_807_entries["metadata.write_fmt_time"].head()
0    2023-04-29T21:19:17.707970-07:00
1    2023-04-29T21:19:17.707970-07:00
2    2023-04-29T21:19:17.707970-07:00
3    2023-04-29T21:19:17.707970-07:00
4    2023-04-29T21:19:17.707970-07:00
Name: metadata.write_fmt_time, dtype: object
>>> all_807_entries["metadata.write_fmt_time"].tail()
802    2023-04-29T21:19:17.707970-07:00
803    2023-04-29T21:19:17.707970-07:00
804    2023-04-29T21:19:17.707970-07:00
805    2023-04-29T21:19:17.707970-07:00
806    2023-04-29T21:19:17.707970-07:00
Name: metadata.write_fmt_time, dtype: object
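
The same check can be sketched without a database connection. This is a minimal sketch on synthetic rows, not the script that produced the output above: the helper name `find_duplicate_inferred_sections`, the placeholder ids (`user-a`, `sec-1`, ...) and the `distinct_write_times` column are assumptions added for illustration. It counts inferred sections per (cleaned section, user) pair and also reports how many distinct `metadata.write_fmt_time` values each group has, since in the session above every entry in a group carried the same write time.

```python
import pandas as pd


def find_duplicate_inferred_sections(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (cleaned_section, user_id) with more than one inferred section.

    Hypothetical helper: expects the flattened column names that
    pd.json_normalize produces for inferred-section entries.
    """
    counts = (
        df.groupby(["data.cleaned_section", "user_id"])
        .agg(
            # number of inferred sections pointing at this cleaned section
            count=("metadata.write_fmt_time", "size"),
            # 1 means every entry in the group shares a single write time
            distinct_write_times=("metadata.write_fmt_time", "nunique"),
        )
        .reset_index()
    )
    return counts[counts["count"] > 1].reset_index(drop=True)


# Synthetic stand-in for the real collection (ids are placeholders, not real data)
rows = (
    # three entries for one cleaned section, all with the same write time
    [{"user_id": "user-a", "data.cleaned_section": "sec-1",
      "metadata.write_fmt_time": "2023-04-29T21:19:33"}] * 3
    # a single, non-duplicated entry
    + [{"user_id": "user-a", "data.cleaned_section": "sec-2",
        "metadata.write_fmt_time": "2023-04-29T21:20:00"}]
    # two entries for one cleaned section, written at different times
    + [{"user_id": "user-b", "data.cleaned_section": "sec-3",
        "metadata.write_fmt_time": t}
       for t in ("2023-05-01T08:00:00", "2023-05-02T08:00:00")]
)
df = pd.DataFrame(rows)

dups = find_duplicate_inferred_sections(df)
print(dups)
```

A group with `distinct_write_times == 1` has all its entries sharing one write time, as in the two groups inspected above, while a larger value means the entries were written at different times.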
achasmita commented 10 months ago

While experimenting with trip time calculation, I ran into this again, and it again had exactly 807 matches.

I was getting the same error while checking the trip time calculation, which is why I was looking into it.

shankari commented 10 months ago

Correct, but as I said in https://github.com/e-mission/e-mission-docs/issues/927#issuecomment-1804395333:

And regardless, there is a workaround for getting rid of the 807 entries, so I am not sure why this is relevant.

You should use the workaround and move on.