Get the analysed view to work with files as well

MobilityNet / mobilitynet.github.io

BSD 3-Clause "New" or "Revised" License

0 stars 3 forks source link

Get the analysed view to work with files as well #31

Closed shankari closed 1 year ago

shankari commented 1 year ago

Right now, the raw data can be read from a ServerSpec or a FileSpec. It would help to be able to read the analysed data also from a FileSpec.

People have been dealing with this through pickling, but this is sub-optimal because: 1) pickling is a binary format 2) pickling format can change 3) reading from the pickled values requires importing pymongo

Instead, let's store the results in files as well.

The analysis results are currently read as follows:

we read the raw results
we convert it to an analysed view by copying it and then reading the additional analysed entries
the signature of the analysed view is create_analysed_view(input_view, analysis_datastore, location_key, trip_key, section_key)

when we try to run it with

av_la = eapv.create_analysed_view(pv_la, "http://localhost:8080", "analysis/recreated_location", "analysis/cleaned_trip", "analysis/inferred_section")

we get the error

not found: data_file='$PWD/http://localhost:8080/ucb-sdb-android-1/unimodal_trip_car_bike_mtv_la/analysis~recreated_location/1563606000_1670460462.json'

This is because we set the analysis datastore to the input spec

asd.DATASTORE_LOC = analysis_datastore

but we now have a mismatch between the spec, which is a filespec, and the analysis_datastore, which is a serverspec.

Our other requirement is that we need to support multiple possible analysed views for various versions of the algorithms - so one for master and one for gis_branch for example.

We need to refactor the analysis view code as follows:

if we have a server spec, then we potentially want to have separate URLs for the two branches since we can have multiple database containers running at the same time
if we have a file spec, then we want to have separate file paths for the branches, since we want to have the data stored at the same time

This seems to suggest the following changes:

change the signature to create_analysed_view(input_view, analysis_spec, location_key, trip_key, section_key)
change the bin/dump_data_to_file.py to also retrieve and dump analysed data

shankari commented 1 year ago

Note also that during the original implementation of create_analysed_view, we had the following:

# The datastreams API call filters by "metadata.write_ts"
# Unfortunately, this means that we can't use it to retrieve analysed results since the write_ts depends on when the pipeline was run

However, we now do support reading by data.start_ts (as part of the label screen changes). So we might be able to simplify this, but might also just run out of time.

shankari commented 1 year ago

Ok, so we currently dump top level keys and range level keys for the raw data

        for phone_os, phone_map in pv.map().items():
            for phone_label, phone_detail_map in phone_map.items():
                for key in [k for k in phone_detail_map.keys() if "/" in k]:
                    print(f"Dumping top level key {key}")

and

                for ranges in [phone_detail_map["evaluation_ranges"], phone_detail_map["calibration_ranges"]]:
                    for r in ranges:
                        for key in [k for k in r.keys() if "/" in k]:
                            print(f"Dumping key {key} for range with keys {r.keys()} and phone {phone_label}")

The top level keys are essentially only manual/evaluation_transition

$ grep "Dumping top level" /tmp/download.logs
Dumping top level key manual/evaluation_transition
Dumping top level key manual/evaluation_transition
Dumping top level key manual/evaluation_transition
Dumping top level key manual/evaluation_transition
Dumping top level key manual/evaluation_transition
Dumping top level key manual/evaluation_transition
Dumping top level key manual/evaluation_transition
Dumping top level key manual/evaluation_transition

The range level keys are

Dumping key background/location for range with keys dict_keys(['trip_id', 'trip_id_base', 'trip_run', 'start_ts', 'end_ts', 'duration', 'eval_common_trip_id', 'eval_role', 'eval_role_base', 'eval_role_run', 'evaluation_trip_ranges', 'background/battery', 'battery_df', 'background/location', 'background/filtered_location', 'location_df', 'filtered_location_df', 'background/motion_activity', 'motion_activity_df', 'statemachine/transition', 'transition_df']) and phone ucb-sdb-ios-4
Dumping key background/filtered_location for range with keys dict_keys(['trip_id', 'trip_id_base', 'trip_run', 'start_ts', 'end_ts', 'duration', 'eval_common_trip_id', 'eval_role', 'eval_role_base', 'eval_role_run', 'evaluation_trip_ranges', 'background/battery', 'battery_df', 'background/location', 'background/filtered_location', 'location_df', 'filtered_location_df', 'background/motion_activity', 'motion_activity_df', 'statemachine/transition', 'transition_df']) and phone ucb-sdb-ios-4
Dumping key background/motion_activity for range with keys dict_keys(['trip_id', 'trip_id_base', 'trip_run', 'start_ts', 'end_ts', 'duration', 'eval_common_trip_id', 'eval_role', 'eval_role_base', 'eval_role_run', 'evaluation_trip_ranges', 'background/battery', 'battery_df', 'background/location', 'background/filtered_location', 'location_df', 'filtered_location_df', 'background/motion_activity', 'motion_activity_df', 'statemachine/transition', 'transition_df']) and phone ucb-sdb-ios-4
Dumping key statemachine/transition for range with keys dict_keys(['trip_id', 'trip_id_base', 'trip_run', 'start_ts', 'end_ts', 'duration', 'eval_common_trip_id', 'eval_role', 'eval_role_base', 'eval_role_run', 'evaluation_trip_ranges', 'background/battery', 'battery_df', 'background/location', 'background/filtered_location', 'location_df', 'filtered_location_df', 'background/motion_activity', 'motion_activity_df', 'statemachine/transition', 'transition_df']) and phone ucb-sdb-ios-4
Dumping key background/battery for range with keys dict_keys(['trip_id', 'trip_id_base', 'trip_run', 'start_ts', 'end_ts', 'duration', 'eval_common_trip_id', 'eval_role', 'eval_role_base', 'eval_role_run', 'evaluation_trip_ranges', 'background/battery', 'battery_df', 'background/location', 'background/filtered_location', 'location_df', 'filtered_location_df', 'background/motion_activity', 'motion_activity_df', 'statemachine/transition', 'transition_df']) and phone ucb-sdb-ios-4

which is consistent with what we see in the directories

$ ls data/ucb-sdb-android-2/unimodal_trip_car_bike_mtv_la/
background~battery      background~location     manual~evaluation_transition
background~filtered_location    background~motion_activity  statemachine~transition

shankari commented 1 year ago

For the analysed data, we don't need to dump top-level data since we will already have it from the raw phone view. Instead, we only need to dump the three additional keys that we add in the analysed view: location_entries, sensed_trip_ranges and sensed_section_ranges

shankari commented 1 year ago

High level question: we can have different trips and different sections - e.g. cleaned versus confirmed, cleaned versus inferred, etc. How do we store these and read them in the phone view?

Option 1: store them in the phone view as the original keys (e.g. analysis/confirmed_trip) to be consistent with the raw data. In this case, the tag can be the branch - e.g. master vs. gis
Option 2: store them in the phone view as standard keys (e.g. location_entries) to be consistent with the current analysed phone view. In this case, the tag must be the branch and the combo (e.g. master_cleaned_cleaned or master_confirmed_inferred) or ...

It is pretty clear that Option 1 is the better option for downloading, although Option 2 might be the better option for loading the data into the analysed phone view.

It is more consistent with the existing data
It is more consistent with reading from the server (so can be consistent across server and file specs)
It does not require us to generate all combinations of trips and sections
It does not require us to duplicate entries for all the combinations - e.g. the location_entries will likely be the same for all combinations

So we will not use the phone view for downloading, but just download directly using the server spec.

shankari commented 1 year ago

ok, so here's another problem. The analysed phone view currently reads all the trips and sections because of https://github.com/MobilityNet/mobilitynet.github.io/issues/31#issuecomment-1341931561 and then copies over the subset ranges.

This has the problem that when it queries, it queries for the range from the start of the evaluation to now. However, when the filespec reads data, it reads from the start to the end, and that end will be now, which means that it won't work.

There are several possible options around using -1 for now and so on, but given that this was a hack to avoid the limitation with the datastreams that we have now addressed anyway, I think we should fix it the right way.

shankari commented 1 year ago

Here's another issue - the start time (after extrapolation) of a trip could be before the evaluation range started. In the analysed_view, we add in a threshold of THIRTY_MINUTES for the matching; let's do the same while downloading in the dump script.

Should we store based on the actual range start and end or the padded by THIRTY_MINUTES start and end? Let's store with the padded values to keep the meaning of the files the same.

shankari commented 1 year ago

Running the pipeline for individual users to monitor it better. While running it with ucb-sdb-android-1, we get a lot of the following errors. This is not a huge issues (yet) since we do not plan to download these as part of the analysis results. But we should really think about how to unify the emeval zephyr code into master. git submodule? python package?

Got error No module named 'emission.net.usercache.formatters.android.evaluation_transition' while saving entry AttrDict({'_id': ObjectId('5d00bc68b88f219ca051064f'), 'metadata': {'key': 'manual/evaluation_transition', 'platform': 'android', 'read_ts': 0, 'time_zone': 'America/Los_Angeles', 'type': 'message', 'write_ts': 1560329319}, 'user_id': UUID('6a2dbafd-ef1e-404c-b61e-506b8935dca4'), 'data': {'transition': 'START_CALIBRATION_PERIOD', 'trip_id': 'high_accuracy_stationary', 'spec_id': 'sfba_trial_3', 'device_manufacturer': 'motorola', 'device_model': 'Nexus 6', 'device_version': '6.0.1', 'ts': 1560329319}}) -> None

shankari commented 1 year ago

We end up with 11 trips, the last of which is at 2019-07-12T22:51:12. I guess this is because we didn't duty cycle after that? Yup!

2022-12-09 06:34:40,954:DEBUG:4591250944:filter_accuracy disabled, early return

Wait

>>> pd.json_normalize(list(edb.get_timeseries_db().find({"user_id": UUID("6a2dbafd-ef1e-404c-b61e-506b8935dca4"), "metadata.key": "background/filtered_location"}).sort("data.ts", -1).limit(3)))[["data.fmt_time"]]
                      data.fmt_time
0         2019-07-12T15:51:43-07:00
1         2019-07-12T15:51:11-07:00
2  2019-07-12T15:50:39.055000-07:00

but

>>> pd.json_normalize(list(edb.get_timeseries_db().find({"user_id": UUID("6a2dbafd-ef1e-404c-b61e-506b8935dca4"), "metadata.key": "background/location"}).sort("data.ts", -1).limit(3)))[["data.fmt_time"]]
               data.fmt_time
0  2019-07-28T17:16:38-07:00
1  2019-07-28T17:16:37-07:00
2  2019-07-28T17:16:36-07:00

Maybe we need to run it again because there is so much incoming data.

Yup!

>>> pd.json_normalize(list(edb.get_usercache_db().find({"user_id": UUID("6a2dbafd-ef1e-404c-b61e-506b8935dca4"), "metadata.key": "background/location"}).sort("data.ts", -1).limit(3)))[["data.fmt_time"]]
            data.fmt_time
0  Mar 4, 2020 5:42:43 PM
1  Mar 4, 2020 5:42:42 PM
2  Mar 4, 2020 5:42:41 PM

shankari commented 1 year ago

After running it multiple times, we still have a few entries left in the usercache (https://github.com/e-mission/e-mission-docs/issues/761)

>>> pd.json_normalize(list(edb.get_usercache_db().find({"user_id": UUID("6a2dbafd-ef1e-404c-b61e-506b8935dca4"), "metadata.key": "background/location"}).sort("data.ts", -1)))[["data.fmt_time"]]
              data.fmt_time
0   Nov 22, 2019 6:40:05 PM
1   Nov 22, 2019 6:40:04 PM
2   Nov 22, 2019 6:40:04 PM
3   Nov 22, 2019 6:40:03 PM
4   Nov 22, 2019 6:40:02 PM
5   Nov 22, 2019 6:40:02 PM
6   Nov 22, 2019 6:40:01 PM
7   Nov 22, 2019 6:40:01 PM
8   Jul 28, 2019 5:16:42 PM
9   Jul 28, 2019 5:16:41 PM
10  Jul 28, 2019 5:16:40 PM
11  Jul 28, 2019 5:16:39 PM

shankari commented 1 year ago

After running the master pipeline, trying to download data, but ran into an issue where we were still trying to read the spec from the server to determine whether the input spec was valid.

$ python dump_data_to_file.py --spec-id unimodal_trip_car_bike_mtv_la analysed master_9b70c97 --raw_dir data
Retrieving data for: post_body={'user': 'shankari@eecs.berkeley.edu', 'key_list': ['config/evaluation_spec'], 'key_time': 'metadata.write_ts', 'start_time': 0, 'end_time': 9223372036854775807}
response=<Response [200]>
Found 0 entries
Traceback (most recent call last):
  File "dump_data_to_file.py", line 265, in <module>
    assert args.spec_id in spec_ids,\
AssertionError: spec_id `unimodal_trip_car_bike_mtv_la` not found within current datastore instance

Tried to change that to read from the local server by calling retrieve_data on the raw_dir FileSpecDetails instead, but it requires a CURR_SPEC_ID.

fsd = eisd.FileSpecDetails(args.raw_dir, args.author_email)
fsd.retrieve_data(args.author_email, "config/evaluation_spec", 0, sys.maxsize)

Traceback (most recent call last):
  File "dump_data_to_file.py", line 265, in <module>
    args.func(args)
  File "dump_data_to_file.py", line 62, in download_analysed
    fsd.retrieve_data(args.author_email, "config/evaluation_spec", 0, sys.maxsize)
  File "../emeval/input/spec_details.py", line 189, in retrieve_data
    f"{user}/{self.CURR_SPEC_ID}/{key.replace('/', '~')}/{math.floor(start_ts)}_{math.ceil(end_ts)}.json")
AttributeError: 'FileSpecDetails' object has no attribute 'CURR_SPEC_ID'

Moving the retrieve all specs call into the spec details; if this doesn't work, will require the specid instead.

shankari commented 1 year ago

Downloaded details for the unimodal spec correctly; moving on to the other specs

Note the ranges for the individual phones are slightly different; even for the non-control phones. I am not 100% sure why that is happening, but it is consistent for both the raw and analysed data. Might want to take a look to see why.

$ ls -al bin/data/ucb-sdb-android-2/unimodal_trip_car_bike_mtv_la/background~location/
1601494 Dec  8 21:25 1564274304_1564282403.json
1550345 Dec  8 21:25 1564334125_1564343116.json
1805219 Dec  8 21:25 1564351292_1564360116.json
2156696 Dec  8 21:25 1565571044_1565578987.json
1804446 Dec  8 21:25 1567271214_1567279428.json
1825668 Dec  8 21:25 1567288623_1567297358.json

$ ls -al bin/data/master_9b70c97/ucb-sdb-android-2/unimodal_trip_car_bike_mtv_la/analysis~recreated_location/
total 1584
118190 Dec 10 19:02 1564272504_1564284203.json
86702 Dec 10 19:02 1564332325_1564344916.json
111861 Dec 10 19:02 1564349492_1564361916.json
128537 Dec 10 19:02 1565569244_1565580787.json
115558 Dec 10 19:02 1567269414_1567281228.json
110241 Dec 10 19:02 1567286823_1567299158.json

$ ls -al bin/data/ucb-sdb-android-3/unimodal_trip_car_bike_mtv_la/background~location/
total 11448
82102 Dec  8 21:25 1564274288_1564282424.json
102737 Dec  8 21:25 1564334097_1564343026.json
75994 Dec  8 21:25 1564351277_1564360135.json
1883423 Dec  8 21:25 1565571018_1565578933.json
1863435 Dec  8 21:25 1567271178_1567279373.json
1841009 Dec  8 21:25 1567288638_1567297395.json

$ ls -al bin/data/master_9b70c97/ucb-sdb-android-3/unimodal_trip_car_bike_mtv_la/analysis~recreated_location/
total 1608
116713 Dec 10 19:02 1564272488_1564284224.json
113150 Dec 10 19:02 1564332297_1564344826.json
122121 Dec 10 19:02 1564349477_1564361935.json
155807 Dec 10 19:02 1565569218_1565580733.json
112924 Dec 10 19:02 1567269378_1567281173.json
105991 Dec 10 19:02 1567286838_1567299195.json

shankari commented 1 year ago

Pulled all the results, and now trying to run the Evaluation*analysis_master.

we need a CURR_SPEC_ID and other spec entries to be filled in so that the phone view et al can work. Concretely, the CURR_SPEC_ID shows up in the path to the dumped file
But if we specify the spec ID upfront, then they are in the raw location, not the analysed location, so reading them fails
- The long-term solution is to create an AnalysedFileSpec that takes a FileSpec as input and delegates the reads to it
- The short-term solution is to copy over the CURR_SPEC_ID, the curr_spec_entry and all the other items from populate_spec_details
before this, we could not read analysed results based on data fields, so we had to read all the data and then filter into each individual range. We can now query based on data fields, so we query for one range at a time (consistent with https://github.com/MobilityNet/mobilitynet.github.io/issues/31#issuecomment-1343805591).
since we already have the data in ranges, we don't need to subset the trips any more. We do have to subset the sections (and their locations) for each trip and only the locations for each section.

With these changes, we are able to load the analysed view for each phone view.

shankari commented 1 year ago

but while running the notebook, there is an error because there are no matching sensed segments for

2019-07-27T19:20:31.060968-07:00 -> 2019-07-27T19:20:57.402429-07:00
[]
Found no sensed segments, early return
[]

2019-07-24T16:37:07.746717-07:00 -> 2019-07-24T16:41:54.618997-07:00

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-47-aa44bf45f9ff> in <module>
----> 1 check_outlier(av_ucb.map()['ios']['ucb-sdb-ios-3']["evaluation_ranges"][0], 2, "walk to the bikeshare location_0", "WALKING")

<ipython-input-41-820e61dbfbd7> in check_outlier(eval_range, trip_idx, section_id, base_mode)
      8     eval_section = [s for s in eval_trip["evaluation_section_ranges"] if s["trip_id"] == section_id][0]
      9     print(fmt(eval_section["start_ts"]), "->", fmt(eval_section["end_ts"]))
---> 10     print([(fmt(ssr["start_ts"]), fmt(ssr["end_ts"]), ssr["mode"]) for ssr in eval_trip["sensed_section_ranges"]])
     11     matching_section_map = embs.find_matching_segments(eval_trip["evaluation_section_ranges"], "trip_id", eval_trip["sensed_section_ranges"])
     12     sensed_section_range = matching_section_map[section_id]["match"]

<ipython-input-41-820e61dbfbd7> in <listcomp>(.0)
      8     eval_section = [s for s in eval_trip["evaluation_section_ranges"] if s["trip_id"] == section_id][0]
      9     print(fmt(eval_section["start_ts"]), "->", fmt(eval_section["end_ts"]))
---> 10     print([(fmt(ssr["start_ts"]), fmt(ssr["end_ts"]), ssr["mode"]) for ssr in eval_trip["sensed_section_ranges"]])
     11     matching_section_map = embs.find_matching_segments(eval_trip["evaluation_section_ranges"], "trip_id", eval_trip["sensed_section_ranges"])
     12     sensed_section_range = matching_section_map[section_id]["match"]

KeyError: 'start_ts'

shankari commented 1 year ago

Tried to pull up the version with the prior results, but don't see any outputs for these cells in the checked-in version. But all the last tests are failing with

KeyError                                  Traceback (most recent call last)
<ipython-input-49-cf4b53f0c6bf> in <module>
----> 1 check_outlier(pv_la.map()['android']['ucb-sdb-android-3']["evaluation_ranges"][0], 0, "walk_start_0", "WALKING")

<ipython-input-41-820e61dbfbd7> in check_outlier(eval_range, trip_idx, section_id, base_mode)
      8     eval_section = [s for s in eval_trip["evaluation_section_ranges"] if s["trip_id"] == section_id][0]
      9     print(fmt(eval_section["start_ts"]), "->", fmt(eval_section["end_ts"]))
---> 10     print([(fmt(ssr["start_ts"]), fmt(ssr["end_ts"]), ssr["mode"]) for ssr in eval_trip["sensed_section_ranges"]])
     11     matching_section_map = embs.find_matching_segments(eval_trip["evaluation_section_ranges"], "trip_id", eval_trip["sensed_section_ranges"])
     12     sensed_section_range = matching_section_map[section_id]["match"]

KeyError: 'sensed_section_ranges'

Just need to debug by hand.

shankari commented 1 year ago

Let's start with the lack of matches for

2019-07-27T19:20:31.060968-07:00 -> 2019-07-27T19:20:57.402429-07:00

The corresponding evaluation range

>>> arrow.get(range_0["start_ts"]).to("America/Los_angeles"), arrow.get(range_0["end_ts"]).to("America/Los_angeles")
(<Arrow [2019-07-27T17:38:24.968000-07:00]>,
 <Arrow [2019-07-27T19:53:22.886000-07:00]>)

has two trips

2019-07-27T17:38:54.143985-07:00 2019-07-27T17:54:56.504297-07:00
2019-07-27T18:59:17.435039-07:00 2019-07-27T19:20:57.464819-07:00

and each trip has three sections

2019-07-27T17:38:54.143985-07:00 2019-07-27T17:54:56.504297-07:00
------- 2019-07-27T17:38:54.192643-07:00 2019-07-27T17:40:03.303200-07:00
------- 2019-07-27T17:40:03.318182-07:00 2019-07-27T17:52:26.823849-07:00
------- 2019-07-27T17:52:26.843096-07:00 2019-07-27T17:54:56.450234-07:00
2019-07-27T18:59:17.435039-07:00 2019-07-27T19:20:57.464819-07:00
------- 2019-07-27T18:59:17.495898-07:00 2019-07-27T19:01:06.611826-07:00
------- 2019-07-27T19:01:06.626976-07:00 2019-07-27T19:20:31.044772-07:00
------- 2019-07-27T19:20:31.060968-07:00 2019-07-27T19:20:57.402429-07:00

ok, so now let's see how the matching works.

There are two sensed trips and 9 sensed sections

2019-07-27T17:42:38.727000-07:00 2019-07-27T17:51:30-07:00
2019-07-27T19:03:05.796040-07:00 2019-07-27T19:21:19-07:00
=======
2019-07-27T17:42:38.727000-07:00 2019-07-27T17:51:10-07:00
2019-07-27T17:51:11-07:00 2019-07-27T17:51:30-07:00
2019-07-27T19:03:05.796040-07:00 2019-07-27T19:08:26-07:00
2019-07-27T19:08:27-07:00 2019-07-27T19:08:35-07:00
2019-07-27T19:08:36-07:00 2019-07-27T19:12:34-07:00
2019-07-27T19:12:35-07:00 2019-07-27T19:12:48-07:00
2019-07-27T19:12:49-07:00 2019-07-27T19:17:09-07:00
2019-07-27T19:17:10-07:00 2019-07-27T19:17:47-07:00
2019-07-27T19:17:49-07:00 2019-07-27T19:21:19-07:00

And there are matched sections for the evaluated trip ranges

2019-07-27T17:38:54.143985-07:00 2019-07-27T17:54:56.504297-07:00
------- 2019-07-27T17:42:38.727000-07:00 2019-07-27T17:51:10-07:00
------- 2019-07-27T17:51:11-07:00 2019-07-27T17:51:30-07:00
2019-07-27T18:59:17.435039-07:00 2019-07-27T19:20:57.464819-07:00
------- 2019-07-27T19:03:05.796040-07:00 2019-07-27T19:08:26-07:00
------- 2019-07-27T19:08:27-07:00 2019-07-27T19:08:35-07:00
------- 2019-07-27T19:08:36-07:00 2019-07-27T19:12:34-07:00
------- 2019-07-27T19:12:35-07:00 2019-07-27T19:12:48-07:00
------- 2019-07-27T19:12:49-07:00 2019-07-27T19:17:09-07:00
------- 2019-07-27T19:17:10-07:00 2019-07-27T19:17:47-07:00
------- 2019-07-27T19:17:49-07:00 2019-07-27T19:21:19-07:00

shankari commented 1 year ago

One issue is that the sensed_ranges all have data in them, while the evaluation ranges do not

range_0 = av_la.map()["android"]["ucb-sdb-android-2"]["evaluation_ranges"][0]
for t in range_0["sensed_trip_ranges"]:
    print(arrow.get(t["data"]["start_ts"]).to("America/Los_angeles"), arrow.get(t["data"]["end_ts"]).to("America/Los_angeles"))
print("=======")
for s in range_0["sensed_section_ranges"]:
    print(arrow.get(s["data"]["start_ts"]).to("America/Los_angeles"), arrow.get(s["data"]["end_ts"]).to("America/Los_angeles"))
print("=======")

for t in range_0["evaluation_trip_ranges"]:
    print(arrow.get(t["start_ts"]).to("America/Los_angeles"), arrow.get(t["end_ts"]).to("America/Los_angeles"))
    for s in t["sensed_section_ranges"]:
        print("-------", arrow.get(s["data"]["start_ts"]).to("America/Los_angeles"), arrow.get(s["data"]["end_ts"]).to("America/Los_angeles"))

But that should result in the matching failing with start_ts not found, not with missing section matches.

shankari commented 1 year ago

Ah, that's because this does work on android but apparently not on iOS. There are no sensed trip or section ranges

=======
=======
2019-07-27T17:38:54.143985-07:00 2019-07-27T17:54:56.504297-07:00
2019-07-27T18:59:17.435039-07:00 2019-07-27T19:20:57.464819-07:00

shankari commented 1 year ago

iOS2 works, but not iOS3

2019-07-27T17:38:54.143985-07:00 2019-07-27T17:54:56.504297-07:00
------- 2019-07-27T17:48:26.003052-07:00 2019-07-27T17:55:14.984543-07:00
2019-07-27T18:59:17.435039-07:00 2019-07-27T19:20:57.464819-07:00
------- 2019-07-27T19:02:21.000540-07:00 2019-07-27T19:04:49.996204-07:00
------- 2019-07-27T19:04:50.996161-07:00 2019-07-27T19:05:06.995687-07:00
------- 2019-07-27T19:05:07.995770-07:00 2019-07-27T19:12:40.996040-07:00
------- 2019-07-27T19:12:41.996006-07:00 2019-07-27T19:13:02.995287-07:00
------- 2019-07-27T19:13:03.995252-07:00 2019-07-27T19:14:55.991418-07:00
------- 2019-07-27T19:14:56.991384-07:00 2019-07-27T19:18:07.999237-07:00
------- 2019-07-27T19:18:08.999212-07:00 2019-07-27T19:18:24.998764-07:00
------- 2019-07-27T19:18:25.998734-07:00 2019-07-27T19:18:54.997776-07:00
------- 2019-07-27T19:18:55.997741-07:00 2019-07-27T19:19:48.995914-07:00
------- 2019-07-27T19:19:53.995742-07:00 2019-07-27T19:21:41.992005-07:00

shankari commented 1 year ago

For this first range, there are apparently no matching trips?!

ucb-sdb-ios-3 evaluation_1 dict_keys(['role', 'manual/evaluation_transition', 'calibration_transitions', 'calibration_ranges', 'evaluation_transitions', 'evaluation_ranges'])
         ==============================
         HAHFDC v/s MAHFDC:MAHFDC_0 HAHFDC v/s MAHFDC MAHFDC_0 2
Before filtering, trips = []
Filter range = 2019-07-27T17:38:54.143985-07:00 -> 2019-07-27T17:54:56.504297-07:00
After filtering, trips = []
Before filtering, trips = []
Filter range = 2019-07-27T18:59:17.435039-07:00 -> 2019-07-27T19:20:57.464819-07:00
After filtering, trips = []
         ==============================
         HAHFDC v/s MAHFDC:MAHFDC_1 HAHFDC v/s MAHFDC MAHFDC_1 2
Before filtering, trips = [('2019-07-28T10:23:13.947510-07:00', '2019-07-28T10:31:48.066216-07:00'), ('2019-07-28T10:31:54.494439-07:00', '2019-07-28T10:34:30.450632-07:00'), ('2019-07-28T11:50:42.000985-07:00', '2019-07-28T12:10:40.324661-07:00')]
Filter range = 2019-07-28T10:19:03.776588-07:00 -> 2019-07-28T10:32:24.080722-07:00
After filtering, trips = ['2019-07-28T10:23:13.947510-07:00', '2019-07-28T10:31:54.494439-07:00']
Before filtering, trips = [('2019-07-28T10:23:13.947510-07:00', '2019-07-28T10:31:48.066216-07:00'), ('2019-07-28T10:31:54.494439-07:00', '2019-07-28T10:34:30.450632-07:00'), ('2019-07-28T11:50:42.000985-07:00', '2019-07-28T12:10:40.324661-07:00')]
Filter range = 2019-07-28T11:48:06.675345-07:00 -> 2019-07-28T12:09:44.829831-07:00
After filtering, trips = ['2019-07-28T11:50:42.000985-07:00']

shankari commented 1 year ago

There are raw trips and sections but no cleaned or confirmed trips for this phone and trip combo

2B Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~cleaned_section/1564272465_1564284123.json
2B Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~cleaned_trip/1564272465_1564284123.json
2B Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~cleaned_untracked/1564272465_1564284123.json
2B Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~confirmed_trip/1564272465_1564284123.json
2B Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~inferred_section/1564272465_1564284123.json
72K Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~recreated_location/1564272465_1564284123.json
4.4K Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/segmentation~raw_section/1564272465_1564284123.json
4.6K Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/segmentation~raw_trip/1564272465_1564284123.json
2B Dec 10 19:35 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/segmentation~raw_untracked/1564272465_1564284123.json

shankari commented 1 year ago

ok, so for the trip to the location, it is very short and we basically have no data

2022-12-11 19:24:32,988:DEBUG:4396191232:Considering trip 63969ea3b10cfb28033e534d: 2019-07-27T17:58:13.428633-07:00 -> 2019-07-27T17:58:16.703112-07:00
...
2022-12-11 19:24:53,396:INFO:4396191232:Skipped single point trip 63969ea3b10cfb28033e534d (2019-07-27T17:58:13.428633-07:00 -> 2019-07-27T17:58:16.703112-07:00) of length 1.773354991055583
2022-12-11 19:24:53,396:DEBUG:4396191232:For raw trip 63969ea3b10cfb28033e534d, found filtered trip None

But how about the trip back?

2022-12-11 19:24:32,988:DEBUG:4396191232:Considering trip 63969ea3b10cfb28033e5351: 2019-07-27T19:04:36.508539-07:00 -> 2019-07-27T19:21:37.427862-07:00

2022-12-11 19:24:53,781:DEBUG:4396191232:Starting with element of type trip, id 63969f05b10cfb28033e66dc, details Entry({'_id': ObjectId('63969f05b10cfb28033e66dc'), 'user_id':  UUID('7ed80490-6853-433d-9d20-838fe4d3d71b'), 'metadata': Metadata({'key': 'analysis/cleaned_section'}, 'data': Cleanedsection({'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('63969f05b10cfb28033e66da'), 'start_ts': 1564279335.4815264, 'start_fmt_time': '2019-07-27T19:02:15.481526-07:00', 'start_loc': {'type': 'Point', 'coordinates':[-122.11348540560869, 37.38088791613373]}, 'end_ts': 1564280497.4278622, 'end_fmt_time': '2019-07-27T19:21:37.427862-07:00', 'end_loc': {'type': 'Point', 'coordinates': [-122.08372039576801, 37.390345769893756]}, 'duration': 1161.9463357925415, 'distance': 3705.457082358938, 'sensed_mode': 7})})
2022-12-11 19:24:53,791:DEBUG:4396191232:For raw trip 63969ea3b10cfb28033e5351, found filtered trip 63969f05b10cfb28033e66da

2022-12-11 19:25:43,721:DEBUG:4396191232:fix_squished_place: Fixed trip object = Cleanedtrip({'source': 'DwellSegmentationDistFilter', 'end_ts': 1564280497.4278622, 'end_fmt_time': '2019-07-27T19:21:37.427862-07:00', 'start_ts': 1564279305.4815264, 'start_fmt_time': '2019-07-27T19:01:45.481526-07:00', 'duration': 1191.9463357925415, 'distance': 3707.2304373499933})

Also

Inserting entry Entry({'user_id': UUID('7ed80490-6853-433d-9d20-838fe4d3d71b'), 'metadata': {'key': 'analysis/inferred_section'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('63969f05b10cfb28033e66da'), 'start_ts': 1564279305.4815264, 'start_local_dt': {'year': 2019, 'month': 7, 'day': 27, 'hour': 19, 'minute': 1, 'second': 45, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2019-07-27T19:01:45.481526-07:00', 'start_loc': {'type': 'Point', 'coordinates': [-122.1134678931847, 37.380895707027186]}, 'end_ts': 1564280497.4278622, 'end_fmt_time': '2019-07-27T19:21:37.427862-07:00',  'duration': 1191.9463357925415,

So we do have the trips and sections; why are we not retrieving them?

shankari commented 1 year ago

Two observations:

There are two queries for the segmentation/raw_trip but saved in one file

Dumping key segmentation/raw_trip for key_time = data.start_ts and phone ucb-sdb-ios-3
original range = 2019-07-27T17:37:45.212364-07:00 -> 2019-07-27T19:52:02.549677-07:00,padded range = 2019-07-2
7T17:07:45.212364-07:00 -> 2019-07-27T20:22:02.549677-07:00
Retrieving data for ucb-sdb-ios-3 from 1564272465.212364 -> 1564284122.549677
Retrieving data for: post_body={'user': 'ucb-sdb-ios-3', 'key_list': ['segmentation/raw_trip'], 'key_time': 'data.start_ts', 'start_time': 1564272465.212364, 'end_time': 1564284122.549677}
response=<Response [200]>
Found 2 entries
Retrieving data for ucb-sdb-ios-3 from 1670815395.355776 -> 1564284122.549677
Retrieving data for: post_body={'user': 'ucb-sdb-ios-3', 'key_list': ['segmentation/raw_trip'], 'key_time': 'data.start_ts', 'start_time': 1670815395.355776, 'end_time': 1564284122.549677}
response=<Response [200]>
Found 0 entries
Creating out_file='data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/segmentation~raw_trip/15642
72465_1564284123.json'...

There's one entry for cleaned trip

Dumping key analysis/cleaned_trip for key_time = data.start_ts and phone ucb-sdb-ios-3
original range = 2019-07-27T17:37:45.212364-07:00 -> 2019-07-27T19:52:02.549677-07:00,padded range = 2019-07-27T17:07:45.212364-07:00 -> 2019-07-27T20:22:02.549677-07:00
Retrieving data for ucb-sdb-ios-3 from 1564272465.212364 -> 1564284122.549677
Retrieving data for: post_body={'user': 'ucb-sdb-ios-3', 'key_list': ['analysis/cleaned_trip'], 'key_time': 'data.start_ts', 'start_time': 1564272465.212364, 'end_time': 1564284122.549677}
response=<Response [200]>
Found 1 entries
Creating out_file='data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~cleaned_trip/1564272465_1564284123.json'...

But that file has no data

ls -alh data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~cleaned_trip/1564272465_1564284123.json
2B Dec 11 22:00 data/master_9b70c97/ucb-sdb-ios-3/unimodal_trip_car_bike_mtv_la/analysis~cleaned_trip/1564272465_1564284123.json

shankari commented 1 year ago

so the two calls (including one where the start > end is because we continue reading until we have zero or one entry, and we pick the next batch as starting from the metadata.write_ts of the final entry in the batch. To be consistent with https://github.com/MobilityNet/mobilitynet.github.io/issues/31#issuecomment-1343805591 we need to set the second batch to start from the key_time of the last entry in the first batch.

The reason that the one entry is not saved is because of a very stupid bug that seems to have been around forever. If we only ever get one batch (e.g. only get exactly one entry), then location_entries is never added to. I guess we haven't hit this before because it is unlikely that we retrieve only one entry at a time. https://github.com/MobilityNet/mobilitynet-analysis-scripts/blob/master/emeval/input/spec_details.py#L160

shankari commented 1 year ago

wrt https://github.com/MobilityNet/mobilitynet.github.io/issues/31#issuecomment-1345790435

One issue is that the sensed_ranges all have data in them, while the evaluation ranges do not

It looks like we do expect to have ['data'] - and we do in the check_outlier_expanded

    print([(fmt(ssr["data"]["start_ts"]), fmt(ssr["data"]["end_ts"]), ssr["data"]["mode"])
           for ssr in eval_trip["sensed_section_ranges"]])

shankari commented 1 year ago

Ah but if we fix that, we run into another issue with the key

~/e-mission/mobilitynet-analysis-scripts/emeval/metrics/baseline_segmentation.py in find_matching_segments(gt_segments, id_key, sensed_segments)
     80             (len(gt_segments), len(sensed_segments)))
     81         for gt in gt_segments:
---> 82             start_segment_idx = find_closest_segment_idx(gt, sensed_segments, "start_ts")
     83             # We want to find the end segment id in the segments after the
     84             # start segment. So we filter the array passed in, and add back the

~/e-mission/mobilitynet-analysis-scripts/emeval/metrics/baseline_segmentation.py in find_closest_segment_idx(gt, sensed_segments, key)
     48 
     49 def find_closest_segment_idx(gt, sensed_segments, key):
---> 50     ts_diffs = [abs(gt[key] - st[key]) for st in sensed_segments]
     51     # import arrow
     52     # print("diffs for %s %s = %s" % (key, arrow.get(gt[key]).to("America/Los_Angeles"), ts_diffs))

~/e-mission/mobilitynet-analysis-scripts/emeval/metrics/baseline_segmentation.py in <listcomp>(.0)
     48 
     49 def find_closest_segment_idx(gt, sensed_segments, key):
---> 50     ts_diffs = [abs(gt[key] - st[key]) for st in sensed_segments]
     51     # import arrow
     52     # print("diffs for %s %s = %s" % (key, arrow.get(gt[key]).to("America/Los_Angeles"), ts_diffs))

KeyError: 'start_ts'

I then double-checked Evaluate_power_vs_classification, which works on transition matches and it also has a check_outlier which does not have a data.

I also checked the new classification code which is the main part that we need to get to work, and it has the fallback

        if 'data' in ss.keys():
            ss = ss['data']

So let's just strip out the data while creating the analysed timeline. If there are still issues, we can stop here, verify that the analysed timeline is working properly and move on to getting the notebooks for the paper done.