e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

🗃️ Add an interface to support returning the cleaned section <-> inferred section mapping for a set of cleaned sections #970

Closed shankari closed 9 months ago

shankari commented 10 months ago

This will allow us to have a generic interface for use by the dashboards while optimizing the implementation later. This is currently needed for https://github.com/e-mission/op-admin-dashboard/commit/6cdf8e6d82c3adf3a969d5d276a5fb6886857a76#diff-1c6b8e6d103286796ce21a8276c4a4d8b258e29d6b9cc6df516a92accf4674d1R201

The desired interface would be something like: cleaned2inferred_section_list, similar to the current cleaned2inferred_section but with a list passed in. The initial implementation could be the simple loop at: https://github.com/e-mission/op-admin-dashboard/commit/6cdf8e6d82c3adf3a969d5d276a5fb6886857a76#diff-1c6b8e6d103286796ce21a8276c4a4d8b258e29d6b9cc6df516a92accf4674d1R201-R206
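A minimal sketch of that batch interface, assuming the existing per-section helper is cleaned2inferred_section(user_id, section_id) (the exact signature and return type in emission/storage/decorations/section_queries.py may differ):

```python
def cleaned2inferred_section_list(section_user_list):
    # section_user_list: iterable of (user_id, cleaned_section_id) pairs
    # Initial implementation: a simple loop over the existing single-section
    # helper; the optimized version would only need to replace this body.
    return {section_id: cleaned2inferred_section(user_id, section_id)
            for (user_id, section_id) in section_user_list}
```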

A performance optimization would be the original implementation with https://github.com/e-mission/op-admin-dashboard/commit/6cdf8e6d82c3adf3a969d5d276a5fb6886857a76#diff-1c6b8e6d103286796ce21a8276c4a4d8b258e29d6b9cc6df516a92accf4674d1L199-L202

Although, given our data model, I would prefer an optimization in which we retrieve potentially matching inferred modes by time range or geo-range and then match them up in memory. In general, with the timeseries data model, we want to avoid using the linkages (the foreign keys) between collections, because they would not necessarily be searchable in a real timeseries database; they are more of a relational data model concept.
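As a rough illustration of that in-memory matching idea, assuming inferred section entries reference their cleaned section in data.cleaned_section and a find_entries(key_list, time_query)-style accessor on the timeseries object (both are assumptions, not a confirmed server API):

```python
def match_inferred_by_range(ts, cleaned_section_ids, time_query):
    # 1. A single range query for all inferred sections in the window of interest
    inferred_entries = ts.find_entries(["analysis/inferred_section"], time_query)
    # 2. Index them in memory by the cleaned section they were derived from
    by_cleaned_id = {e["data"]["cleaned_section"]: e for e in inferred_entries}
    # 3. Match the requested cleaned sections against the in-memory index
    return {sid: by_cleaned_id.get(sid) for sid in cleaned_section_ids}
```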

If we did go with the timeseries approach, we could also close https://github.com/e-mission/e-mission-server/pull/934

@TTalex

shankari commented 10 months ago

One problem with using a range for this specific use case is that we get the list of section ids from the list of points that were within a polygon. This could span a wide date range (e.g. months) or a wide geographic range (the trajectories passed through a location but the start and end could be anywhere).

An intermediate tradeoff could be that we still use the time range, but split it up so that we don't have to read too many sections at a time. Note that this is similar to the $in approach, where we might also want to chunk by 100 values at a time.

# Note: for performance reasons, it is not recommended to use '$in' a list bigger than ~100 values
# In our use case, this could happen on popular trips, but the delay is deemed acceptable

We can then continue to use the timeseries-based data model without having to make $O(n)$ queries to the server. Other, more complex approaches are also possible.
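As a rough sketch of the chunked $in idea above (illustrative only; a real implementation would presumably go through the timeseries abstraction rather than a raw pymongo collection):

```python
def find_inferred_sections_chunked(analysis_collection, cleaned_section_ids, chunk_size=100):
    # Split the id list into chunks of ~100 so that each '$in' stays small,
    # then accumulate the matching inferred sections across chunks.
    results = []
    for i in range(0, len(cleaned_section_ids), chunk_size):
        chunk = cleaned_section_ids[i:i + chunk_size]
        results.extend(analysis_collection.find(
            {"metadata.key": "analysis/inferred_section",
             "data.cleaned_section": {"$in": chunk}}))
    return results
```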

shankari commented 10 months ago

@MukuFlash03 here's the next issue for you to work on

shankari commented 9 months ago

I see that the current implementation with the loop requires a user_id and a section_id to be passed in. For the batch method, you can take in a list of user_ids and section_ids or a list of {user_id, section_id} dictionaries. Essentially you can go from one of those representations to the other either by doing zip or a list comprehension that splits it out.
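For example, converting between the two representations is a one-liner in either direction (plain Python, no e-mission specifics; the sample values are made up):

```python
user_id_list = ["user_1", "user_2"]
section_id_list = ["sec_a", "sec_b"]

# From parallel lists to a list of dicts...
pairs = [{"user_id": u, "section_id": s}
         for u, s in zip(user_id_list, section_id_list)]

# ...and back from a list of dicts to parallel lists
user_id_list = [p["user_id"] for p in pairs]
section_id_list = [p["section_id"] for p in pairs]
```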

Or, take only section ids and just implement the performance optimization for now.

I think we will have to tweak the interface a bit over time and polish it depending on new use cases that come in. For now, I just want to have a reasonable implementation and a way to find all uses for when we polish later.

MukuFlash03 commented 9 months ago

Since the initial code implementation was ready, I decided to first add the required functionality before optimizing, and I also worked on the tests.

I saw that the functionality involves the keys analysis/inferred_section and analysis/cleaned_section. Doing a grep in the tests/data folder for analysis/inferred_section returned one file:

$ grep -nr "analysis/inferred_section" emission/tests/data
=> jack_untracked_time_2023-03-12.inferred_section.expected_composite_trips

My question is whether this is the right file that I should be using for testing the section queries.

I have this concern because the sample data format does not match the query being formed to fetch data in _get_inference_entry_for_section(), which is the function used by cleaned2inferred_section().

With respect to this data file:

1) The current format in this sample data file is JSON in which the actual "sections" info, with metadata.key = analysis/inferred_section, is present in a nested block: {id, user_id, metadata, data: {sections: [{analysis/inferred_key and section_id present here}]}}

2) Currently, _get_inference_entry_for_section() runs its query on the outermost JSON data, which doesn't have "sections". The metadata.key of the outer parent dictionary is "analysis/composite_trip", which doesn't match "analysis/inferred_section".
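To make the mismatch concrete, the shape described above is roughly the following (an illustrative paraphrase with abbreviated fields, not the exact file contents):

```python
# Rough shape of one entry in the inferred_section.expected_composite_trips file
composite_trip_entry = {
    "_id": "...",
    "user_id": "...",
    "metadata": {"key": "analysis/composite_trip"},  # the outer key that the query actually sees
    "data": {
        "sections": [
            {   # the inferred-section info only appears here, one level down
                "_id": "<section_id>",
                "metadata": {"key": "analysis/inferred_section"},
            },
        ],
    },
}
```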

I will work on the code implementation again for now, then move back to testing.

MukuFlash03 commented 9 months ago

I also see that the code uses the analysis timeseries db to query for data. I did try using the data file mentioned above via etc.setUpRealExample(), but this loads data into the timeseries db and not the analysis timeseries db, since that function loads into the timeseries db specifically.

So, I believe this is not the right way to test functionality involving analysis timeseries db.

The other functions in emission/storage/decorations/section_queries.py, like get_sections_for_trip() and get_sections_for_trip_list(), are tested by creating a new section and inserting it into the analysis timeseries db with builtin_timeseries.insert(), which uses the metadata.key to route the entry to the appropriate database, in this case the analysis timeseries db.
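A rough sketch of that pattern; the helper names here (Entry.create_entry, TimeSeries.get_time_series) are based on my reading of the existing tests and may not match the actual calls exactly:

```python
import emission.core.wrapper.entry as ecwe
import emission.storage.timeseries.abstract_timeseries as esta

def insert_test_inferred_section(user_id, section_data):
    # Wrap the section data in an entry whose metadata.key routes it to the
    # analysis timeseries db, then insert it through the timeseries abstraction
    entry = ecwe.Entry.create_entry(user_id, "analysis/inferred_section", section_data)
    esta.TimeSeries.get_time_series(user_id).insert(entry)
    return entry
```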

So, I am now also considering this testing approach, but I need to see how sensed_mode should be set and accessed.

shankari commented 9 months ago

After setting up the example, you need to run the pipeline. That will create the analysis results. Please also see chapter 5 of my thesis to understand how the pipeline works.
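In a test, that typically ends up looking something like the following sketch (helper names taken from the comments above and from emission.tests.common; the exact names and the example path may differ):

```python
import emission.tests.common as etc

# Inside a unittest.TestCase method: load the raw example data for the test
# user, then run the full intake pipeline so that the analysis/* entries
# (including sections) get generated in the analysis timeseries db.
etc.setUpRealExample(self, "emission/tests/data/real_examples/jack_untracked_time_2023-03-12")
etc.runIntakePipeline(self.testUUID)
```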

MukuFlash03 commented 9 months ago

The issue I am facing is that, after setting up the example and running the intake pipeline, the analysis timeseries data contains only analysis/cleaned_section keys with sections, but no analysis/inferred_section keys.

I also think the function cleaned2inferred_section may actually do what its name suggests; however, the cleaned2inferred_section code uses analysis/inferred_section as the key when querying data, and the data it filters on, which is present in the analysis db, doesn't contain this key. It only contains analysis/cleaned_section.

I am unsure whether I first have to manually convert cleaned_section data to inferred_section data. How would I create test data containing inferred_sections in the first place?


So, I have been trying to understand the entire data flow, from setting up example datasets to obtaining the analysis data in the appropriate timeseries dbs. I learned that the pipeline runs several stages: filtering for accuracy (if required), trip and section segmentation, smoothing sections, cleaning and resampling data, and mode inference.

I found the pipeline implementation in emission/pipeline/intake_stage.py. Also, emission.analysis.classification.inference.mode.pipeline.py contains the code for the mode inference stage, which includes inserting data into the timeseries dbs.

However, comments there say that using the intake pipeline for mode inference testing may not be correct. The tests in that file instead create the mode inference data step by step through interdependent tests.


I do see that a sample dataset exists which progresses towards getting inferred_section keys, but I'm not sure how the inferred_section file was created. The 1st one below is the raw data, while the 2nd one results from running the intake pipeline.

jack_untracked_time_2023-03-12
jack_untracked_time_2023-03-12.expected_composite_trips 
jack_untracked_time_2023-03-12.inferred_section.expected_composite_trips

Also, I do see an emission run-model pipeline for mode inference, but the code looks incomplete and is not used anywhere else.

Still trying to understand how to generate inferred_sections.

MukuFlash03 commented 9 months ago

We currently have two mode inference algorithms: one based on a Random Forest over sensor data (speed, acceleration, ...), and another based on GIS integration, which makes queries to the Overpass API to read OSM data and then checks whether section starts and ends are near transit stops. For example, for a Walk, Bus, Walk pattern, we look for bus stops within 100 m of the motorized path; if there is a matching bus, we know whether we took transit and/or which specific mode (bus/train) was taken.

seed_model.json is the saved random forest model. It is typically not checked in, since the initial version was collected in an informal class environment. For inferred_sections, there is a seed_model.json in the unit testing directories, built from some test data. We should be able to copy over this file, run the test, and then remove it.

Alternatively, we could use the GIS-based testing branch, which may eventually become the master branch, with the current branch becoming the random-forest branch.

shankari commented 9 months ago

Fixed in https://github.com/e-mission/e-mission-server/pull/937