st-patrick opened 4 years ago
@st-patrick the metrics code in general is here https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/result/metrics
simple_metrics.py is where we actually generate the metrics - it depends upon the mode_section_grouped_df dataframe being computed correctly.
https://github.com/e-mission/e-mission-server/blob/8747e2279393f05f86e6be58f94de77909a4d455/emission/analysis/result/metrics/time_grouping.py#L118 mode_grouped_df = section_group_df.groupby('sensed_mode')
And here's where we group the sections by the sensed mode.
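To make that concrete, here is a minimal, self-contained sketch of the groupby (the toy dataframe is mine; the real section_group_df is built earlier in time_grouping.py):

```python
import pandas as pd

# Toy stand-in for section_group_df; the real dataframe is assembled
# from section entries by time_grouping.py
section_group_df = pd.DataFrame({
    "sensed_mode": ["WALKING", "BICYCLING", "WALKING", "IN_VEHICLE"],
    "distance": [400.0, 2500.0, 650.0, 12000.0],
    "duration": [300.0, 600.0, 450.0, 900.0],
})

# This mirrors the linked line: group sections by the sensed mode
mode_grouped_df = section_group_df.groupby("sensed_mode")

# Each metric (count, distance, duration, ...) is then an aggregate per mode
distance_by_mode = mode_grouped_df["distance"].sum()
```

So once the grouping column changes, every downstream metric changes with it, which is why the fix concentrates on what we group by.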
Let me think through the best way to modify that, for the record.
One option would be to generate analysis/confirmed_sections where the mode is set to the overridden value. Then the change to the metrics code is trivial: we would just set the analysis results key to analysis/confirmed_section and everything would Just Work.

So the question is how to generate the analysis/confirmed_section objects. Note that the trips may not yet be confirmed when the section is analysed. In that case, we can:

1. Create analysis/confirmed_section objects only for sections that have been confirmed. This is not a great solution, because it gets us back to the situation where we have to query for two kinds of objects and merge them. We could theoretically do this for the existing manual/trip_confirm objects anyway.
2. Create analysis/confirmed_section objects for all sections. Essentially, we would have a 1:1:1 mapping between analysis/cleaned_section, analysis/inferred_section and analysis/confirmed_section. This seems like a much better option in terms of usability, so let us explore it further.

The biggest challenge with this approach is that the manual/trip_confirm objects are not synchronized with the analysis pipeline. So we need to handle the case where they appear before the analysis is run, as well as after the analysis is run.
So a rough outline of the proposed design is:

1. Create a new data structure, confirmed_section, with fields for sensed_mode, overridden_mode and final_mode.
2. Add a new pipeline step that creates confirmed_section objects, with the following algorithm for filling in the final_mode:
   final_mode = overridden_mode if overridden_mode is not None else sensed_mode
3. Add a new pipeline step at the beginning that looks through the incoming objects and, for every manual/trip_confirm, updates the corresponding confirmed_section. Make sure to retain the manual/trip_confirm objects, since our rule is that all incoming data is read-only and can never be modified.

You can then use the new objects by using confirmed_section and final_mode instead of inferred_section and sensed_mode in the metrics. This should also potentially make the client code easier, since you can retrieve confirmed_section directly instead of retrieving inferred_section and the manual/* override objects separately and merging them on the phone.
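For concreteness, the final_mode rule as a tiny self-contained sketch (the function name is mine, not from the codebase):

```python
def resolve_final_mode(sensed_mode, overridden_mode):
    # The rule from the outline: a user override wins when present,
    # otherwise we keep the sensed (inferred) mode.
    return overridden_mode if overridden_mode is not None else sensed_mode

# A section the user never confirmed keeps its sensed mode...
unconfirmed = resolve_final_mode("IN_VEHICLE", None)
# ...while a confirmed one carries the override.
confirmed = resolve_final_mode("IN_VEHICLE", "BUS")
```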
This is also consistent with the reproducible pipeline because:

- when the pipeline is reset, we delete the analysis/confirmed_section objects as part of deleting all analysis/* objects, while the manual/* objects are retained
- we never modify the manual/* objects, BUT
- when the pipeline is re-run, we read the retained manual/* objects and incorporate them while creating analysis/inferred_section objects. This is crucial, and one of the main reasons, in addition to race conditions and cases without internet connectivity, why we cannot rely only on processing the manual/* objects as they come in.

That seems like a pretty solid design with no holes.
@st-patrick let me know if you have additional questions @PatGendre @kafitz FYI for understanding design decisions for the future 😄
wrt
This should also potentially make the client code easier since you can retrieve confirmed_section directly instead of retrieving inferred_section and the manual/* override objects separately and merging them on the phone.
you would change inferred -> confirmed and sensed -> final in the GeoJSON export as well https://github.com/e-mission/e-mission-server/blob/master/emission/analysis/plotting/geojson/geojson_feature_converter.py
Then all the data returned from the /timeline/getTrips/<day>
call will already have the overridden values in place. While reading unprocessed data, though, you will have to retain the existing code, since unprocessed = pipeline not run = no inferred sections or confirmed sections. But there might be performance benefits to not having to retrieve the manual/*
objects for processed data.
@st-patrick you probably want to send out a draft PR once you have a significant chunk of the code written so that I can review and give feedback. Since this is a non-trivial change, probably best to make the development/review cycle interactive.
@shankari thanks, this will be a great feature, and thanks @st-patrick for working on it :-) It may also be useful for the bicycle survey under development in Nantes.
The solution you describe seems fine; still, I have a question about the timing between the pipeline and mode_confirm: if the pipeline is run daily, say every night, it may well happen that the user confirms the mode for a trip the day after, or even a few days after, depending on the application (i.e. a weekly survey). I understand that the pipeline does not process data prior to the last processing date (except when reset), so here, the daily pipeline will not process the confirmed_section older than the last day, will it?
Another suggestion: is it possible to add two more fields to the confirmed_section: purpose (as it can be useful for many use cases), and say "custominfo", a field that could be used for asking the user any additional info at trip confirmation time? This would save the developer from creating another field in the database; he/she would then just have to complete the display/dashboard feature if needed, but not modify the data structure. I must admit I do not have a direct use case for this request, but it seems likely to be useful.
still I have a question about the timing between pipeline and mode_confirm: I understand that the pipeline does not process data prior to the last processing date (except when reset), so here, the daily pipeline will not process the confirmed_section older than the last day, will it?
You are absolutely correct: the existing section segmentation and inference steps will not process the older confirmed_sections. But that's why I propose adding a "new pipeline step at the beginning that looks through the incoming objects and for every manual/trip_confirm, update the corresponding confirmed_section."
Note that every pipeline step manages its own last_processed_ts. This new pipeline step at the beginning would either check the objects before moving them from the usercache to the timeseries (e.g. before the move to long_term), or it would update its own last_processed_ts based on the write timestamp.
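As a sketch of how such a step could work, using the write timestamp (all names and object shapes here are illustrative stand-ins, not the actual e-mission API):

```python
def handle_incoming_confirmations(confirm_objs, confirmed_sections, last_processed_ts):
    """Hypothetical sketch of the proposed first pipeline step: scan
    manual/trip_confirm entries written since last_processed_ts and patch
    the matching confirmed_section. Plain dicts stand in for the real
    timeseries entries."""
    new_last = last_processed_ts
    for confirm in confirm_objs:
        if confirm["write_ts"] <= last_processed_ts:
            continue  # already handled in an earlier run
        target = confirmed_sections.get(confirm["section_id"])
        if target is not None:
            # the manual/* object itself stays read-only; only the
            # analysis output (confirmed_section) is modified
            target["overridden_mode"] = confirm["label"]
            target["final_mode"] = confirm["label"]
        new_last = max(new_last, confirm["write_ts"])
    return new_last  # becomes this step's last_processed_ts

sections = {"s1": {"overridden_mode": None, "final_mode": "IN_VEHICLE"}}
confirms = [{"section_id": "s1", "label": "BUS", "write_ts": 100.0}]
new_ts = handle_incoming_confirmations(confirms, sections, 0.0)
```

Keying on write_ts rather than trip time is what lets a confirmation that arrives days later still be picked up on the next run.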
So to return to your use case
if the user confirms the mode for a trip the day after, or even a few days after, depending on the application (i.e. a weekly survey)
then the confirmation will be saved as manual/mode_confirm in the local phone DB, and it will be pushed up to the usercache as usual. When it is moved from the usercache to the timeseries, or right after that, the new pipeline step will find the corresponding section and modify it.

So, before the manual/mode_confirm is processed, the confirmed_section will have the automatically inferred value; after it is processed, it will have the overridden value. But it is an analysis output which will be deleted when the pipeline is reset, so it doesn't break the reproducibility guarantees.
@PatGendre Does this make sense?
is it possible to add two more fields to the confirmed_section: purpose (as it can be useful for many use cases), and say "custominfo", a field that could be used for asking the user any additional info at trip confirmation time? This would save the developer from creating another field in the database; he/she would then just have to complete the display/dashboard feature if needed, but not modify the data structure.
Definitely makes sense to add purpose. Not that sure about custom info because the standard algorithms can't process it without understanding its structure. Can we defer that until it is needed so that we don't overengineer?
@shankari
that's why I propose adding a "new pipeline step at the beginning that looks through the incoming objects and for every manual/trip_confirm, update the corresponding confirmed_section." ... So, before the manual/mode_confirm is processed, the confirmed_section will have the automatically inferred value; after it is processed, it will have the overridden value. But it is an analysis output which will be deleted when the pipeline is reset, so it doesn't break the reproducibility guarantees. @PatGendre Does this make sense?
Yes, it's very clever, I think it will work for the proposed use too :-)
Definitely makes sense to add purpose. Not that sure about custom info . Can we defer that until it is needed so that we don't overengineer?
Yes
Just for the record, I just wanted to point out a slight difference between this step and other previous steps. The previous steps are identical under replay. So every time you run the pipeline, the output of every intermediate step will be identical to the previous runs.
But for the proposed new handling_incoming_confirmations and generate_confirmed_sections steps, the first run after the confirmation is received will potentially be different from subsequent runs. With @PatGendre's use case, in the first run, the confirmed_section will be changed by the handling_incoming_confirmations step; in subsequent runs, it will be generated by the generate_confirmed_sections step.
This is not really an issue - as we have seen, the design will still work. But it is a subtle little difference that we should note for interaction with subsequent design decisions.
@jf87 fyi in case this is useful for you too 😄
Quite frankly, I don't understand where I would even begin with these changes.
Over the last few days, I have tried to get the trip data with the labels at every request but couldn't find anything; since dataframes and timeseries are not serializable, debugging was quite the headache.
Is there any way we could implement a serialization for that?
Also, I don't think I will have time to work on the above-mentioned solution, simply because we only have one week left and I just don't have the comprehension of the server code that is needed for that.
But if you can draft something a little more practical and add some hints for debugging, especially how to access dataframe and timeseries data, that would be a great help. There's probably a really simple way I didn't see. I saw that in time_grouping you used .iloc[i], but since the data doesn't contain any labels at that point, it didn't really help my case.
@st-patrick dataframes are pandas dataframes. There are tons of pandas tutorials on the internet - it is part of the standard data science toolkit.
https://duckduckgo.com/?q=pandas+dataframe+tutorial&t=ffsb&ia=web should help you get started.
Not sure what you mean by serializable; do you mean that they don't print properly? How are you trying to print them? I don't have time to test this right now, but per stackoverflow, print(df) should work, and that's what I remember as well.
https://stackoverflow.com/questions/49826909/how-to-print-out-dataframe-in-python
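For reference, a tiny self-contained pandas example showing the usual inspection tools mentioned above:

```python
import pandas as pd

# Toy dataframe standing in for the section dataframes on the server
df = pd.DataFrame({"sensed_mode": ["WALKING", "BICYCLING"],
                   "duration": [300.0, 600.0]})

print(df)            # plain print renders the dataframe as a text table
print(df.iloc[0])    # .iloc[i] gives the i-th row as a Series
print(df.to_json())  # to_json()/to_dict() give serializable forms
```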
if you can draft something a little more practical
Here are even more detailed steps:

1. Create a new data structure confirmed_section (similar to https://github.com/e-mission/e-mission-server/pull/517)
2. Add a new pipeline step to emission/pipeline/intake_stage.py. emission/analysis/classification/inference/mode/rule_engine.py is an example of a simple pipeline step, where in runPredictionPipeline, you find the unprocessed sections, process them and save the results.
3. In the new step, for each inferred section, find the corresponding confirm object (if any) and create a confirmed_section.
4. Change the analysis.result.section.key in conf/analysis/debug.conf.json to analysis/confirmed_section.
This will get you a working solution in the case where the confirmation happens almost immediately.
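A very rough sketch of steps 2 and 3 above, loosely following the find-unprocessed/process/save shape of runPredictionPipeline (all helper names and object shapes below are invented for illustration):

```python
def create_confirmed_objects(inferred_sections, confirm_lookup):
    """Hypothetical sketch: walk the unprocessed inferred sections, attach
    the user confirmation if one exists, and emit one confirmed_section per
    inferred section (the 1:1:1 mapping discussed above). confirm_lookup
    maps section id -> user-confirmed mode."""
    results = []
    for section in inferred_sections:
        override = confirm_lookup.get(section["id"])
        results.append({
            "section_id": section["id"],
            "sensed_mode": section["sensed_mode"],
            "overridden_mode": override,
            "final_mode": override if override is not None else section["sensed_mode"],
        })
    return results

out = create_confirmed_objects(
    [{"id": "s1", "sensed_mode": "WALKING"},
     {"id": "s2", "sensed_mode": "IN_VEHICLE"}],
    {"s2": "BUS"})
```

The real step would of course read from and save to the timeseries rather than passing lists around.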
Once I review that, the second pipeline step should be much easier and will handle the other case in which the confirmation comes later.
Since there have been requests for this from both DFKI and Heidelberg, and now I need it for the CEO e-bike project, I am going to tackle this issue now.
@lefterav @jf87 @EstherEU @PatGendre
@shankari that's great :-) This would be useful, I guess, if you can include purpose, not only the overridden mode.
Picking this up again: one challenge is that the confirmation currently happens at the trip level and not the section level. How do we then deal with creating confirmed_section objects correctly?
One option is to only create confirmed_trip objects, not confirmed_section. If/when we support trip editing, we can create confirmed_section objects.
But then how would we deal with confirmed_trip objects that don't have user input associated with them? They can have multiple sections, and we would want to use them.
This will be fixed if/when we have trip editing in place but we need to figure out what to do as a temporary workaround.
Another challenge is that some of the data is genuinely represented at the trip level. For example, a trip has a purpose, not a section.
@shankari I agree with you, the major difficulty may be that the mode and purpose buttons are at the trip level, so there is no clear way to infer modes at the section level.
Actually, the mode button should be named "principal mode for the trip" and could possibly be pre-filled with the mode totalling most of the trip length, so that we could have a relation between the trip (principal) mode and the section modes (thus possibly changing section modes, i.e. confirmed section modes, if the principal mode is modified)... but that would be complicated to implement and to understand for the end user!
@PatGendre @robfitzgerald @jf87 @asiripanich since all of you have worked with the data model, feedback would be appreciated
- We will have both confirmed_trip and confirmed_section.
- confirmed_trip will have a field for confirmed_vals, with primary_mode and purpose entries. For branches that use embedded surveys, this will have the survey JSON.
  - confirmed_trip
    - confirmed_vals
      - primary_mode
      - purpose
- confirmed_trip will also have an inferred_vals with a primary_mode entry on master. This will be the same even for branches that use embedded surveys, since they do not include any additional inference algorithms.
- confirmed_section will also have user_input. The user_input will have only a mode entry. Only the primary section of a trip will have the user_input set.
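Putting the proposed fields together, a confirmed_trip/confirmed_section pair on master might look roughly like this (values and exact nesting are illustrative, not the final schema):

```python
# Hypothetical document shapes for the proposal above, on master
confirmed_trip = {
    "confirmed_vals": {          # from the user's big buttons
        "primary_mode": "ebike",
        "purpose": "commute",
    },
    "inferred_vals": {           # from the inference pipeline
        "primary_mode": "BICYCLING",
    },
}

confirmed_section = {
    "sensed_mode": "BICYCLING",
    # only the primary section of the trip gets user_input set
    "user_input": {"mode": "ebike"},
}
```

On survey branches, confirmed_vals would hold the survey JSON instead of the mode/purpose entries.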
Determining the primary section of a trip:

- if there is only one section, it is primary. Due to COVID, we are likely to have many unimodal trips now.
- if there is more than one section, it is the longest section.
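A minimal sketch of that selection rule (plain dicts stand in for the real section objects; the function name is mine):

```python
def find_primary_section(sections):
    """The rule above: a single section is primary; otherwise the longest
    section is primary. 'Longest' here is by distance, one of the two
    options discussed below."""
    if not sections:
        return None
    if len(sections) == 1:
        return sections[0]
    return max(sections, key=lambda s: s["distance"])

primary = find_primary_section([
    {"mode": "WALKING", "distance": 400.0},
    {"mode": "BUS", "distance": 5000.0},
    {"mode": "WALKING", "distance": 250.0},
])
```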
Using the confirmed sections for calculations:

- For most of the pre-defined modes, we can determine what the calculation factors (energy efficiency/carbon emissions) are. But we do allow users to enter their own modes, and it is not clear how we can handle those in calculations.
Since all our calculations are currently based on mode, if the confirmed_mode is one for which we don't have a calculation factor, we will use the sensed mode instead.
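Sketched out, the fallback rule could look like this (the factor table is a placeholder; the real factors live elsewhere in the codebase and are not reproduced here):

```python
# Placeholder calculation-factor table, keyed by mode; values are
# made-up kg-CO2-per-km figures purely for illustration.
CALC_FACTORS = {"WALKING": 0.0, "BICYCLING": 0.0, "CAR": 0.27, "BUS": 0.09}

def mode_for_calculation(confirmed_mode, sensed_mode):
    """The fallback rule above: use the confirmed mode when we know its
    calculation factor, otherwise fall back to the sensed mode."""
    if confirmed_mode in CALC_FACTORS:
        return confirmed_mode
    return sensed_mode

# A custom user-entered mode like "skateboard" has no factor, so the
# sensed mode is used for the calculation instead.
fallback = mode_for_calculation("skateboard", "WALKING")
```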
Whoo! That was a bit complicated but I think it works for now.
From a UI perspective, we will continue to show the confirmed_vals (if they exist) in the big buttons of the trip diary and the inferred mode from the sections at the top - so this just simplifies and pre-computes the values for now.
However, in the dashboard and the CEO ebike gamification, we can switch to confirmed trips and confirmed sections. Concretely, for the CEO ebike gamification, we can use confirmed_trips to determine the number of ebike trips and the % of travel by ebike.
For the dashboard, we can switch to getting the metrics from confirmed_sections. But if we only get metrics, how do we do the mapping from unknown modes to the corresponding sensed mode? I guess we can do that mapping in the metrics calculation for now.
i agree with the general strategy here, to follow the pattern of building these immutable documents and not to lose any information. regarding the algorithmic selection of primary modes, a few thoughts (which both may not be helpful at this point):
if there is more than one section, it is the longest section
longest by distance or time? also, i could imagine this might differ by survey, where users may want to inject their own "primary section function" (like, for instance, "if {driving|transit} appears anywhere, set it as primary").
Since all our calculations are currently based on mode, if the confirmed_mode is one for which we don't have a calculation factor, we will use the sensed mode instead
might be helpful (?) to write this "analysis_mode" to a field so the user has a record of it.
Whoo!
🦉
longest by distance or time? also, i could imagine this might differ by survey, where users may want to inject their own
Good point. I was going to use distance, since the primary mode is typically motorized, and likely to be much faster than the modes used for the first and last leg.
"primary section function" (like, for instance, "if {driving|transit} appears anywhere, set it as primary").
my assumption was that if people wanted to change this, they would change it in the code. But I guess I could have people pass in a primary section function instead. I might not do that in this first pass though, pending evidence that people actually need it.
I guess we can do that mapping in the metrics calculation for now.
We can actually do this by adding an entry for mode_for_calculation for each section on the server. At some point, we can actually move the CO2 calculation to the server so it can be based on the energy profile of transportation in the trip location.
We originally started calculating values on the client because we were also calculating the calories burned and we didn't want to send over weight and height information to the server. But the CO2/EE calculations are making more and more sense on the server.
hi @shankari
here are a few thoughts:
primary section [...] it is the longest section
It is still not clear to me whether the end user will really understand what "primary section" vs "primary mode" means, and what it implies in terms of metrics calculation. An alternative would be to link the trip mode and the section mode only if there is only one section, or if the longest section is obviously the primary section, say if its length is more than half of the trip length, and otherwise to leave the primary section confirmed mode blank... and wait until there is a full trip/section editing feature.
Since all our calculations are currently based on mode, if the confirmed_mode is one for which we don't have a calculation factor, we will use the sensed mode instead.
This is a reasonable rule. In the future, a feature could be added so that the end user can state that a mode is similar to an existing mode; for example, wheelchair is similar to walking in terms of calculation, and e-scooter similar to ebike... Anyway, as long as the calculation parameters are not personalised for each end user, the calculations are only indicative (e.g. for car, the emission figures can vary a lot from one model to another).
We originally started calculating values on the client because we were also calculating the calories burned and we didn't want to send over weight and height information to the server. But the CO2/EE calculations are making more and more sense on the server.
I agree, as long as the weight/height privacy can be managed and/or the end-user agrees to send these personal data to the server.
I agree, as long as the weight/height privacy can be managed and/or the end-user agrees to send these personal data to the server.
I think we would split the calculations. CO2/EE would be on the server; calorie (which doesn't depend on motorized mode) would be on the phone.
It is still not clear to me if the end-user will really understand what "primary section" vs "primary mode" means, and what it implies in terms of metrics calculation. An alternative would be to link between trip mode and section mode only if there is only one section, or if the longest section if obviously the primary section, say if its length is more then the half of the trip length. And otherwise to leave the primary section confirmed mode blank ... and wait until there is a full trip/section edition feature.
I want to clarify that this will not necessarily affect the UI at this time. The UI currently displays section modes at the top of the card, and displays trip mode overrides at the bottom. The proposed data model will allow us to continue doing that.
@shankari Thank you for clarifying, I did not get that point.
The main user-visible difference will be in the dashboard and the calculations
Also, the diary screen code could automatically label the primary mode button (with the primary section's inferred mode when a section makes up >50% of the trip length), but it might not be very useful.
Also, the diary screen code could automatically label the primary mode button (with the primary section's inferred mode when a section makes up >50% of the trip length), but it might not be very useful.
Yes, that would also be confusing, as you pointed out earlier. We already show the inferred mode at the top of the trip card, so it is not like the information is missing. We can change this later if we have time to run some user tests.
while implementing this, I modified the server code to be consistent with the phone. And found that there were user inputs that didn't match any trips. While investigating that further, I discovered that:
Experimenting further, the trips do have matches before the pipeline is run, but the matches break once the pipeline runs. Need to investigate this and fix both phone and server implementations.
In draft mode, we have:
9:19 -> 9:32: Bike
9:34 -> 9:55: Bike
5:28 -> 5:49: Bike
5:54 -> 6:46: Walk, not a trip
7:02 -> 7:50: Bike
After increasing the "end of trip" buffer to 15 mins, we get:
9:19 -> 9:29: Bike
9:33 -> 9:46: Bike
5:22 -> 5:56: Bike
6:54 -> 7:21: blank
The last entry doesn't match because the gap is fairly large (30 mins). On the phone, any attempt at fixing that would require additional server calls. But on the server, we could try to check the raw trips.
This is really weird. The cleaned trips are:
[{'_id': ObjectId('5fda8a44b368d4a4b76d0042'),
  'data': {'start_fmt_time': '2016-12-12T17:22:22.062618-08:00',
           'end_fmt_time': '2016-12-12T17:56:53.030000-08:00'}},
 {'_id': ObjectId('5fda8a45b368d4a4b76d008b'),
  'data': {'start_fmt_time': '2016-12-12T18:54:58.134886-08:00',
           'end_fmt_time': '2016-12-12T19:21:30.623000-08:00'}}]
The raw trips are:
[{'_id': ObjectId('5fda8a42b368d4a4b76cffe0'),
'data': {'start_fmt_time': '2016-12-12T17:27:24.524000-08:00',
'end_fmt_time': '2016-12-12T17:56:53.030000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe2'),
'data': {'start_fmt_time': '2016-12-12T18:07:22.524000-08:00',
'end_fmt_time': '2016-12-12T18:09:27.147000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe4'),
'data': {'start_fmt_time': '2016-12-12T18:38:25.007000-08:00',
'end_fmt_time': '2016-12-12T18:39:59.749000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe6'),
'data': {'start_fmt_time': '2016-12-12T19:02:04.350000-08:00',
'end_fmt_time': '2016-12-12T19:21:30.623000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe8'),
'data': {'start_fmt_time': '2016-12-12T19:27:35.382000-08:00',
'end_fmt_time': '2016-12-12T19:29:05.394000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffea'),
'data': {'start_fmt_time': '2016-12-12T19:46:59.088000-08:00',
'end_fmt_time': '2016-12-12T19:48:02.122000-08:00'}}]
draft trips:
5:28 -> 5:49
5:54 -> 6:46
7:02 -> 7:50
cleaned trips:
5:22 -> 5:56
6:54 -> 7:21
raw trips:
5:27 -> 5:56
6:07 -> 6:09
6:38 -> 6:39
7:02 -> 7:21
7:27 -> 7:29
7:46 -> 7:48
So I don't think that the raw trips will work either.
I can think of two potential fixes. First, on the phone, we change the first part of the check to trip_start <= ui_start <= trip_end. This is consistent with the query on the server, so we can skip that part, or retain it for maintainability.
For the second part, if the current checks fail, on the server, we can check to see:
On the phone, check (2) would require another server call. Check (1) could be done locally iff it was not the last trip of the day. But we will also change the geojson generation code to use confirmed_trips, so eventually the server code will be the only code.
And this is already broken on the phone, so it is not like it will break any further.
I ended up fixing this on the phone too with option (1), just to ensure that it worked. In case this was the last trip of the day, I checked to ensure that the user input didn't span two days, and then returned true. With this fix, all user inputs are visible even for cleaned trips.
Ah, but there is a wrinkle.
Both the draft trips (5:28 -> 5:49 and 5:54 -> 6:46) match the new criteria for the cleaned trip (5:22 -> 5:56). And assuming the user went through and labelled in order, the second label will be more recent than the first, and will be returned, although the first one is clearly the better match, because it overlaps more substantially than the second. We need to make the selection of the potential candidates more sophisticated as well.
But that is super complicated because you are going to have dueling orders and we have to put some thresholds anyway. Let's just add a validity check which says that the overlap should be at least 50% for the user label to make sense. That makes sense overall, because if it was off by a lot, then maybe the user label was not that relevant anyway.
That seems to fix it for this case and it seems pretty straightforward as well 👍 Another option is to just reject "Not a trip" entries.
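The 50% validity check is straightforward to sketch (timestamps as plain floats; these helper names are mine, not the phone or server implementation):

```python
def overlap_ratio(trip_start, trip_end, input_start, input_end):
    """Fraction of the trip covered by the user input; non-overlapping
    intervals yield 0.0."""
    overlap = min(trip_end, input_end) - max(trip_start, input_start)
    return max(overlap, 0.0) / (trip_end - trip_start)

def is_valid_match(trip_start, trip_end, input_start, input_end, threshold=0.5):
    """The validity check above: accept the user label only when it
    overlaps at least half of the trip."""
    return overlap_ratio(trip_start, trip_end, input_start, input_end) >= threshold
```

The ratios printed in the logs below (0.0587 and 0.732) are exactly this overlap fraction, so the first candidate is rejected and the second accepted.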
For the 5:22 -> 5:56 trip
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:28:44-08:00 -> 2016-12-12T17:49:48-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are true && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:54:51-08:00 -> 2016-12-12T18:46:37-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are true && true end checks are false || false)
Second level of end checks when the next trip is defined(1481597197.113 <= 1481597698.1348858) = true
Flipped endCheck, overlap(121.53799986839294)/trip(2070.9673824310303) = 0.05868658333272448
Cleaned trip: comparing user = 2016-12-12T19:02:04-08:00 -> 2016-12-12T19:50:04-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are true && false end checks are false || false)
In getUserInputForTripStartEnd, one potential candidate, returning 2016-12-12T17:28:44-08:00(1481592524.076) -> 2016-12-12T17:49:48-08:00(1481593788.738) pick_drop logged at 1516072736.493696
For the 6:54 -> 7:21 case:
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:28:44-08:00 -> 2016-12-12T17:49:48-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:54:51-08:00 -> 2016-12-12T18:46:37-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T19:02:04-08:00 -> 2016-12-12T19:50:04-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are true && true end checks are false || false)
Second level of end checks for the last trip of the day
compare 12 with 12 = true
Flipped endCheck, overlap(1166.2730000019073)/trip(1592.488114118576) = 0.732358998263184
In getUserInputForTripStartEnd, one potential candidate, returning 2016-12-12T19:02:04-08:00(1481598124.35) -> 2016-12-12T19:50:04-08:00(1481601004.01) pick_drop logged at 1516072777.245432
And server-side test passes as well
----------------------------------------------------------------------
Ran 1 test in 9.620s
One final consideration while creating the pipeline is how to set the timestamps for pipeline states. We will need to use the write_ts of the user input but the end_ts of the confirmed trips. Let's walk through how this will work in both cases and ensure that there are no surprises.
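As a sketch of that bookkeeping (object shapes are simplified stand-ins for the real timeseries entries, and the function is mine):

```python
def update_pipeline_state(user_inputs, confirmed_trips):
    """Illustrative sketch of the timestamp rule above: the user-input step
    advances by the write_ts of incoming inputs (they can arrive days after
    the trip happened), while the trip step advances by the end_ts of the
    confirmed trips it emits."""
    input_ts = max((ui["metadata"]["write_ts"] for ui in user_inputs), default=0.0)
    trip_ts = max((ct["data"]["end_ts"] for ct in confirmed_trips), default=0.0)
    return {"input_step_last_processed_ts": input_ts,
            "trip_step_last_processed_ts": trip_ts}

state = update_pipeline_state(
    [{"metadata": {"write_ts": 200.0}}, {"metadata": {"write_ts": 150.0}}],
    [{"data": {"end_ts": 120.0}}])
```

Using write_ts for the input step is what prevents a late-arriving label from being skipped as "older than the last processed date".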
Let's try a different approach:
user confirms in draft mode
user confirms in confirmed mode
Actually, that only fixed the trips, we also need to fix sections.
Hi @shankari, you've been improving the trip labeling process lately and introduced a new confirmed_trip key, but it is not clear to me; I still have at least 2 questions:
@PatGendre the deployer dashboard (emdash) uses the user labeled modes, but the in-app dashboard (the "metrics" screen) does not, at least partially because I wasn't sure how to deal with the mismatch between trip and section labeling that you outlined.
The A-mission branch of the project (https://github.com/xubowenhaoren/A-Mission) from UW supports confirming sections, along with a bunch of accessibility improvements. If somebody wanted to merge the changes over to master, it would help a lot wrt modifying the in-app dashboard as well.
@shankari thanks! Yann will have a look at A-mission; of course I'll tell you if we envisage merging this appealing section labeling into master (we'd have to find a budget).
@shankari FYI: as there is no budget to complete the labeling feature with indicators that take into account the modes labeled manually by the user, it was decided to remove the labeling feature (at least for a few months), because we think the user won't understand why the (mode, purpose) labels he enters have no effect on the dashboard indicators.
And I have another question, as I've seen that GabrielKS is implementing a "label inference pipeline" : do you have a kind of "functional spec" of what this pipeline will do (on the server and on the app side)? Thanks
@PatGendre I apologize for the delay in responding to your comments about the confirmed_trip objects, but the CanBikeCO deployments are just starting up, and I was on vacation for a week.
And I have a couple of interns working on improving the labeling by determining common and novel trips. The related issues are: for the analysis: https://github.com/e-mission/e-mission-docs/issues/606 for the system integration: https://github.com/e-mission/e-mission-docs/issues/647
This is a key component of our ongoing work since there is information that we cannot automatically detect - e.g. purpose and replaced mode. So we have to urgently reduce the user labeling burden.
@shankari no problem, it is not an urgent question for us!
Thanks for your reply, I understand better what you intend to do.
If I understand well enough, with what your interns will implement, there will still be the question of labeling modes at the trip level while the actual mode is at the section level.
FYI Fouad and Yann worked on clustering stops outside of e-mission in postgis in 2019, so as to produce statistics on frequent places and frequent trips between places; even if we didn't intend to try ML on automatic mode/purpose labeling, it was already interesting.
We would like to do this again with La Rochelle, but in a python notebook rather than in postgis, and (looking at it quickly) I've found the k-means clustering methods of postgis available in shapely for python...
We want to use the transportation mode label to actually impact the metrics shown inside the dashboard.
Since the labels are currently "decoration only", we need to implement some code that e.g. adds an endpoint that will basically do the same as the userMetrics endpoint, but use the label as the mode of transportation if a label has been applied to that trip.
My question being: in which file would I best start to look for a place to implement this feature?