st-patrick opened 4 years ago
@st-patrick the metrics code in general is here https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/result/metrics
simple_metrics.py is where we actually generate the metrics - it depends upon the mode_section_grouped_df dataframe being computed correctly.
https://github.com/e-mission/e-mission-server/blob/8747e2279393f05f86e6be58f94de77909a4d455/emission/analysis/result/metrics/time_grouping.py#L118 mode_grouped_df = section_group_df.groupby('sensed_mode')
And here's where we group the sections by the sensed mode.
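To make that concrete, here is a minimal, self-contained sketch of the groupby (the toy dataframe is mine; the real section_group_df is built earlier in time_grouping.py):

```python
import pandas as pd

# Toy stand-in for section_group_df; the real dataframe is assembled
# from section entries by time_grouping.py
section_group_df = pd.DataFrame({
    "sensed_mode": ["WALKING", "BICYCLING", "WALKING", "IN_VEHICLE"],
    "distance": [400.0, 2500.0, 650.0, 12000.0],
    "duration": [300.0, 600.0, 450.0, 900.0],
})

# This mirrors the linked line: group sections by the sensed mode
mode_grouped_df = section_group_df.groupby("sensed_mode")

# Each metric (count, distance, duration, ...) is then an aggregate per mode
distance_by_mode = mode_grouped_df["distance"].sum()
```

So once the grouping column changes, every downstream metric changes with it, which is why the fix concentrates on what we group by.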
Let me think through the best way to modify that, for the record.
One option would be to generate analysis/confirmed_sections where the mode is set to the overridden value. Then the change to the metrics code is trivial: we would just set the analysis results key to analysis/confirmed_section and everything would Just Work.

So the question is how to generate the analysis/confirmed_section objects. Note that the trips may not yet be confirmed when the section is analysed. In that case, we can:

1. Create analysis/confirmed_section objects only for sections that have been confirmed. This is not a great solution, because it gets us back to the situation where we have to query for two kinds of objects and merge them. We could theoretically do this for the existing manual/trip_confirm objects anyway.
2. Create analysis/confirmed_section objects for all sections. Essentially, we would have a 1:1:1 mapping between analysis/cleaned_section, analysis/inferred_section and analysis/confirmed_section. This seems like a much better option in terms of usability, so let us explore it further.

The biggest challenge with this approach is that the manual/trip_confirm objects are not synchronized with the analysis pipeline. So we need to handle the case where they appear before the analysis is run, as well as after the analysis is run.
So a rough outline of the proposed design is:

1. Create a new data structure, confirmed_section, with fields for sensed_mode, overridden_mode and final_mode.
2. Add a new pipeline step that creates confirmed_section objects, with the following algorithm for filling in the final_mode:
   final_mode = overridden_mode if overridden_mode is not None else sensed_mode
3. Add a new pipeline step at the beginning that looks through the incoming objects and, for every manual/trip_confirm, updates the corresponding confirmed_section. Make sure to retain the manual/trip_confirm objects, since our rule is that all incoming data is read-only and can never be modified.

You can then use the new objects by using confirmed_section and final_mode instead of inferred_section and sensed_mode in the metrics. This should also potentially make the client code easier, since you can retrieve confirmed_section directly instead of retrieving inferred_section and the manual/* override objects separately and merging them on the phone.
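For concreteness, the final_mode rule as a tiny self-contained sketch (the function name is mine, not from the codebase):

```python
def resolve_final_mode(sensed_mode, overridden_mode):
    # The rule from the outline: a user override wins when present,
    # otherwise we keep the sensed (inferred) mode.
    return overridden_mode if overridden_mode is not None else sensed_mode

# A section the user never confirmed keeps its sensed mode...
unconfirmed = resolve_final_mode("IN_VEHICLE", None)
# ...while a confirmed one carries the override.
confirmed = resolve_final_mode("IN_VEHICLE", "BUS")
```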
This is also consistent with the reproducible pipeline because:

- when the pipeline is reset, we delete the analysis/confirmed_section objects as part of deleting all analysis/* objects, while the manual/* objects are retained
- we never modify the manual/* objects, BUT
- when the pipeline is re-run, we read the retained manual/* objects and incorporate them while creating analysis/inferred_section objects. This is crucial, and one of the main reasons, in addition to race conditions and cases without internet connectivity, why we cannot rely only on processing the manual/* objects as they come in.

That seems like a pretty solid design with no holes.
@st-patrick let me know if you have additional questions @PatGendre @kafitz FYI for understanding design decisions for the future 😄
wrt
This should also potentially make the client code easier since you can retrieve confirmed_section directly instead of retrieving inferred_section and the manual/* override objects separately and merging them on the phone.
you would change inferred -> confirmed and sensed -> final in the GeoJSON export as well https://github.com/e-mission/e-mission-server/blob/master/emission/analysis/plotting/geojson/geojson_feature_converter.py
Then all the data returned from the /timeline/getTrips/<day>
call will already have the overridden values in place. While reading unprocessed data, though, you will have to retain the existing code, since unprocessed = pipeline not run = no inferred sections or confirmed sections. But there might be performance benefits to not having to retrieve the manual/*
objects for processed data.
@st-patrick you probably want to send out a draft PR once you have a significant chunk of the code written so that I can review and give feedback. Since this is a non-trivial change, probably best to make the development/review cycle interactive.
@shankari thanks, this will be a great feature, and thanks @st-patrick for working on it :-) It may also be useful for the bicycle survey under development in Nantes.
The solution you describe seems fine; still, I have a question about the timing between the pipeline and mode_confirm: if the pipeline is run daily, say every night, it may well happen that the user confirms the mode for a trip the day after, or even a few days after, depending on the application (i.e. a weekly survey). I understand that the pipeline does not process data prior to the last processing date (except when reset), so here, the daily pipeline will not process the confirmed_section older than the last day, will it?
Another suggestion: is it possible to add two more fields to the confirmed_section: purpose (as it can be useful for many use cases), and say "custominfo", a field that could be used for asking the user any additional info at trip confirmation time? This would save the developer from creating another field in the database; he/she would then just have to complete the display/dashboard feature if needed, but not modify the data structure. I must admit I do not have a direct use case for this request, but it seems likely to be useful.
still I have a question about the timing between pipeline and mode_confirm: I understand that the pipeline does not process data prior to the last processing date (except when reset), so here, the daily pipeline will not process the confirmed_section older than the last day, will it?
You are absolutely correct: the existing section segmentation and inference steps will not process the older confirmed_sections. But that's why I propose adding a "new pipeline step at the beginning that looks through the incoming objects and for every manual/trip_confirm, update the corresponding confirmed_section."
Note that every pipeline step manages its own last_processed_ts. This new pipeline step at the beginning would either check the objects before moving them from the usercache to the timeseries (e.g. before the move to long_term), or it would update its own last_processed_ts based on the write timestamp.
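As a sketch of how such a step could work, using the write timestamp (all names and object shapes here are illustrative stand-ins, not the actual e-mission API):

```python
def handle_incoming_confirmations(confirm_objs, confirmed_sections, last_processed_ts):
    """Hypothetical sketch of the proposed first pipeline step: scan
    manual/trip_confirm entries written since last_processed_ts and patch
    the matching confirmed_section. Plain dicts stand in for the real
    timeseries entries."""
    new_last = last_processed_ts
    for confirm in confirm_objs:
        if confirm["write_ts"] <= last_processed_ts:
            continue  # already handled in an earlier run
        target = confirmed_sections.get(confirm["section_id"])
        if target is not None:
            # the manual/* object itself stays read-only; only the
            # analysis output (confirmed_section) is modified
            target["overridden_mode"] = confirm["label"]
            target["final_mode"] = confirm["label"]
        new_last = max(new_last, confirm["write_ts"])
    return new_last  # becomes this step's last_processed_ts

sections = {"s1": {"overridden_mode": None, "final_mode": "IN_VEHICLE"}}
confirms = [{"section_id": "s1", "label": "BUS", "write_ts": 100.0}]
new_ts = handle_incoming_confirmations(confirms, sections, 0.0)
```

Keying on write_ts rather than trip time is what lets a confirmation that arrives days later still be picked up on the next run.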
So to return to your use case
if the user confirms the mode for a trip the day after, or even a few days after, depending on the application (i.e. a weekly survey)
then the confirmation will be saved as manual/mode_confirm in the local phone DB, and it will be pushed up to the usercache as usual. When it is moved from the usercache to the timeseries, or right after that, the new pipeline step will find the corresponding section and modify it.

So, before the manual/mode_confirm is processed, the confirmed_section will have the automatically inferred value; after it is processed, it will have the overridden value. But it is an analysis output which will be deleted when the pipeline is reset, so it doesn't break the reproducibility guarantees.
@PatGendre Does this make sense?
is it possible to add two more fields to the confirmed_section: purpose (as it can be useful for many use cases), and say "custominfo", a field that could be used for asking the user any additional info at trip confirmation time? This would save the developer from creating another field in the database; he/she would then just have to complete the display/dashboard feature if needed, but not modify the data structure.
Definitely makes sense to add purpose. Not that sure about custom info because the standard algorithms can't process it without understanding its structure. Can we defer that until it is needed so that we don't overengineer?
@shankari
that's why I propose adding a "new pipeline step at the beginning that looks through the incoming objects and for every manual/trip_confirm, update the corresponding confirmed_section." ... So, before the manual/mode_confirm is processed, the confirmed_section will have the automatically inferred value; after it is processed, it will have the overridden value. But it is an analysis output which will be deleted when the pipeline is reset, so it doesn't break the reproducibility guarantees. @PatGendre Does this make sense?
Yes, it's very clever, I think it will work for the proposed use too :-)
Definitely makes sense to add purpose. Not that sure about custom info . Can we defer that until it is needed so that we don't overengineer?
Yes
Just for the record, I just wanted to point out a slight difference between this step and other previous steps. The previous steps are identical under replay. So every time you run the pipeline, the output of every intermediate step will be identical to the previous runs.
But for the proposed new handling_incoming_confirmations and generate_confirmed_sections steps, the first run after the confirmation is received will potentially be different from subsequent runs. With @PatGendre's use case, in the first run, the confirmed_section will be changed by the handling_incoming_confirmations step; in subsequent runs, it will be generated by the generate_confirmed_sections step.
This is not really an issue - as we have seen, the design will still work. But it is a subtle little difference that we should note for interaction with subsequent design decisions.
@jf87 fyi in case this is useful for you too 😄
Quite frankly, I don't understand where I would even begin with these changes.
Over the last few days, I have tried to get the trip data with the labels at every request but couldn't find anything; since dataframes and timeseries are not serializable, debugging was quite the headache.
Is there any way we could implement a serialization for that?
Also, I don't think I will have time to work on the above-mentioned solution, simply because we only have one week left and I just don't have the comprehension of the server code that is needed for that.
But if you can draft something a little more practical and add some hints for debugging, especially how to access dataframe and timeseries data, that would be a great help. There's probably a really simple way I didn't see. I saw that in time_grouping you used .iloc[i], but since the data doesn't contain any labels at that point, it didn't really help my case.
@st-patrick dataframes are pandas dataframes. There are tons of pandas tutorials on the internet - it is part of the standard data science toolkit.
https://duckduckgo.com/?q=pandas+dataframe+tutorial&t=ffsb&ia=web should help you get started.
Not sure what you mean by serializable; do you mean that they don't print properly? How are you trying to print them? I don't have time to test this right now, but per stackoverflow, print(df) should work, and that's what I remember as well.
https://stackoverflow.com/questions/49826909/how-to-print-out-dataframe-in-python
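For reference, a tiny self-contained pandas example showing the usual inspection tools mentioned above:

```python
import pandas as pd

# Toy dataframe standing in for the section dataframes on the server
df = pd.DataFrame({"sensed_mode": ["WALKING", "BICYCLING"],
                   "duration": [300.0, 600.0]})

print(df)            # plain print renders the dataframe as a text table
print(df.iloc[0])    # .iloc[i] gives the i-th row as a Series
print(df.to_json())  # to_json()/to_dict() give serializable forms
```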
if you can draft something a little more practical
Here are even more detailed steps:

1. Create a new data structure confirmed_section (similar to https://github.com/e-mission/e-mission-server/pull/517)
2. Add a new pipeline step to emission/pipeline/intake_stage.py. emission/analysis/classification/inference/mode/rule_engine.py is an example of a simple pipeline step, where in runPredictionPipeline, you find the unprocessed sections, process them and save the results.
3. In the new step, for each inferred section, find the corresponding confirm object (if any) and create a confirmed_section.
4. Change the analysis.result.section.key in conf/analysis/debug.conf.json to analysis/confirmed_section.
This will get you a working solution in the case where the confirmation happens almost immediately.
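A very rough sketch of steps 2 and 3 above, loosely following the find-unprocessed/process/save shape of runPredictionPipeline (all helper names and object shapes below are invented for illustration):

```python
def create_confirmed_objects(inferred_sections, confirm_lookup):
    """Hypothetical sketch: walk the unprocessed inferred sections, attach
    the user confirmation if one exists, and emit one confirmed_section per
    inferred section (the 1:1:1 mapping discussed above). confirm_lookup
    maps section id -> user-confirmed mode."""
    results = []
    for section in inferred_sections:
        override = confirm_lookup.get(section["id"])
        results.append({
            "section_id": section["id"],
            "sensed_mode": section["sensed_mode"],
            "overridden_mode": override,
            "final_mode": override if override is not None else section["sensed_mode"],
        })
    return results

out = create_confirmed_objects(
    [{"id": "s1", "sensed_mode": "WALKING"},
     {"id": "s2", "sensed_mode": "IN_VEHICLE"}],
    {"s2": "BUS"})
```

The real step would of course read from and save to the timeseries rather than passing lists around.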
Once I review that, the second pipeline step should be much easier and will handle the other case in which the confirmation comes later.
Since there have been requests for this from both DFKI and Heidelberg, and now I need it for the CEO e-bike project, I am going to tackle this issue now.
@lefterav @jf87 @EstherEU @PatGendre
@shankari that's great :-) This would be useful, I guess, if you can include purpose, not only the overridden mode.
Picking this up again: one challenge is that the confirmation currently happens at the trip level and not the section level. How do we then deal with creating confirmed_section objects correctly?
One option is to only create confirmed_trip objects, not confirmed_section. If/when we support trip editing, we can create confirmed_section objects.
But then how would we deal with confirmed_trip objects that don't have user input associated with them? They can have multiple sections, and we would want to use them.
This will be fixed if/when we have trip editing in place but we need to figure out what to do as a temporary workaround.
Another challenge is that some of the data is genuinely represented at the trip level. For example, a trip has a purpose, not a section.
@shankari I agree with you, the major difficulty may be that the mode and purpose buttons are at the trip level, so there is no clear way to infer modes at the section level.
Actually, the mode button should be named "principal mode for the trip" and could possibly be pre-filled with the mode totalling most of the trip length, so that we could have a relation between the trip (principal) mode and the section modes (thus possibly changing section modes, i.e. confirmed section modes, if the principal mode is modified)... but that would be complicated to implement and to understand for the end user!
@PatGendre @robfitzgerald @jf87 @asiripanich since all of you have worked with the data model, feedback would be appreciated
- We will have both confirmed_trip and confirmed_section.
- confirmed_trip will have a field for confirmed_vals, with primary_mode and purpose entries. For branches that use embedded surveys, this will have the survey JSON.
  - confirmed_trip
    - confirmed_vals
      - primary_mode
      - purpose
- confirmed_trip will also have an inferred_vals with a primary_mode entry on master. This will be the same even for branches that use embedded surveys, since they do not include any additional inference algorithms.
- confirmed_section will also have user_input. The user_input will have only a mode entry. Only the primary section of a trip will have the user_input set.
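Putting the proposed fields together, a confirmed_trip/confirmed_section pair on master might look roughly like this (values and exact nesting are illustrative, not the final schema):

```python
# Hypothetical document shapes for the proposal above, on master
confirmed_trip = {
    "confirmed_vals": {          # from the user's big buttons
        "primary_mode": "ebike",
        "purpose": "commute",
    },
    "inferred_vals": {           # from the inference pipeline
        "primary_mode": "BICYCLING",
    },
}

confirmed_section = {
    "sensed_mode": "BICYCLING",
    # only the primary section of the trip gets user_input set
    "user_input": {"mode": "ebike"},
}
```

On survey branches, confirmed_vals would hold the survey JSON instead of the mode/purpose entries.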
Determining the primary section of a trip:

- if there is only one section, it is primary. Due to COVID, we are likely to have many unimodal trips now.
- if there is more than one section, it is the longest section.
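A minimal sketch of that selection rule (plain dicts stand in for the real section objects; the function name is mine):

```python
def find_primary_section(sections):
    """The rule above: a single section is primary; otherwise the longest
    section is primary. 'Longest' here is by distance, one of the two
    options discussed below."""
    if not sections:
        return None
    if len(sections) == 1:
        return sections[0]
    return max(sections, key=lambda s: s["distance"])

primary = find_primary_section([
    {"mode": "WALKING", "distance": 400.0},
    {"mode": "BUS", "distance": 5000.0},
    {"mode": "WALKING", "distance": 250.0},
])
```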
Using the confirmed sections for calculations:

- For most of the pre-defined modes, we can determine what the calculation factors (energy efficiency/carbon emissions) are. But we do allow users to enter their own modes, and it is not clear how we can handle those in calculations.
Since all our calculations are currently based on mode, if the confirmed_mode is one for which we don't have a calculation factor, we will use the sensed mode instead.
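Sketched out, the fallback rule could look like this (the factor table is a placeholder; the real factors live elsewhere in the codebase and are not reproduced here):

```python
# Placeholder calculation-factor table, keyed by mode; values are
# made-up kg-CO2-per-km figures purely for illustration.
CALC_FACTORS = {"WALKING": 0.0, "BICYCLING": 0.0, "CAR": 0.27, "BUS": 0.09}

def mode_for_calculation(confirmed_mode, sensed_mode):
    """The fallback rule above: use the confirmed mode when we know its
    calculation factor, otherwise fall back to the sensed mode."""
    if confirmed_mode in CALC_FACTORS:
        return confirmed_mode
    return sensed_mode

# A custom user-entered mode like "skateboard" has no factor, so the
# sensed mode is used for the calculation instead.
fallback = mode_for_calculation("skateboard", "WALKING")
```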
Whoo! That was a bit complicated but I think it works for now.
From a UI perspective, we will continue to show the confirmed_vals (if they exist) in the big buttons of the trip diary and the inferred mode from the sections at the top - so this just simplifies and pre-computes the values for now.
However, in the dashboard and the CEO ebike gamification, we can switch to confirmed trips and confirmed sections. Concretely, for the CEO ebike gamification, we can use confirmed_trips to determine the number of ebike trips and the % of travel by ebike.
For the dashboard, we can switch to getting the metrics from confirmed_sections. But if we only get metrics, how do we do the mapping from unknown modes to the corresponding sensed mode? I guess we can do that mapping in the metrics calculation for now.
i agree with the general strategy here, to follow the pattern of building these immutable documents and not to lose any information. regarding the algorithmic selection of primary modes, a few thoughts (which both may not be helpful at this point):
if there is more than one section, it is the longest section
longest by distance or time? also, i could imagine this might differ by survey, where users may want to inject their own "primary section function" (like, for instance, "if {driving|transit} appears anywhere, set it as primary").
Since all our calculations are currently based on mode, if the confirmed_mode is one for which we don't have a calculation factor, we will use the sensed mode instead
might be helpful (?) to write this "analysis_mode" to a field so the user has a record of it.
Whoo!
🦉
longest by distance or time? also, i could imagine this might differ by survey, where users may want to inject their own
Good point. I was going to use distance, since the primary mode is typically motorized, and likely to be much faster than the modes used for the first and last leg.
"primary section function" (like, for instance, "if {driving|transit} appears anywhere, set it as primary").
my assumption was that if people wanted to change this, they would change it in the code. But I guess I could have people pass in a primary section function instead. I might not do that in this first pass though, pending evidence that people actually need it.
I guess we can do that mapping in the metrics calculation for now.
We can actually do this by adding an entry for mode_for_calculation for each section on the server. At some point, we can actually move the CO2 calculation to the server so it can be based on the energy profile of transportation in the trip location.
We originally started calculating values on the client because we were also calculating the calories burned and we didn't want to send over weight and height information to the server. But the CO2/EE calculations are making more and more sense on the server.
hi @shankari
here are a few thoughts:
primary section [...] it is the longest section
It is still not clear to me whether the end user will really understand what "primary section" vs "primary mode" means, and what it implies in terms of metrics calculation. An alternative would be to link the trip mode and the section mode only if there is only one section, or if the longest section is obviously the primary section, say if its length is more than half of the trip length, and otherwise to leave the primary section confirmed mode blank... and wait until there is a full trip/section editing feature.
Since all our calculations are currently based on mode, if the confirmed_mode is one for which we don't have a calculation factor, we will use the sensed mode instead.
This is a reasonable rule. In the future, a feature could be added so that the end user can state that a mode is similar to an existing mode; for example, wheelchair is similar to walking in terms of calculation, and e-scooter similar to ebike... Anyway, as long as the calculation parameters are not personalised for each end user, the calculations are only indicative (e.g. for car, the emission figures can vary a lot from one model to another).
We originally started calculating values on the client because we were also calculating the calories burned and we didn't want to send over weight and height information to the server. But the CO2/EE calculations are making more and more sense on the server.
I agree, as long as the weight/height privacy can be managed and/or the end-user agrees to send these personal data to the server.
I agree, as long as the weight/height privacy can be managed and/or the end-user agrees to send these personal data to the server.
I think we would split the calculations. CO2/EE would be on the server; calorie (which doesn't depend on motorized mode) would be on the phone.
It is still not clear to me if the end-user will really understand what "primary section" vs "primary mode" means, and what it implies in terms of metrics calculation. An alternative would be to link between trip mode and section mode only if there is only one section, or if the longest section if obviously the primary section, say if its length is more then the half of the trip length. And otherwise to leave the primary section confirmed mode blank ... and wait until there is a full trip/section edition feature.
I want to clarify that this will not necessarily affect the UI at this time. The UI currently displays section modes at the top of the card, and displays trip mode overrides at the bottom. The proposed data model will allow us to continue doing that.
@shankari Thank you for clarifying, I did not get that point.
The main user-visible difference will be in the dashboard and the calculations
Also, the diary screen code could automatically label the primary mode button (with the primary section's inferred mode when a section makes up >50% of the trip length), but it might not be very useful.
Also, the diary screen code could automatically label the primary mode button (with the primary section's inferred mode when a section makes up >50% of the trip length), but it might not be very useful.
Yes, that would also be confusing, as you pointed out earlier. We already show the inferred mode at the top of the trip card, so it is not like the information is missing. We can change this later if we have time to run some user tests.
while implementing this, I modified the server code to be consistent with the phone. And found that there were user inputs that didn't match any trips. While investigating that further, I discovered that:
Experimenting further, the trips do have matches before the pipeline is run, but the matches break once the pipeline runs. Need to investigate this and fix both phone and server implementations.
In draft mode, we have:
9:19 -> 9:32: Bike
9:34 -> 9:55: Bike
5:28 -> 5:49: Bike
5:54 -> 6:46: Walk, not a trip
7:02 -> 7:50: Bike
After increasing the "end of trip" buffer to 15 mins, we get:
9:19 -> 9:29: Bike
9:33 -> 9:46: Bike
5:22 -> 5:56: Bike
6:54 -> 7:21: blank
The last entry doesn't match because the gap is fairly large (30 mins). On the phone, any attempt at fixing that would require additional server calls. But on the server, we could try to check the raw trips.
This is really weird. The cleaned trips are:
[{'_id': ObjectId('5fda8a44b368d4a4b76d0042'),
  'data': {'start_fmt_time': '2016-12-12T17:22:22.062618-08:00',
           'end_fmt_time': '2016-12-12T17:56:53.030000-08:00'}},
 {'_id': ObjectId('5fda8a45b368d4a4b76d008b'),
  'data': {'start_fmt_time': '2016-12-12T18:54:58.134886-08:00',
           'end_fmt_time': '2016-12-12T19:21:30.623000-08:00'}}]
The raw trips are:
[{'_id': ObjectId('5fda8a42b368d4a4b76cffe0'),
'data': {'start_fmt_time': '2016-12-12T17:27:24.524000-08:00',
'end_fmt_time': '2016-12-12T17:56:53.030000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe2'),
'data': {'start_fmt_time': '2016-12-12T18:07:22.524000-08:00',
'end_fmt_time': '2016-12-12T18:09:27.147000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe4'),
'data': {'start_fmt_time': '2016-12-12T18:38:25.007000-08:00',
'end_fmt_time': '2016-12-12T18:39:59.749000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe6'),
'data': {'start_fmt_time': '2016-12-12T19:02:04.350000-08:00',
'end_fmt_time': '2016-12-12T19:21:30.623000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffe8'),
'data': {'start_fmt_time': '2016-12-12T19:27:35.382000-08:00',
'end_fmt_time': '2016-12-12T19:29:05.394000-08:00'}},
{'_id': ObjectId('5fda8a42b368d4a4b76cffea'),
'data': {'start_fmt_time': '2016-12-12T19:46:59.088000-08:00',
'end_fmt_time': '2016-12-12T19:48:02.122000-08:00'}}]
draft trips:
5:28 -> 5:49
5:54 -> 6:46
7:02 -> 7:50
cleaned trips:
5:22 -> 5:56
6:54 -> 7:21
raw trips:
5:27 -> 5:56
6:07 -> 6:09
6:38 -> 6:39
7:02 -> 7:21
7:27 -> 7:29
7:46 -> 7:48
So I don't think that the raw trips will work either.
I can think of two potential fixes. First, on the phone, we change the first part of the check to trip_start <= ui_start <= trip_end. This is consistent with the query on the server, so we can skip that part, or retain it for maintainability.
For the second part, if the current checks fail, on the server, we can check to see:
On the phone, check (2) would require another server call. Check (1) could be done locally iff it was not the last trip of the day. But we will also change the geojson generation code to use confirmed_trips, so eventually the server code will be the only code.
And this is already broken on the phone, so it is not like it will break any further.
I ended up fixing this on the phone too with option (1), just to ensure that it worked. In case this was the last trip of the day, I checked to ensure that the user input didn't span two days, and then returned true. With this fix, all user inputs are visible even for cleaned trips.
Ah, but there is a wrinkle.
Both the draft trips (5:28 -> 5:49 and 5:54 -> 6:46) match the new criteria for the cleaned trip (5:22 -> 5:56). And assuming the user went through and labelled in order, the second label will be more recent than the first, and will be returned, although the first one is clearly the better match, because it overlaps more substantially than the second. We need to make the selection of the potential candidates more sophisticated as well.
But that is super complicated because you are going to have dueling orders and we have to put some thresholds anyway. Let's just add a validity check which says that the overlap should be at least 50% for the user label to make sense. That makes sense overall, because if it was off by a lot, then maybe the user label was not that relevant anyway.
That seems to fix it for this case and it seems pretty straightforward as well 👍 Another option is to just reject "Not a trip" entries.
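The 50% validity check is straightforward to sketch (timestamps as plain floats; these helper names are mine, not the phone or server implementation):

```python
def overlap_ratio(trip_start, trip_end, input_start, input_end):
    """Fraction of the trip covered by the user input; non-overlapping
    intervals yield 0.0."""
    overlap = min(trip_end, input_end) - max(trip_start, input_start)
    return max(overlap, 0.0) / (trip_end - trip_start)

def is_valid_match(trip_start, trip_end, input_start, input_end, threshold=0.5):
    """The validity check above: accept the user label only when it
    overlaps at least half of the trip."""
    return overlap_ratio(trip_start, trip_end, input_start, input_end) >= threshold
```

The ratios printed in the logs below (0.0587 and 0.732) are exactly this overlap fraction, so the first candidate is rejected and the second accepted.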
For the 5:22 -> 5:56 trip
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:28:44-08:00 -> 2016-12-12T17:49:48-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are true && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:54:51-08:00 -> 2016-12-12T18:46:37-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are true && true end checks are false || false)
Second level of end checks when the next trip is defined(1481597197.113 <= 1481597698.1348858) = true
Flipped endCheck, overlap(121.53799986839294)/trip(2070.9673824310303) = 0.05868658333272448
Cleaned trip: comparing user = 2016-12-12T19:02:04-08:00 -> 2016-12-12T19:50:04-08:00 trip = 2016-12-12T17:22:22.062618-08:00 -> 2016-12-12T17:56:53.030000-08:00 start checks are true && false end checks are false || false)
In getUserInputForTripStartEnd, one potential candidate, returning 2016-12-12T17:28:44-08:00(1481592524.076) -> 2016-12-12T17:49:48-08:00(1481593788.738) pick_drop logged at 1516072736.493696
For the 6:54 -> 7:21 case:
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:28:44-08:00 -> 2016-12-12T17:49:48-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:34:16-08:00 -> 2016-12-12T09:55:07-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T09:19:37-08:00 -> 2016-12-12T09:32:11-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T17:54:51-08:00 -> 2016-12-12T18:46:37-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are false && true end checks are true || true)
Cleaned trip: comparing user = 2016-12-12T19:02:04-08:00 -> 2016-12-12T19:50:04-08:00 trip = 2016-12-12T18:54:58.134886-08:00 -> 2016-12-12T19:21:30.623000-08:00 start checks are true && true end checks are false || false)
Second level of end checks for the last trip of the day
compare 12 with 12 = true
Flipped endCheck, overlap(1166.2730000019073)/trip(1592.488114118576) = 0.732358998263184
In getUserInputForTripStartEnd, one potential candidate, returning 2016-12-12T19:02:04-08:00(1481598124.35) -> 2016-12-12T19:50:04-08:00(1481601004.01) pick_drop logged at 1516072777.245432
And server-side test passes as well
----------------------------------------------------------------------
Ran 1 test in 9.620s
One final consideration while creating the pipeline is how to set the timestamps for pipeline states. We will need to use the write_ts of the user input but the end_ts of the confirmed trips. Let's walk through how this will work in both cases and ensure that there are no surprises.
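As a sketch of that bookkeeping (object shapes are simplified stand-ins for the real timeseries entries, and the function is mine):

```python
def update_pipeline_state(user_inputs, confirmed_trips):
    """Illustrative sketch of the timestamp rule above: the user-input step
    advances by the write_ts of incoming inputs (they can arrive days after
    the trip happened), while the trip step advances by the end_ts of the
    confirmed trips it emits."""
    input_ts = max((ui["metadata"]["write_ts"] for ui in user_inputs), default=0.0)
    trip_ts = max((ct["data"]["end_ts"] for ct in confirmed_trips), default=0.0)
    return {"input_step_last_processed_ts": input_ts,
            "trip_step_last_processed_ts": trip_ts}

state = update_pipeline_state(
    [{"metadata": {"write_ts": 200.0}}, {"metadata": {"write_ts": 150.0}}],
    [{"data": {"end_ts": 120.0}}])
```

Using write_ts for the input step is what prevents a late-arriving label from being skipped as "older than the last processed date".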
Let's try a different approach:
user confirms in draft mode
user confirms in confirmed mode
Actually, that only fixed the trips, we also need to fix sections.
Hi @shankari, you've been improving the trip labeling process lately and introduced a new confirmed_trip key, but it is not clear to me; I still have at least 2 questions:
@PatGendre the deployer dashboard (emdash) uses the user labeled modes, but the in-app dashboard (the "metrics" screen) does not, at least partially because I wasn't sure how to deal with the mismatch between trip and section labeling that you outlined.
The A-mission branch of the project (https://github.com/xubowenhaoren/A-Mission) from UW supports confirming sections, along with a bunch of accessibility improvements. If somebody wanted to merge the changes over to master, it would help a lot wrt modifying the in-app dashboard as well.
@shankari thanks! Yann will have a look at A-mission; of course I'll tell you if we envisage merging this appealing section labeling into master (we'd have to find a budget).
@shankari FYI: as there is no budget to complete the labeling feature with indicators that take into account the modes labeled manually by the user, it was decided to remove the labeling feature (at least for a few months), because we think the user won't understand why the (mode, purpose) labels he enters have no effect on the dashboard indicators.
And I have another question, as I've seen that GabrielKS is implementing a "label inference pipeline" : do you have a kind of "functional spec" of what this pipeline will do (on the server and on the app side)? Thanks
@PatGendre I apologize for the delay in responding to your comments about the confirmed_trip objects, but the CanBikeCO deployments are just starting up, and I was on vacation for a week.
And I have a couple of interns working on improving the labeling by determining common and novel trips. The related issues are: for the analysis: https://github.com/e-mission/e-mission-docs/issues/606 for the system integration: https://github.com/e-mission/e-mission-docs/issues/647
This is a key component of our ongoing work since there is information that we cannot automatically detect - e.g. purpose and replaced mode. So we have to urgently reduce the user labeling burden.
@shankari no problem, it is not an urgent question for us!
Thanks for your reply, I understand better what you intend to do.
If I understand well enough, with what your interns will implement, there will still be the question of labeling modes at the trip level while the actual mode is at the section level.
FYI Fouad and Yann worked on clustering stops outside of e-mission in postgis in 2019, so as to produce statistics on frequent places and frequent trips between places; even if we didn't intend to try ML on automatic mode/purpose labeling, it was already interesting.
We would like to do this again with La Rochelle, but in a python notebook rather than in postgis, and (looking at it quickly) I've found the k-means clustering methods of postgis available in shapely for python...
We want to use the transportation mode label to actually impact the metrics shown inside the dashboard.
Since the labels are currently "decoration only", we need to implement some code that e.g. adds an endpoint that will basically do the same as the userMetrics endpoint, but use the label as the mode of transportation if a label has been applied to that trip.
My question being: in which file would I best start to look for a place to implement this feature?