e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

negative durations for sections and places #462

Open PatGendre opened 4 years ago

PatGendre commented 4 years ago

In the analysis timeseries collection, around 10% of the sections (cleaned and inferred) have negative durations, and a few have a null duration.
Places also show quite a lot of negative and null durations. Stops have no negative durations, but many have a duration of 0. How should we interpret this? Trips are not affected: they only have durations > 0.

shankari commented 4 years ago

Hm. Let me check if that is true for the sections/places from the public dataset as well

shankari commented 4 years ago

I checked the public dataset and there were 3 negative duration sections.

In [2]: import emission.core.get_database as edb

In [3]: edb.get_analysis_timeseries_db().find({"metadata.key": "analysis/cleaned_section", "data.duration": {"$lte": 0}}).count()
Out[3]: 3

However, when I looked at the details, the actual durations (end_ts - start_ts) were positive. I bet we are not recalculating the duration somewhere.

In [4]: neg_values = list(edb.get_analysis_timeseries_db().find({"metadata.key": "analysis/cleaned_section", "data.duration": {"$lte": 0}}))

In [6]: for v in neg_values:
               print(v["data"]["start_ts"], v["data"]["end_ts"], v["data"]["start_fmt_time"], v["data"]["end_fmt_time"], v["data"]["duration"])

1564161265.0 1564161447.0 2019-07-26T10:14:25-07:00 2019-07-26T10:17:27-07:00 -60.0
1568740561.0 1568740733.0 2019-09-17T10:16:01-07:00 2019-09-17T10:18:53-07:00 -156.0
1568736850.0 1568737129.0 2019-09-17T09:14:10-07:00 2019-09-17T09:18:49-07:00 -260.0

This should be a simple fix; I just need to ensure that we are recalculating the duration after extrapolating the section start/end. I will generate a patch sometime soon.

Regardless, the start and end timestamps seem to be fine.
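
As a quick sanity check on the three rows above, the duration can be recomputed directly from the printed timestamps. This is a minimal sketch that uses only the values shown in the output; the `rows` list and the variable names are illustrative and not part of the e-mission code.

```python
# (start_ts, end_ts, stored_duration) copied from the query output above
rows = [
    (1564161265.0, 1564161447.0, -60.0),
    (1568740561.0, 1568740733.0, -156.0),
    (1568736850.0, 1568737129.0, -260.0),
]

# Recompute each duration from the timestamps instead of trusting the stored field
recomputed = [end_ts - start_ts for start_ts, end_ts, _ in rows]

for (start_ts, end_ts, stored), actual in zip(rows, recomputed):
    print(f"stored={stored} recomputed={actual}")
```

The recomputed values are 182, 172 and 279 seconds, which agree with the formatted start and end times in each row.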

PatGendre commented 4 years ago

@shankari FYI, we have corrected the data in MongoDB with this request:

db.getCollection('Stage_analysis_timeseries').find({"metadata.key": "analysis/cleaned_section"}).forEach(function(elt) {elt.data.duration = elt.data.end_ts - elt.data.start_ts; db.Stage_analysis_timeseries.save(elt);})

We can then check that there is no negative (or zero) value and no NaN with:

db.getCollection('Stage_analysis_timeseries').find({"metadata.key": "analysis/cleaned_section", "data.duration": {"$lte": 0.0}})
db.getCollection('Stage_analysis_timeseries').find({"metadata.key": "analysis/cleaned_section", "data.duration": {"$eq": NaN}})

And the same requests for the inferred_section data.

We will apply this periodically until we can pull the server fix into our code (it is not urgent!).
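
For anyone applying the same workaround from Python rather than the mongo shell, a rough equivalent is sketched below. It assumes every entry has numeric `data.start_ts` and `data.end_ts` fields; `plan_duration_fixes` is a hypothetical helper written for this example, not part of e-mission. It only works out which entries need a new duration, and leaves writing them back to the caller.

```python
import math

def plan_duration_fixes(entries, abs_tol=1e-3):
    """Return (_id, corrected_duration) pairs for entries whose stored
    duration disagrees with end_ts - start_ts.

    Hypothetical helper for illustration; abs_tol (seconds) is an assumed
    tolerance for floating point noise.
    """
    fixes = []
    for entry in entries:
        data = entry["data"]
        expected = data["end_ts"] - data["start_ts"]
        stored = data.get("duration")
        # A missing or NaN duration also counts as needing a fix
        if stored is None or not math.isclose(stored, expected, abs_tol=abs_tol):
            fixes.append((entry["_id"], expected))
    return fixes

# In-memory documents shaped like analysis timeseries entries (made-up _id values)
sample = [
    {"_id": 1, "data": {"start_ts": 1564161265.0, "end_ts": 1564161447.0, "duration": -60.0}},
    {"_id": 2, "data": {"start_ts": 100.0, "end_ts": 160.0, "duration": 60.0}},
]
fixes = plan_duration_fixes(sample)
print(fixes)
```

With pymongo, the entries could come from `collection.find({"metadata.key": "analysis/cleaned_section"})`, and each fix could be written back with `collection.update_one({"_id": _id}, {"$set": {"data.duration": duration}})`. Note that this does not help when `end_ts` is itself earlier than `start_ts`.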

PatGendre commented 4 years ago

@shankari FYI, we have found 1 trip (only 1!) with a duration < 0. In this case we cannot correct it by replacing the duration with end_ts - start_ts, because the stored duration already equals end_ts - start_ts and end_ts < start_ts.

Here is the trip:

/* 1 */
{
    "_id" : ObjectId("5da9992f4e276dedbcfce559"),
    "user_id" : LUUID("3a2299d0-605a-4d92-bef8-1058d145d301"),
    "metadata" : {
        "key" : "analysis/cleaned_trip",
        "platform" : "server",
        "write_ts" : 1571395887.39329,
        "time_zone" : "America/Los_Angeles",
        "write_local_dt" : { "year" : 2019, "month" : 10, "day" : 18, "hour" : 3, "minute" : 51, "second" : 27, "weekday" : 4, "timezone" : "America/Los_Angeles" },
        "write_fmt_time" : "2019-10-18T03:51:27.393295-07:00"
    },
    "data" : {
        "source" : "DwellSegmentationDistFilter",
        "end_ts" : 1560277573.911,
        "end_local_dt" : { "year" : 2019, "month" : 6, "day" : 11, "hour" : 20, "minute" : 26, "second" : 13, "weekday" : 1, "timezone" : "Europe/Paris" },
        "end_fmt_time" : "2019-06-11T20:26:13.911000+02:00",
        "end_loc" : { "type" : "Point", "coordinates" : [ -1.5649361, 47.3088074 ] },
        "raw_trip" : ObjectId("5da997f04e276dedbcfce2fc"),
        "start_ts" : 1565829831.81263,
        "start_local_dt" : { "year" : 2019, "month" : 8, "day" : 15, "hour" : 2, "minute" : 43, "second" : 51, "weekday" : 3, "timezone" : "Europe/Paris" },
        "start_fmt_time" : "2019-08-15T02:43:51.812628+02:00",
        "start_loc" : { "type" : "Point", "coordinates" : [ -1.50566755990457, 47.2459261394138 ] },
        "duration" : -5552257.90162754,
        "distance" : 295.442804567268,
        "start_place" : ObjectId("5da999814e276dedbcfd0570"),
        "end_place" : ObjectId("5da999814e276dedbcfd0571")
    }
}

BUT when looking at this trip, it starts in August and ends in June, which is very bizarre. Please don't take it into account!
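
The two cases reported in this thread (a stored duration that disagrees with the timestamps, and timestamps that are themselves out of order) can be told apart mechanically. The sketch below is an illustration only: `classify_duration`, its labels and the 1 ms tolerance are made up for this example and are not part of e-mission.

```python
import math

def classify_duration(start_ts, end_ts, duration, abs_tol=1e-3):
    """Label an entry by the kind of duration problem it has, if any.

    Hypothetical helper: the labels and the tolerance are assumptions.
    """
    if end_ts < start_ts:
        # Recomputing end_ts - start_ts cannot help: it is negative too
        return "inverted_timestamps"
    if not math.isclose(duration, end_ts - start_ts, abs_tol=abs_tol):
        # Timestamps are fine; only the stored duration is out of date
        return "stale_duration"
    return "ok"

# The cleaned_trip reported above: starts in August, ends in June
trip_label = classify_duration(1565829831.81263, 1560277573.911, -5552257.90162754)

# One of the cleaned_section rows from the public dataset
section_label = classify_duration(1564161265.0, 1564161447.0, -60.0)

print(trip_label, section_label)
```

Only entries in the stale-duration category can be repaired by overwriting the duration with end_ts - start_ts; entries with inverted timestamps need their start or end investigated first.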

PatGendre commented 4 years ago

@shankari Note also that we find some sections with duration==0; most of them also have distance==0, but not all.

PatGendre commented 4 years ago

@shankari Finally, we find places with duration<0 where duration=exit_ts-enter_ts (so the user is recorded as having left the place before entering it).

lgharib commented 2 years ago

Hi, was this issue fixed? We are facing the same issue with negative durations. We are also seeing durations and distances with a value of 0.

shankari commented 2 years ago

@lgharib Nope, the issue has not been fixed yet. I double-checked the staging system on CanBikeCO and we do have negative duration sections:

>>> edb.get_analysis_timeseries_db().find({"metadata.key": "analysis/cleaned_section", "data.duration": {"$lte": 0}}).count()
86

But we don't have any negative duration trips. The CanBikeCO program is currently focused on trips, which is why we haven't prioritized this. Let me take a look.

>>> edb.get_analysis_timeseries_db().find({"metadata.key": "analysis/cleaned_trip", "data.duration": {"$lte": 0}}).count()
0

PatGendre commented 2 years ago

@shankari Hi, I can confirm that we still have many negative duration trips on tracemob too. Also, we recently had a trip with distance 0 and duration 0, for which I am opening a new issue.