e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Want to see mode and purpose respective of uuid #350

Closed deepalics0044 closed 5 years ago

deepalics0044 commented 5 years ago

I ran the pipeline and can see the recent mode and purpose confirmations entered by Ipsita ma'am. However, the code I ran only shows the aggregate mode and purpose:

esta.TimeSeries.get_aggregate_time_series().get_data_df("manual/mode_confirm")[[
"start_fmt_time", "end_fmt_time", "label"]]

What code do I need to see the mode and purpose for each uuid?

shankari commented 5 years ago

@deepalics0044 couple of notes

The aggregate timeseries still returns the uuid along with each entry. It's just that get_data_df doesn't map the uuid to a column. So if you use the entries instead, you should be able to see the uuid as well (e.g. something like)

mc_all = esta.TimeSeries.get_aggregate_time_series().find_entries("manual/mode_confirm")
[(e["user_id"], e["data"]["start_fmt_time"], e["data"]["end_fmt_time"], e["data"]["label"]) for e in mc_all]

But presumably what you really want to do is to combine a particular user's location, analysed trips, and mode confirmation. As you can see from https://github.com/e-mission/e-mission-server/blob/master/Timeseries_Sample.ipynb you can get the timeseries for a particular user and then retrieve dataframes for each of the keys that you want.
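
As a sketch of the entries-based approach described in this comment, the snippet below runs the same comprehension over a few hand-written dictionaries. The dictionaries only imitate the `user_id` / `data` layout visible in the entry dump later in this thread; the UUIDs, times and labels are invented, and `mock_entries` is a hypothetical stand-in for the result of `find_entries`.

```python
import uuid

# Hypothetical stand-ins for what find_entries("manual/mode_confirm") yields.
# Only the fields used below are included; all values are invented.
user_a = uuid.uuid4()
user_b = uuid.uuid4()
mock_entries = [
    {"user_id": user_a,
     "data": {"start_fmt_time": "2019-06-19T11:08:24+05:30",
              "end_fmt_time": "2019-06-19T11:10:59+05:30",
              "label": "walk"}},
    {"user_id": user_b,
     "data": {"start_fmt_time": "2019-06-19T11:21:39+05:30",
              "end_fmt_time": "2019-06-19T11:33:33+05:30",
              "label": "bike"}},
    {"user_id": user_a,
     "data": {"start_fmt_time": "2019-06-19T12:32:53+05:30",
              "end_fmt_time": "2019-06-19T12:45:46+05:30",
              "label": "taxi"}},
]

# Same shape as the comprehension in the comment: one tuple per entry,
# with the user_id kept next to the times and the label.
rows = [(e["user_id"], e["data"]["start_fmt_time"],
         e["data"]["end_fmt_time"], e["data"]["label"])
        for e in mock_entries]

# Group the labels by user so each uuid can be inspected separately.
labels_by_user = {}
for user_id, _start, _end, label in rows:
    labels_by_user.setdefault(user_id, []).append(label)

for user_id, labels in labels_by_user.items():
    print(user_id, labels)
```

Because the tuples carry `user_id`, the aggregate result can be split per user afterwards without a second query.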

deepalics0044 commented 5 years ago

I am able to see uuid simply by adding

esta.TimeSeries.get_aggregate_time_series().get_data_df("manual/purpose_confirm")[["_id",
"start_fmt_time", "end_fmt_time", "label"]]

But presumably what you really want to do is to combine a particular user's location, analysed trips, and mode confirmation.

But yes you're right I want to combine mode and purpose with location and time.

For some reason I am not able to access the link. Is it the same Timeseries_Sample.ipynb we have in the e-mission-server root directory?

shankari commented 5 years ago

I am able to see uuid simply by adding

That is _id aka objectid, not user_id. You will see that the values are different even if you just look at the confirmations from @ipsita0012

For some reasons I am not able to access the link. Is it the same Timeseries_Sample.ipynb we have in e-mission-server root directory?

Yes. As you can see from the path, it is a link to the file in the root of the e-mission-server repository on github.

deepalics0044 commented 5 years ago

That is _id aka objectid, not user_id. You will see that the values are different even if you just look at the confirmations from @ipsita0012

I see that the values were different, so I changed the column from _id to user_id.

shankari commented 5 years ago

@deepalics0044 I thought we didn't put the user_id into the dataframe, but you're right that we do! Please close this issue if there is nothing left to do.

shankari commented 5 years ago

Actually I am proactively closing the issue because otherwise issues just linger on forever.

deepalics0044 commented 5 years ago

But presumably what you really want to do is to combine a particular user's location, analysed trips, and mode confirmation. As you can see from https://github.com/e-mission/e-mission-server/blob/master/Timeseries_Sample.ipynb you can get the timeseries for a particular user and then retrieve dataframes for each of the keys that you want.

I can't see the mode in the dataframe:


ct_df.columns

Index(['_id', 'distance', 'duration', 'end_fmt_time', 'end_loc',
       'end_local_dt_day', 'end_local_dt_hour', 'end_local_dt_minute',
       'end_local_dt_month', 'end_local_dt_second', 'end_local_dt_timezone',
       'end_local_dt_weekday', 'end_local_dt_year', 'end_place', 'end_ts',
       'metadata_write_ts', 'raw_trip', 'source', 'start_fmt_time',
       'start_loc', 'start_local_dt_day', 'start_local_dt_hour',
       'start_local_dt_minute', 'start_local_dt_month',
       'start_local_dt_second', 'start_local_dt_timezone',
       'start_local_dt_weekday', 'start_local_dt_year', 'start_place',
       'start_ts', 'user_id'],
      dtype='object')

shankari commented 5 years ago

mode_confirm is not stored in a trip (analysis/cleaned_trip), it is a separate object (manual/mode_confirm). You would get a separate dataframe for it by retrieving objects with that key.

The documentation on manual objects (which can be found by searching for mode_confirm in the docs) has additional details on how to match the confirmation to cleaned trip-like objects. https://github.com/e-mission/e-mission-docs/blob/master/docs/e-mission-both/supporting_user_inputs.md

deepalics0044 commented 5 years ago

Isn't it possible to put the 'label' column from the mode and purpose objects into the cleaned_trip data frame?

The cleaned_trip data frame columns:-

Index(['_id', 'distance', 'duration', 'end_fmt_time', 'end_loc',
       'end_local_dt_day', 'end_local_dt_hour', 'end_local_dt_minute',
       'end_local_dt_month', 'end_local_dt_second', 'end_local_dt_timezone',
       'end_local_dt_weekday', 'end_local_dt_year', 'end_place', 'end_ts',
       'metadata_write_ts', 'raw_trip', 'source', 'start_fmt_time',
       'start_loc', 'start_local_dt_day', 'start_local_dt_hour',
       'start_local_dt_minute', 'start_local_dt_month',
       'start_local_dt_second', 'start_local_dt_timezone',
       'start_local_dt_weekday', 'start_local_dt_year', 'start_place',
       'start_ts', 'user_id'],
      dtype='object')

The mode data frame columns:-

Index(['_id', 'end_fmt_time', 'end_local_dt_day', 'end_local_dt_hour',
       'end_local_dt_minute', 'end_local_dt_month', 'end_local_dt_second',
       'end_local_dt_timezone', 'end_local_dt_weekday', 'end_local_dt_year',
       'end_ts', 'label', 'metadata_write_ts', 'start_fmt_time',
       'start_local_dt_day', 'start_local_dt_hour', 'start_local_dt_minute',
       'start_local_dt_month', 'start_local_dt_second',
       'start_local_dt_timezone', 'start_local_dt_weekday',
       'start_local_dt_year', 'start_ts', 'user_id'],
      dtype='object')

How can I see start_loc | end_loc | start_fmt_time | end_fmt_time | label(mode) | label(purpose) together?

shankari commented 5 years ago

@deepalics0044 that is a pandas question. Feel free to look at the pandas documentation on how to merge two dataframes.

Please note that you cannot just merge the dataframes naively because there may not be a 1:1 correspondence between trip and mode because the user may not have confirmed every trip. Or they may have confirmed the mode and not the purpose. Or they may have confirmed the mode twice. That's why I recommend using the pre-written function get_user_input_for_trip_object
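
The three cases listed above (no confirmation, only one kind of confirmation, several confirmations for one trip) can be illustrated without the server. The sketch below is not the implementation of `get_user_input_for_trip_object`. It assumes, for illustration only, that a confirmation belongs to a trip when its `start_ts` falls inside the trip's time range and that the most recently written one wins. `match_label`, the field names on the mock dictionaries, and all values are hypothetical.

```python
def match_label(trip, confirmations):
    """Return the label of the most recently written confirmation whose
    start_ts lies within the trip, or None if there is no candidate."""
    candidates = [c for c in confirmations
                  if trip["start_ts"] <= c["start_ts"] <= trip["end_ts"]]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["write_ts"])["label"]

# Invented trips covering the three cases from the comment.
trips = [
    {"id": "t1", "start_ts": 100, "end_ts": 200},  # never confirmed
    {"id": "t2", "start_ts": 300, "end_ts": 400},  # mode confirmed, purpose not
    {"id": "t3", "start_ts": 500, "end_ts": 600},  # mode confirmed twice
]
modes = [
    {"start_ts": 300, "write_ts": 410, "label": "bike"},
    {"start_ts": 500, "write_ts": 610, "label": "taxi"},
    {"start_ts": 500, "write_ts": 650, "label": "walk"},  # later write wins
]
purposes = [
    {"start_ts": 500, "write_ts": 655, "label": "work"},
]

# One row per trip, however many confirmations exist for it.
table = [(t["id"], match_label(t, modes), match_label(t, purposes))
         for t in trips]
for row in table:
    print(row)
```

A per-trip lookup of this kind always yields exactly one row per trip, which a plain join on the two frames does not guarantee.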

deepalics0044 commented 5 years ago

Feel free to look at the pandas documentation on how to merge two dataframes.

I went through some of the pandas code. It looks like we can merge data frames. Will an outer join help?

shankari commented 5 years ago

@deepalics0044 I guess, if you do it right. Have you tried it? It's not like I know the answer to this question and I am making you figure it out as part of a class. I have not merged these two dataframes before; if I did, it would be in the code or the documentation.

Try it out and once you figure it out, contribute it here in case others want to re-use it. Think of it as writing a stackoverflow answer :)

shankari commented 5 years ago

If you have tried a bunch of things and nothing works, you can put in what you tried and why it didn't work and I might be able to give you some pointers.

deepalics0044 commented 5 years ago

One thing I tried doing is

frames1=pd.merge(ct_df, ct_dfm,on="start_fmt_time", how="inner")
frames1[["start_loc","start_fmt_time","end_loc","label"]]

Although the start_fmt_time values are correct, the results are incomplete: there are 32 mode entries overall, but I see only 6.

  | start_loc | start_fmt_time | end_loc | label
-- | -- | -- | -- | --
{'type': 'Point', 'coordinates': [77.6264306, ... | 2018-08-15T09:00:27+05:30 | {'type': 'Point', 'coordinates': [77.5634483, ... | taxi
{'type': 'Point', 'coordinates': [77.6264306, ... | 2018-08-15T09:00:27+05:30 | {'type': 'Point', 'coordinates': [77.5634483, ... | taxi
{'type': 'Point', 'coordinates': [77.5683932, ... | 2018-08-15T20:42:14.393609+05:30 | {'type': 'Point', 'coordinates': [77.5714665, ... | bike
{'type': 'Point', 'coordinates': [77.5688895, ... | 2018-10-30T21:11:20.081000+05:30 | {'type': 'Point', 'coordinates': [77.5687387, ... | Namma Metro
{'type': 'Point', 'coordinates': [77.569044, 1... | 2018-10-31T11:28:44+05:30 | {'type': 'Point', 'coordinates': [77.5640412, ... | taxi
{'type': 'Point', 'coordinates': [77.7084547, ... | 2018-10-31T13:30:57.896000+05:30 | {'type': 'Point', 'coordinates': [77.7021772, ... | Flight
{'type': 'Point', 'coordinates': [77.5641222, ... | 2018-12-22T18:06:32+05:30 | {'type': 'Point', 'coordinates': [77.563883, 1... | walk

I also get a KeyError for end_fmt_time, maybe because the join is inner:

frames1[["start_loc","start_fmt_time","end_loc","end_fmt_time","label"]]
KeyError: "['end_fmt_time'] not in index"
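
A note on the `KeyError`: it is probably not caused by the join type. Both column listings earlier in the thread contain `end_fmt_time`, and when two frames share a column name that is not a join key, `pd.merge` renames both copies with the default suffixes `_x` and `_y`. The toy frames below reuse the names `ct_df` and `ct_dfm` from the thread, but their contents are invented. They show the renaming, and use `indicator=True` to show which rows found a partner.

```python
import pandas as pd

# Toy stand-ins for the trip frame and the mode frame; values are invented.
ct_df = pd.DataFrame({
    "start_fmt_time": ["09:00", "10:00", "11:00"],
    "end_fmt_time": ["09:30", "10:30", "11:30"],
    "start_loc": ["A", "B", "C"],
})
ct_dfm = pd.DataFrame({
    "start_fmt_time": ["09:00", "09:00", "12:00"],  # 09:00 confirmed twice
    "end_fmt_time": ["09:30", "09:30", "12:30"],
    "label": ["taxi", "taxi", "walk"],
})

merged = pd.merge(ct_df, ct_dfm, on="start_fmt_time",
                  how="outer", indicator=True)

# The shared non-key column now exists twice, under suffixed names.
print([c for c in merged.columns if c.startswith("end_fmt_time")])

# _merge reports whether each row came from both frames or only one.
print(merged[["start_fmt_time", "label", "_merge"]])
```

In this toy data the doubled 09:00 confirmation produces two merged rows for one trip, which may be the same effect as the repeated taxi row in the table above. An exact-match join on `start_fmt_time` also drops every confirmation whose start time differs from the trip's start time, which is one likely reason for seeing far fewer rows than confirmations.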

deepalics0044 commented 5 years ago

The best result I have got so far is with:


frames1=pd.merge(ct_df, ct_dfm,on="start_fmt_time",how="right")
frames1[["start_loc","start_fmt_time","end_loc","label"]]

But it is also not accurate.

shankari commented 5 years ago

@deepalics0044 I already said

Please note that you cannot just merge the dataframes naively because there may not be a 1:1 correspondence between trip and mode because the user may not have confirmed every trip. Or they may have confirmed the mode and not the purpose. Or they may have confirmed the mode twice. That's why I recommend using the pre-written function get_user_input_for_trip_object

you cannot just try the naive merges. they will not work. you have to use get_user_input_for_trip_object.

shankari commented 5 years ago

If you must use pandas, I would recommend setting a column to the result of apply. similar to this https://github.com/e-mission/e-mission-server/blob/3a5e2c921f41ea1bbeaec0d49f4fc722d418794d/bin/analysis/get_app_analytics.py#L28 but with get_user_input_for_trip_object as the function that you are applying

Alternatively, if you are not familiar with pandas, you can use find_entries instead of get_data_df, get a list and iterate through the list of trips, finding the corresponding mode_confirm for each using get_user_input_for_trip_object
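
The `apply` suggestion can be sketched as follows. Because the real lookup needs a running e-mission database, `lookup_mode` below is a hypothetical placeholder that reads from an in-memory dictionary. In real use it would be replaced by a call to `get_user_input_for_trip_object`, whose exact signature should be checked in the server code. The trip frame and its values are invented.

```python
import pandas as pd

# Hypothetical confirmations keyed by trip id; trip "t2" has none.
FAKE_CONFIRMATIONS = {"t1": "walk", "t3": "taxi"}

def lookup_mode(trip_row):
    """Placeholder for the real per-trip lookup; returns a label or None."""
    return FAKE_CONFIRMATIONS.get(trip_row["_id"])

# Invented trip frame with only the columns needed for the illustration.
ct_df = pd.DataFrame({
    "_id": ["t1", "t2", "t3"],
    "start_fmt_time": ["09:00", "10:00", "11:00"],
})

# axis=1 passes each row (one trip) to the function;
# the returned values become a new column aligned with the trips.
ct_df["mode"] = ct_df.apply(lookup_mode, axis=1)
print(ct_df)
```

The frame keeps one row per trip, and unconfirmed trips simply get a missing value in the new column.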

shankari commented 5 years ago

@deepalics0044 were you able to resolve this? Did you use find_entries or DataFrame.apply? If you could document your solution here, it would help other users with the same question.

deepalics0044 commented 5 years ago

Since I want the data in columns, I am still working with pandas and have some results (I am still studying them), but using the pre-written functions is more of a challenge for me.

With respect to:

Alternatively, if you are not familiar with pandas, you can use find_entries instead of get_data_df, get a list and iterate through the list of trips, finding the corresponding mode_confirm for each using get_user_input_for_trip_object

Are these the changes that need to be applied?

entry_it = ts.find_entries(["analysis/cleaned_trip"], time_query=None)
for ct in entry_it:

    cte = ecwe.Entry(ct)

    # Is this each trip?
    print("=== Trip:", cte.data.start_loc, "->", cte.data.end_loc)

    user_label = esdt.get_user_input_for_trip_object("manual/mode_confirm", test_user_id, cte.get_id())

    # Are these the sections corresponding to each trip? If yes, does the code written above make sense?
    section_it = esdt.get_sections_for_trip("analysis/cleaned_section", test_user_id, cte.get_id())

    for sec in section_it:

        print("  --- Section:", sec.data.start_loc, "->", sec.data.end_loc, " on ", sec.data.sensed_mode)

shankari commented 5 years ago

Largely, yes, but code review is not a substitute for testing. My only feedback is that, for performance reasons, since you already have the trip object, you can use get_user_input_for_trip_object (which the implementation of get_user_input_for_trip defers to) to avoid an additional lookup.

Did you have any errors when you ran it? There's no harm in running code that retrieves data; it is not going to modify it in any way.

shankari commented 5 years ago

@deepalics0044 I am going to close this tonight unless you have something specific you would like to ask

shankari commented 5 years ago

closing

deepalics0044 commented 5 years ago

for ct in entry_it:

    cte = ecwe.Entry(ct)

    print("=== Trip:", cte.data.start_loc, "->", cte.data.end_loc)

    user_label = esdt.get_user_input_for_trip_object("manual/mode_confirm", test_user_id, cte.get_id()) 

    section_it = esdt.get_sections_for_trip("analysis/cleaned_section", test_user_id, cte.get_id())

    for sec in section_it:

        print("  --- Section:", sec.data.start_loc, "->", sec.data.end_loc, " on ", sec.data.sensed_mode)

The error I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-581-5f4a698ed5ac> in <module>()
      9 
     10 
---> 11     user_label = esdt.get_user_input_for_trip("manual/mode_confirm", test_user_id, cte.get_id())
     12 
     13 

TypeError: get_user_input_for_trip() missing 1 required positional argument: 'user_input_key'

shankari commented 5 years ago

@deepalics0044 get_user_input_for_trip requires 4 arguments and you are passing in only 3. I did say that code review is not a substitute for testing :)

shankari commented 5 years ago

concretely, you want to use something like get_user_input_for_trip("analysis/cleaned_trip", test_user_id, cte.get_id(), "manual/mode_confirm")

deepalics0044 commented 5 years ago

I am not able to see all the mode labels using the above function. I have 14 mode labels altogether, but I can see only 5-6 using the function.

=== Trip: {"coordinates": [77.5687779, 13.0146252], "type": "Point"} -> {"coordinates": [77.5633922, 13.030077], "type": "Point"}
Entry({'_id': ObjectId('5d038018317394de3a565127'), 'user_id': UUID('79d9df48-6b44-4a5b-8333-1211b33aedc8'), 'metadata': {'key': 'manual/mode_confirm', 'platform': 'android', 'read_ts': 0, 'time_zone': 'Asia/Kolkata', 'type': 'message', 'write_ts': 1560493331.145, 'write_local_dt': {'year': 2019, 'month': 6, 'day': 14, 'hour': 11, 'minute': 52, 'second': 11, 'weekday': 4, 'timezone': 'Asia/Kolkata'}, 'write_fmt_time': '2019-06-14T11:52:11.145000+05:30'}, 'data': {'start_ts': 1560492540.159, 'end_ts': 1560493136.119, 'label': 'Walk', 'start_local_dt': {'year': 2019, 'month': 6, 'day': 14, 'hour': 11, 'minute': 39, 'second': 0, 'weekday': 4, 'timezone': 'Asia/Kolkata'}, 'start_fmt_time': '2019-06-14T11:39:00.159000+05:30', 'end_local_dt': {'year': 2019, 'month': 6, 'day': 14, 'hour': 11, 'minute': 48, 'second': 56, 'weekday': 4, 'timezone': 'Asia/Kolkata'}, 'end_fmt_time': '2019-06-14T11:48:56.119000+05:30'}})
=== Trip: {"coordinates": [77.5633922, 13.030077], "type": "Point"} -> {"coordinates": [77.5641214, 13.0163405], "type": "Point"}
None

Likewise I see only five.

shankari commented 5 years ago

@deepalics0044 hm, I wonder if there is an underlying issue with the matching algorithm that is causing both this issue and the earlier one that you reported where you couldn't see the trip-end prompt results in the diary. The matching algorithm is pretty simple - can you look at the trip details and the confirmation object details and figure out why it doesn't work?

Alternatively, you can send me the dump for the day with the mismatch and I can take a look.

shankari commented 5 years ago

I am not able to reproduce the problem. Here's the list of trips.

In [12]: for ct in entry_it:
    ...:         cte = ecwe.Entry(ct)
    ...:         print("=== Trip:", cte.data.start_fmt_time, "->", cte.data.end_fmt_time
    ...: )
    ...:
=== Trip: 2019-06-18T15:54:46.179493+05:30 -> 2019-06-18T16:23:14.659000+05:30
=== Trip: 2019-06-18T16:41:27.850233+05:30 -> 2019-06-18T16:50:00.584000+05:30
=== Trip: 2019-06-18T18:32:16.203440+05:30 -> 2019-06-18T18:42:36.349000+05:30
=== Trip: 2019-06-18T19:19:03.393391+05:30 -> 2019-06-18T19:22:58.869000+05:30
=== Trip: 2019-06-19T11:08:24.697000+05:30 -> 2019-06-19T11:10:59.636000+05:30
=== Trip: 2019-06-19T11:21:39.129006+05:30 -> 2019-06-19T11:33:33.966000+05:30
=== Trip: 2019-06-19T11:37:20.094655+05:30 -> 2019-06-19T11:44:37.301000+05:30
=== Trip: 2019-06-19T12:32:53.399599+05:30 -> 2019-06-19T12:45:46.527000+05:30

and here's the list of confirm objects

=== Confirm: 2019-06-19T11:08:34.738000+05:30 -> 2019-06-19T11:14:47.839000+05:30
=== Confirm: 2019-06-19T11:08:24.697000+05:30 -> 2019-06-19T11:10:59.636000+05:30
=== Confirm: 2019-06-19T11:21:39.129006+05:30 -> 2019-06-19T11:33:33.966000+05:30
=== Confirm: 2019-06-19T11:37:20.094655+05:30 -> 2019-06-19T11:44:37.301000+05:30
=== Confirm: 2019-06-19T12:32:53.399599+05:30 -> 2019-06-19T12:45:46.527000+05:30

Note that the first two entries are essentially for the same trip.

And when I match them up, I get

=== Trip: 2019-06-18T16:41:27.850233+05:30 -> 2019-06-18T16:50:00.584000+05:30
=== Trip: 2019-06-18T18:32:16.203440+05:30 -> 2019-06-18T18:42:36.349000+05:30
=== Trip: 2019-06-18T19:19:03.393391+05:30 -> 2019-06-18T19:22:58.869000+05:30
=== Trip: 2019-06-19T11:08:24.697000+05:30 -> 2019-06-19T11:10:59.636000+05:30
~~~ Confirm: 2019-06-19T11:08:34.738000+05:30 -> 2019-06-19T11:14:47.839000+05:30
=== Trip: 2019-06-19T11:21:39.129006+05:30 -> 2019-06-19T11:33:33.966000+05:30
~~~ Confirm: 2019-06-19T11:21:39.129006+05:30 -> 2019-06-19T11:33:33.966000+05:30
=== Trip: 2019-06-19T11:37:20.094655+05:30 -> 2019-06-19T11:44:37.301000+05:30
~~~ Confirm: 2019-06-19T11:37:20.094655+05:30 -> 2019-06-19T11:44:37.301000+05:30
=== Trip: 2019-06-19T12:32:53.399599+05:30 -> 2019-06-19T12:45:46.527000+05:30
~~~ Confirm: 2019-06-19T12:32:53.399599+05:30 -> 2019-06-19T12:45:46.527000+05:30

which seems to be fine. I suspect that the reason you have additional confirm objects that are not matched is because they are actually duplicates of the ones that do match and when we find duplicates, we pick the last one.
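
The listings in this comment can be checked with a few lines of standard-library Python. The sketch below copies the trip and confirm timestamps from above and counts, for each trip, how many confirm objects start inside it. The inclusive "confirm start within the trip" rule is an assumption made for this illustration and is not taken from the server code.

```python
from datetime import datetime

# (start, end) pairs copied from the trip listing in this comment.
trips = [
    ("2019-06-18T15:54:46.179493+05:30", "2019-06-18T16:23:14.659000+05:30"),
    ("2019-06-18T16:41:27.850233+05:30", "2019-06-18T16:50:00.584000+05:30"),
    ("2019-06-18T18:32:16.203440+05:30", "2019-06-18T18:42:36.349000+05:30"),
    ("2019-06-18T19:19:03.393391+05:30", "2019-06-18T19:22:58.869000+05:30"),
    ("2019-06-19T11:08:24.697000+05:30", "2019-06-19T11:10:59.636000+05:30"),
    ("2019-06-19T11:21:39.129006+05:30", "2019-06-19T11:33:33.966000+05:30"),
    ("2019-06-19T11:37:20.094655+05:30", "2019-06-19T11:44:37.301000+05:30"),
    ("2019-06-19T12:32:53.399599+05:30", "2019-06-19T12:45:46.527000+05:30"),
]
# (start, end) pairs copied from the confirm listing in this comment.
confirms = [
    ("2019-06-19T11:08:34.738000+05:30", "2019-06-19T11:14:47.839000+05:30"),
    ("2019-06-19T11:08:24.697000+05:30", "2019-06-19T11:10:59.636000+05:30"),
    ("2019-06-19T11:21:39.129006+05:30", "2019-06-19T11:33:33.966000+05:30"),
    ("2019-06-19T11:37:20.094655+05:30", "2019-06-19T11:44:37.301000+05:30"),
    ("2019-06-19T12:32:53.399599+05:30", "2019-06-19T12:45:46.527000+05:30"),
]

def parse(pair):
    """Turn a pair of ISO 8601 strings into timezone-aware datetimes."""
    return tuple(datetime.fromisoformat(s) for s in pair)

trip_ranges = [parse(t) for t in trips]
confirm_ranges = [parse(c) for c in confirms]

candidates_per_trip = []
for t_start, t_end in trip_ranges:
    # Assumed rule: a confirm is a candidate when its start lies within the trip.
    n = sum(1 for c_start, _c_end in confirm_ranges
            if t_start <= c_start <= t_end)
    candidates_per_trip.append(n)

for (start, _end), n in zip(trips, candidates_per_trip):
    print(start, "->", n, "candidate(s)")
```

Under that assumption the five confirm objects cover only four trips, because two of them fall on the 11:08 trip. That matches the observation that a raw count of confirm objects can exceed the number of trips that show a label.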

shankari commented 5 years ago

Please reopen the issue and send me the logs of the days with the mismatch if this is not true.

deepalics0044 commented 5 years ago

After testing, I agree: the additional confirm objects were duplicates, and only the last one is picked. The function works perfectly.

deepalics0044 commented 5 years ago

Screenshot from 2019-07-10 10-15-02

deepalics0044 commented 5 years ago

This is the code I used:

countLabel = []   # mode label per trip (None if the trip was not confirmed)
countLabel1 = []  # purpose label per trip (None if the trip was not confirmed)
for ct in entry_it:

    cte = ecwe.Entry(ct)

    print("=== Trip:", cte.data.start_loc,"->",cte.data.start_fmt_time, "->", cte.data.end_loc,"->", cte.data.end_fmt_time)

    user_label = esdt.get_user_input_for_trip("analysis/cleaned_trip", test_user_id, cte.get_id(), "manual/mode_confirm")
    user_label1 = esdt.get_user_input_for_trip("analysis/cleaned_trip", test_user_id, cte.get_id(), "manual/purpose_confirm")
    print("=== Mode:", user_label.data.label if user_label is not None else None)
    print("=== Purpose:", user_label1.data.label if user_label1 is not None else None)
    countLabel.append(user_label.data.label if user_label is not None else None)
    countLabel1.append(user_label1.data.label if user_label1 is not None else None)
    # Sections are retrieved here but not used further in this snippet
    section_it = esdt.get_sections_for_trip("analysis/cleaned_section", test_user_id, cte.get_id())

# Drop the unconfirmed trips before counting
countLabel = [i for i in countLabel if i]
print('Number of modes entered:',len(countLabel))

countLabel1 = [j for j in countLabel1 if j]
print('Number of purpose entered:',len(countLabel1))

shankari commented 5 years ago

@deepalics0044 thanks for contributing! If you have time, you could submit this (either as a notebook or as a standalone script) with a pull request...

deepalics0044 commented 5 years ago

The code below helped me extract the data in tabular form:

countLabel = []
countLabel1 = []
result = []
for index,user in all_users.iterrows():
    #if(index not in [4,5]):
    #    continue
    print('USER ID: ',user.uuid)
    print('INDEX: ',index)
    print('-------------------------------------')
    ts = esta.TimeSeries.get_time_series(user.uuid)
    entry_it = ts.find_entries(["analysis/cleaned_trip"], time_query=None)
    userTrips = []
    for ct in entry_it:

        cte = ecwe.Entry(ct)

        #print("=== Trip:", cte.data.start_loc,"->",cte.data.start_fmt_time, "->", cte.data.end_loc,"->", cte.data.end_fmt_time)

        user_label = esdt.get_user_input_for_trip("analysis/cleaned_trip", user.uuid, cte.get_id(), "manual/mode_confirm")
        user_label1 = esdt.get_user_input_for_trip("analysis/cleaned_trip", user.uuid, cte.get_id(), "manual/purpose_confirm")
        #print("=== Mode:", user_label.data.label if user_label != None else None)
        #print("=== Purpose:", user_label1.data.label if user_label1 != None else None)
        #countLabel.append(user_label.data.label if user_label != None else None)
        #countLabel1.append(user_label1.data.label if user_label1 != None else None)
        section_it = esdt.get_sections_for_trip("analysis/cleaned_section", user.uuid, cte.get_id())
        testFrame = pd.DataFrame.from_dict({user.uuid:cte.data},orient='index')
        testFrame['mode'] = user_label.data.label if user_label is not None else None
        testFrame['purpose'] = user_label1.data.label if user_label1 is not None else None
        userTrips.append(testFrame)
    if userTrips:
        userTrips = pd.concat(userTrips,ignore_index=True)
        userTrips['uuid'] = user.uuid
        result.append(userTrips)

    #countLabel = [i for i in countLabel if i]
    #print('Number of modes entered:',len(countLabel))

    #countLabel1 = [j for j in countLabel1 if j]
    #print('Number of purpose entered:',len(countLabel1))
    print('-------------------------------------')
result = pd.concat(result,ignore_index=True)
#print(result)
result.to_csv('./trips.csv',index=False)