e-mission / op-admin-dashboard

An admin/deployer dashboard for the NREL OpenPATH platform

Switch to the timeseries to read data instead of accessing mongodb directly #29

Closed shankari closed 1 year ago

shankari commented 1 year ago

I had flagged this initially as a "best practice" https://github.com/e-mission/op-admin-dashboard/pull/7#discussion_r1146866244 https://github.com/e-mission/op-admin-dashboard/pull/7#discussion_r1142877187

However, this turned into a P0 because AWS DocumentDB (which NREL uses as a managed MongoDB-compatible service, and which is close to, but not 100%, compatible with MongoDB) does not seem to support the projections that are used.

We get the error:

  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/dash/_callback.py", line 447, in add_context
    output_value = func(*func_args, **func_kwargs)  # %% callback invoked %%
  File "app_sidebar_collapsible.py", line 190, in update_store_trips
    df = query_confirmed_trips(start_date_obj, end_date_obj)
  File "/usr/src/app/utils/db_utils.py", line 66, in query_confirmed_trips
    df = pd.json_normalize(list(query_result))
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/cursor.py", line 1124, in _refresh
    self.__send_message(q)
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/cursor.py", line 1001, in __send_message
    address=self.__address)
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1372, in _run_operation_with_response
    exhaust=exhaust)
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1471, in _retryable_read
    return func(session, server, sock_info, slave_ok)
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1366, in _cmd
    unpack_res)
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/server.py", line 137, in run_operation_with_response
    first, sock_info.max_wire_version)
  File "/root/miniconda-4.12.0/envs/emission/lib/python3.7/site-packages/pymongo/helpers.py", line 168, in _check_command_response
    max_wire_version)
pymongo.errors.OperationFailure: Internal server error, full error: {'ok': 0.0, 'operationTime': Timestamp(1681425547, 1), 'code': 42, 'errmsg': 'Internal server error'}

Per NREL IT,

It's an AWS-managed DB; it does not expose logs directly like other application logs. It only provides metrics and auditing of calls/events around the cluster, which is not that helpful for this issue.

The NREL core team makes sure that the timeseries calls work against both MongoDB and DocumentDB, so using them exclusively to make DB calls would reduce maintenance burden overall.
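One way around the failing server-side projection is to fetch the full documents and select the columns client-side after flattening, which is roughly what routing reads through the timeseries layer achieves. A minimal sketch (the function and variable names here are illustrative, not the actual db_utils.py code):

```python
import pandas as pd

def trips_to_df(query_result, wanted_cols):
    """Flatten raw documents and project client-side instead of in the query.

    `query_result` stands in for the iterable of documents returned by the
    timeseries call; no server-side projection is requested, so DocumentDB's
    'code': 42 failure is avoided.
    """
    df = pd.json_normalize(list(query_result))
    # Keep only the columns that actually exist, so schema drift or an
    # empty result set does not raise a KeyError.
    present = [c for c in wanted_cols if c in df.columns]
    return df[present]

# Illustrative documents shaped like confirmed trips
docs = [
    {"user_id": "u1", "data": {"distance": 1200.5, "duration": 300.0}},
    {"user_id": "u2", "data": {"distance": 800.0, "duration": 150.0}},
]
df = trips_to_df(iter(docs), ["user_id", "data.distance", "data.duration"])
```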

shankari commented 1 year ago

With this fix, the dashboard loads in the NREL environment without errors.

Screenshot 2023-04-15 at 10 27 59 AM
shankari commented 1 year ago

Keeping the issue open to track the hack removals by @AlirezaRa94

shankari commented 1 year ago

Note that if there is no data, this fails with

ERROR:app_sidebar_collapsible:Exception on /_dash-update-component [POST]
Traceback (most recent call last):
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/flask/app.py", line 2528, in wsgi_app
    response = self.full_dispatch_request()
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/dash_auth/basic_auth.py", line 36, in wrap
    response = f(*args, **kwargs)
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/dash/dash.py", line 1283, in dispatch
    ctx.run(
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/dash/_callback.py", line 447, in add_context
    output_value = func(*func_args, **func_kwargs)  # %% callback invoked %%
  File "/usr/src/app/app_sidebar_collapsible.py", line 188, in update_store_trips
    df = query_confirmed_trips(start_date_obj, end_date_obj)
  File "/usr/src/app/utils/db_utils.py", line 85, in query_confirmed_trips
    df = df[["user_id", "data.start_fmt_time", "data.end_fmt_time", "data.distance", "data.duration", "data.start_loc.coordinates", "data.end_loc.coordinates"]]
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/frame.py", line 3464, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/indexing.py", line 1314, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis)
  File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/indexing.py", line 1374, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['user_id', 'data.start_fmt_time', 'data.end_fmt_time', 'data.distance',\n       'data.duration', 'data.start_loc.coordinates',\n       'data.end_loc.coordinates'],\n      dtype='object')] are in the [columns]"

We may want to highlight that there is no data in this case instead of letting the error be swallowed silently
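A minimal sketch of such a guard (the column list and helper name are assumptions, not the real query_confirmed_trips code): when the query returns nothing, json_normalize produces a frame with no columns, so the later column selection raises the KeyError; returning an empty frame with the expected columns lets the UI show a "no data" state instead.

```python
import pandas as pd

# Hypothetical subset of the columns selected in db_utils.py
EXPECTED_COLS = ["user_id", "data.start_fmt_time", "data.end_fmt_time"]

def select_trip_cols(df):
    if df.empty:
        # No trips in this date range: return an empty frame with the
        # expected columns so the caller can display "no data" explicitly
        # instead of crashing on the column selection.
        return pd.DataFrame(columns=EXPECTED_COLS)
    return df[EXPECTED_COLS]

# An empty query result produces a frame with no columns at all
empty = select_trip_cols(pd.json_normalize([]))
```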

shankari commented 1 year ago

@AlirezaRa94 This does not have any errors, but:

1. I cannot see any user labels in the trip table
2. I see the start and end place, which are not needed/relevant since they are internal links to other objects in the data model

Screenshot 2023-05-01 at 12 07 31 PM

I am fixing (2), which is fairly straightforward but not sure why (1) doesn't work.

shankari commented 1 year ago

(1) does not work because there are no additional paths identified in the study-program config. I am not sure how this was ever tested with the dynamic config. @AlirezaRa94, please indicate "testing done" and "testing process" in the future. The relevant config entries are all empty:

        "data_trips_columns_exclude": [],
        "additional_trip_columns": [],
        "data_uuids_columns_exclude": [],
shankari commented 1 year ago

I removed all the unnecessary fields using "data_trips_columns_exclude". The good news is that they were all removed! The bad news is that other fields are not displayed correctly; namely, the start and end fmt times are both missing.

Screenshot 2023-05-01 at 12 20 49 PM
shankari commented 1 year ago

After adding a bunch of logs, I see that these values are being returned but not displayed. I note that the values that are displayed (except for user_id) are prefixed with "data." Coincidence?

{'data.duration': 9358.15185546875, 'user_id': 'fc8f71a3-8c82-458d-bf97-18c3ee11a0f3', 'data.distance': 160547.62160904994, 'trip_start_time_str': '2022-12-31T08:44:23.982959-07:00', 'trip_end_time_str': '2022-12-31T11:20:22.134815-07:00', 'start_coordinates': [...], 'end_coordinates': [...]}
shankari commented 1 year ago

I can confirm that if I comment out this line

df = df.drop(columns=[named_col['path']])

from https://github.com/e-mission/op-admin-dashboard/commit/d977851530c2b6e23ef1978f3643befb93bad26a#diff-1c6b8e6d103286796ce21a8276c4a4d8b258e29d6b9cc6df516a92accf4674d1R72

then I do see the expected columns. So either it is the "data" prefix, or there is some other filtering going on.

Screenshot 2023-05-01 at 3 02 02 PM
shankari commented 1 year ago

It is not the "data" prefix.

I changed

    {'label': 'data.trip_end_time_str', 'path': 'data.end_fmt_time'},

and reinstated the drop line

df = df.drop(columns=[named_col['path']])

The records looked like

{'data.distance': 15145.48585499552, 'data.duration': 892.0510001182556, 'user_id': 'bd9cffc8-dbf1-402d-a479-f9f148ec9e60', 'trip_start_time_str': '2021-09-20T16:00:08-06:00', 'data.trip_end_time_str': '2021-09-20T16:15:00.051000-06:00', 'start_coordinates': [...], 'end_coordinates': [...]}

but the end time string was not displayed

Screenshot 2023-05-01 at 3 13 37 PM

It looks like there is some additional filtering happening, potentially against VALID_TRIP_COLS, but I don't see where. I am going to hack around this by commenting out the drop statement (since we don't currently end up with duplicates) and will let @AlirezaRa94 fix it the right way once he returns.
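For reference, the rename-then-drop pattern under discussion reduces to a few lines of pandas (names below are illustrative, not the real module): the value survives the drop because it was copied to the label column first, so the loss must happen in later display filtering rather than here.

```python
import pandas as pd

# Illustrative named-column mapping, mirroring the {'label': ..., 'path': ...}
# entries in the dashboard config
named_cols = [{"label": "trip_end_time_str", "path": "data.end_fmt_time"}]

df = pd.DataFrame([{"data.end_fmt_time": "2021-09-20T16:15:00-06:00"}])
for named_col in named_cols:
    # Copy the path column to its display label, then drop the original path;
    # the data itself is preserved under the label name.
    df[named_col["label"]] = df[named_col["path"]]
    df = df.drop(columns=[named_col["path"]])
```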

shankari commented 1 year ago

Aha! I think I figured it out. In pages/data.py, we see

    elif tab == 'tab-trips-datatable':
        data = store_trips["data"]
        columns = perm_utils.get_allowed_trip_columns()
        columns.update(
            col['label'] for col in perm_utils.get_allowed_named_trip_columns()
        )
        has_perm = perm_utils.has_permission('data_trips')

and the allowed trip columns are basically only the ones in VALID_TRIP_COLS

def get_allowed_trip_columns():
    columns = set(constants.VALID_TRIP_COLS)
    for column in permissions.get("data_trips_columns_exclude", []):
        columns.discard(column)
    return columns
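The interaction between these two snippets can be reproduced with plain sets (the names below are illustrative): any named-column label, such as trip_start_time_str, is absent from VALID_TRIP_COLS, so the display filter drops it even though the value is present in the record.

```python
# Illustrative stand-in for constants.VALID_TRIP_COLS
VALID_TRIP_COLS = {"user_id", "data.distance", "data.duration"}

# A record as returned by the query, including a named-column label
record = {
    "user_id": "u1",
    "data.distance": 15145.5,
    "data.duration": 892.1,
    "trip_start_time_str": "2021-09-20T16:00:08-06:00",
}

# The display filter only keeps keys found in the allowed set, so the
# label column silently disappears from the table.
allowed = set(VALID_TRIP_COLS)
displayed = {k: v for k, v in record.items() if k in allowed}
dropped = set(record) - allowed
```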
shankari commented 1 year ago

@AlirezaRa94 This is seriously problematic because it means that we will not be able to display any labels, which means that the trip table is largely useless for admins.

shankari commented 1 year ago

OK, so I refactored the code in a somewhat sloppy fashion to create a new get_display_trip_columns, which returns the labels from the named columns:

def get_display_trip_columns():
    columns = set()
    # logging.debug("get_display_trip_columns: curr set is %s" % columns)
    columns.update(get_allowed_trip_columns())
    # logging.debug("get_display_trip_columns: curr set is %s" % columns)
    columns.update(
        col['label'] for col in get_allowed_named_trip_columns()
    )
    # logging.debug("get_display_trip_columns: curr set is %s" % columns)
    columns.update(get_display_required_columns())
    # logging.debug("get_display_trip_columns: curr set is %s" % columns)
    return columns

But now I get an error:

Invalid argument `data[0].start_coordinates` passed into DataTable.
Expected one of type [string, number, boolean].

Although the start coordinates seem to be formatted the same way in both cases.

{'user_id': 'bd9cffc8-dbf1-402d-a479-f9f148ec9e60', 'data.distance': 15145.48585499552, 'data.duration': 892.0510001182556, 'data.end_loc.coordinates': [....,....], 'trip_start_time_str': '2021-09-20T16:00:08-06:00', 'trip_end_time_str': '2021-09-20T16:15:00.051000-06:00', 'start_coordinates': [..., ....]}
{'data.distance': 15145.48585499552, 'user_id': 'bd9cffc8-dbf1-402d-a479-f9f148ec9e60', 'data.start_loc.coordinates': [...,...], 'data.end_loc.coordinates': [...,...], 'data.duration': 892.0510001182556, 'trip_start_time_str': '2021-09-20T16:00:08-06:00', 'trip_end_time_str': '2021-09-20T16:15:00.051000-06:00'}
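One way to resolve this error (sketched here with a hypothetical helper name): Dash DataTable cells only accept string, number, or boolean values, so list-valued columns such as the coordinates need to be serialized to strings before the records are passed to the table.

```python
def stringify_list_cells(records):
    """Serialize list-valued cells (e.g. [lon, lat] coordinates) to strings,
    since DataTable rejects anything that is not string/number/boolean."""
    out = []
    for rec in records:
        out.append({
            k: (", ".join(map(str, v)) if isinstance(v, list) else v)
            for k, v in rec.items()
        })
    return out

rows = stringify_list_cells([
    {"user_id": "u1", "start_coordinates": [-105.08, 39.74]},
])
```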
shankari commented 1 year ago

Set up the multilabel defaults, corresponding to https://github.com/e-mission/op-admin-dashboard/issues/39

MULTILABEL_NAMED_COLS = [
    {'label': 'mode_confirm', 'path': 'data.user_input.mode_confirm'},
    {'label': 'purpose_confirm', 'path': 'data.user_input.purpose_confirm'},
    {'label': 'replaced_mode', 'path': 'data.user_input.replaced_mode'},
]

and although I have commented out the code that drops the paths, I see only the labels and not the paths. This seems to be working (sort of) for now, so I will defer a more principled fix until @AlirezaRa94 returns.

Screenshot 2023-05-01 at 10 28 27 PM