Demographic table cleanup

achasmita commented 12 months ago

We need to remove extraneous columns from the demographic table.

shankari commented 11 months ago

After the demographic table was added, a couple of the programs had their admin dashboard fail to start up. The error is:

Traceback (most recent call last):
File "/usr/src/app/app_sidebar_collapsible.py", line 169, in <module>
demographics_data = update_store_demographics()
File "/usr/src/app/app_sidebar_collapsible.py", line 160, in update_store_demographics
df = query_demographics()
File "/usr/src/app/utils/db_utils.py", line 116, in query_demographics
df.drop(columns=['xmlns:jr', 'xmlns:orx', 'id'], inplace = True)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/frame.py", line 5399, in drop
return super().drop(
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/generic.py", line 4505, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/generic.py", line 4546, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/root/miniconda-23.1.0/envs/emission/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 6934, in drop
raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: "['xmlns:jr', 'xmlns:orx', 'id'] not found in axis"
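
For reference, the failure can be reproduced outside the dashboard. This is a minimal sketch, assuming a hypothetical DataFrame that simply lacks the three columns; the column names `user_id` and `data.ts` are placeholders, not the real dashboard schema. With the default `errors="raise"`, `DataFrame.drop` raises a `KeyError` when a requested label is missing.

```python
import pandas as pd

# Hypothetical stand-in for a deployment whose survey responses
# do not contain the 'xmlns:jr', 'xmlns:orx', and 'id' columns.
df = pd.DataFrame({"user_id": ["u1"], "data.ts": [1.0]})

try:
    # Same call as in query_demographics(); errors="raise" is the default.
    df.drop(columns=["xmlns:jr", "xmlns:orx", "id"], inplace=True)
    drop_error = None
except KeyError as e:
    drop_error = e

print(repr(drop_error))
```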

@achasmita

achasmita commented 11 months ago

I was directly excluding those columns from the demographic table before, but those columns may not be in all datasets, so I think this error should go away after applying the new changes.

shankari commented 11 months ago

I first thought that this may be because there are no demographic entries, but there are:

>>> edb.get_timeseries_db().find().distinct("metadata.key")
['background/filtered_location', 'manual/demographic_survey', 'config/app_ui_config', 'stats/client_nav_event', 'stats/server_api_time', 'background/battery', 'stats/client_error', 'background/location', 'statemachine/transition', 'background/motion_activity', 'config/consent', 'stats/client_time', 'stats/pipeline_time']

shankari commented 11 months ago

Digging a bit deeper: there is one entry

>>> pd.json_normalize(list(edb.get_timeseries_db().find({"metadata.key": "manual/demographic_survey"})))
                        _id  ... data.local_dt.timezone
[1 rows x 67 columns]

and it does have the columns but with a different prefix

>>> pd.json_normalize(list(edb.get_timeseries_db().find({"metadata.key": "manual/demographic_survey"}))).columns
...
       'data.jsonDocResponse.data.__version__',
       'data.jsonDocResponse.data.meta.instanceID',
       'data.jsonDocResponse.data.attrxmlns:jr',
       'data.jsonDocResponse.data.attrxmlns:orx',
       'data.jsonDocResponse.data.attrid',
       'data.jsonDocResponse.data.attrversion', 'data.ts', 'data.fmt_time',
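
To illustrate where the long prefix comes from: `pd.json_normalize` flattens nested dictionaries into dot-separated column names, so a field nested under `data.jsonDocResponse.data` never shows up as a bare `id`. The entry below is a hypothetical, heavily trimmed stand-in with placeholder values, not a real survey response.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for one demographic survey entry.
# All values are placeholders.
entry = {
    "metadata": {"key": "manual/demographic_survey"},
    "data": {
        "ts": 1.0,
        "jsonDocResponse": {
            "data": {
                "attrxmlns:jr": "placeholder",
                "attrid": "placeholder",
            }
        },
    },
}

# Nested keys are joined with "." to form the flat column names.
flat = pd.json_normalize([entry])
columns = list(flat.columns)
print(columns)
```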

shankari commented 11 months ago

@achasmita I think you mean that the change discussed in https://github.com/e-mission/op-admin-dashboard/pull/67/files#r1330803349, where we check whether the column exists before dropping it, will fix the problem.

Is that correct?
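
As a rough sketch of the "check before dropping" idea (not the code from the PR): the helper name `drop_if_present` and the sample DataFrame are hypothetical. pandas also accepts `errors="ignore"` on `DataFrame.drop`, which has the same effect for missing labels.

```python
import pandas as pd

COLS_TO_DROP = ["xmlns:jr", "xmlns:orx", "id"]

def drop_if_present(df, candidates):
    """Drop only the candidate columns that actually exist in df."""
    present = [c for c in candidates if c in df.columns]
    return df.drop(columns=present)

# Hypothetical data: only one of the three candidate columns exists.
df = pd.DataFrame({"user_id": ["u1"], "id": ["form1"]})

cleaned = drop_if_present(df, COLS_TO_DROP)

# Alternative: let pandas skip the missing labels itself.
alt = df.drop(columns=COLS_TO_DROP, errors="ignore")
```

Either form leaves the dashboard able to start when a deployment's responses lack some of the columns.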

achasmita commented 11 months ago

I saw that those columns have attr prepended to the previous column names, so I have included both forms in the new changes:
EXCLUDED_DEMOGRAPHICS_COLS = [
    'data.xmlResponse',
    'data.name',
    'data.version',
    'data.label',
    'xmlns:jr',
    'xmlns:orx',
    'id',
    'start',
    'end',
    'attrxmlns:jr',
    'attrxmlns:orx',
    'attrid',
    '__version__',
    'attrversion',
    'instanceID',
]
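
One way such a list could be applied to the fully prefixed column names shown earlier is to compare either the full name or the last dot-separated component. This is only a sketch under that assumption; the helper `is_excluded` is hypothetical and the dashboard may match columns differently. The list here is a subset of the one above.

```python
import pandas as pd

# Subset of the exclusion list above, for illustration only.
EXCLUDED = ["data.name", "xmlns:jr", "id", "attrxmlns:jr", "attrid", "instanceID"]

def is_excluded(column, excluded):
    """True if the full name or its last dotted component is excluded."""
    return column in excluded or column.rsplit(".", 1)[-1] in excluded

# Empty frame with a few of the column names seen in the thread.
df = pd.DataFrame(columns=[
    "data.name",
    "data.ts",
    "data.jsonDocResponse.data.attrid",
    "data.jsonDocResponse.data.attrxmlns:jr",
    "data.jsonDocResponse.data.meta.instanceID",
])

kept = [c for c in df.columns if not is_excluded(c, EXCLUDED)]
trimmed = df[kept]
```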

shankari commented 11 months ago

I am really curious as to why this was in previous survey responses but not in the most recent ones. We did switch to a newer version of enketo as part of the react rewrite.

Does switching to a newer version of enketo change the survey metadata? We might want to experiment with that and then set up a process for checking and fixing this when we upgrade enketo.

@JGreenlee @Abby-Wheelis

shankari commented 11 months ago

I'm going to merge this now since it is a showstopper: the two most recent deployments, uue and ride2own, failed to upgrade because of this. Note that this is a server-only change.

e-mission / op-admin-dashboard

Demographic table cleanup #65