e-mission / op-admin-dashboard

An admin/deployer dashboard for the NREL OpenPATH platform

Trajectory table not visible in staging. #70

Closed achasmita closed 10 months ago

achasmita commented 10 months ago

It does work on dev against the Durham dataset.

shankari commented 10 months ago

Here's what I see in the admin dashboard logs

...
'data.section': '644ac9d8e7d077320f6b250a', 'data.filter': nan, 'data.floor': nan, 'data.sensed_speed': nan, 'data.vaccuracy': nan, 'data.accuracy': nan, 'data.elapsedRealtimeNanos': nan, 'data.local_dt_year': nan, 'data.local_dt_month': nan, 'data.local_dt_day': nan, 'data.local_dt_hour': nan, 'data.local_dt_minute': nan, 'data.local_dt_second': nan, 'data.local_dt_weekday': nan, 'data.local_dt_timezone': nan, 'data.user_id': nan, 'data.metadata_write_ts': nan}], export_format='csv', page_current=0, page_size=50, filter_options={'case': 'sensitive'}, sort_action='native', sort_mode='single', style_table={'overflowX': 'auto'}, style_cell={'textAlign': 'left'})`
--
In general, Dash properties can only be
dash components, strings, dictionaries, numbers, None,
or lists of those.
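For context, this error typically surfaces when a DataTable is handed cell values that are not JSON-serializable primitives. A minimal sketch of the failure mode (not the dashboard's actual code; the column values are made up):

import pandas as pd
from dash import dash_table

df = pd.DataFrame({
    'data.section': ['644ac9d8e7d077320f6b250a'],  # plain string: fine
    'data.user_id': [b'\x01\x02'],                 # raw bytes: not serializable
})

# Dash raises the "properties can only be ... strings, dictionaries, numbers,
# None, or lists of those" error when it serializes this component to JSON.
table = dash_table.DataTable(
    data=df.to_dict('records'),
    columns=[{'name': c, 'id': c} for c in df.columns],
)
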
shankari commented 10 months ago

@achasmita do you see nan values in the trajectory table in your testing?

achasmita commented 10 months ago

@achasmita do you see nan values in the trajectory table in your testing?

All of those columns are empty in my table, so it is working fine on my side. Screen Shot 2023-09-28 at 3 55 39 PM

achasmita commented 10 months ago

And I do not see NaN values either.

shankari commented 10 months ago

Maybe the NaN values are just not displayed. Can you print out the last 5-6 rows to see if they do in fact have NaN values?
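A quick way to do that check, assuming the loaded trajectory DataFrame is called df (hypothetical name):

>>> df.tail(6)                 # inspect the last few rows directly
>>> df.tail(6).isna().sum()    # count NaNs per column in those rows
>>> df.isna().all()            # which columns are entirely NaN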

achasmita commented 10 months ago

There are actually NaN values in the previous dataset. Screen Shot 2023-09-28 at 5 01 11 PM

shankari commented 10 months ago

@achasmita

Poking around on staging, this is what I see. If I try to load all trajectories, it takes a long time to load. If I filter to only Sept (1/9 to 23/9), though, I get the same data as the previously loaded tab.

UUID table

Screenshot 2023-09-28 at 5 24 44 PM

Demographic tab

Screenshot 2023-09-28 at 5 26 56 PM

Trajectory tab (still shows demographics)

Screenshot 2023-09-28 at 5 29 18 PM

UUID tab

Screenshot 2023-09-28 at 5 30 30 PM

Trajectory tab (still shows UUID)

Screenshot 2023-09-28 at 5 31 51 PM

Is this what you meant by "trajectory table not visible"? Again, detailed reporting would be good.

shankari commented 10 months ago

@achasmita is the trajectory table being filtered by the time range at the top of the dashboard? I don't remember seeing that during the review.

achasmita commented 10 months ago

@achasmita is the trajectory table being filtered by the time range at the top of the dashboard? I don't remember seeing that during the review.

No it is not filtered by time range

achasmita commented 10 months ago

Is this what you meant by "trajectory table not visible"? Again, detailed reporting would be good.

Yes, it is loading the previously loaded table instead of the trajectory table.

shankari commented 10 months ago

@achasmita

No it is not filtered by time range

We should make sure that it is filtered by the time range since it is also time dependent, just like the trip table. That would also make it easier to debug issues like this since we can load the table in sections and find the time range with the invalid data.
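As a sketch of what that filter could look like (start_ts and end_ts stand in for the epoch bounds from the dashboard's date picker; the collection and key follow the queries used later in this thread):

from datetime import datetime, timezone

import emission.core.get_database as edb

# hypothetical bounds matching the 1/9 to 23/9 range mentioned above
start_ts = datetime(2023, 9, 1, tzinfo=timezone.utc).timestamp()
end_ts = datetime(2023, 9, 23, tzinfo=timezone.utc).timestamp()

query = {
    "metadata.key": "analysis/recreated_location",
    "data.ts": {"$gte": start_ts, "$lt": end_ts},
}
trajectories = list(edb.get_analysis_timeseries_db().find(query))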

achasmita commented 10 months ago

I found that there are some extra columns in this dataset compared to the previous dataset, so I am excluding these columns to load the table for now.

EXCLUDED_TRAJECTORIES_COLS = [
    'data.local_dt_year',
    'data.local_dt_month',
    'data.local_dt_day',
    'data.local_dt_hour',
    'data.local_dt_minute',
    'data.local_dt_second',
    'data.local_dt_weekday',
    'data.local_dt_timezone',
    'data.user_id',
    'data.metadata_write_ts',
]
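For illustration, dropping these columns could be as simple as the following sketch, assuming df is the json_normalize'd trajectory DataFrame:

# errors="ignore" so datasets that never had these columns still load
df = df.drop(columns=EXCLUDED_TRAJECTORIES_COLS, errors="ignore")
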
achasmita commented 10 months ago

Most of the above columns have NaN values. I am still checking the column values. Also, there are already similar columns in the table:

'data.local_dt.year',
'data.local_dt.month',
'data.local_dt.day',
'data.local_dt.hour',
'data.local_dt.minute',
'data.local_dt.second',
'data.local_dt.weekday',
'data.local_dt.timezone',
achasmita commented 10 months ago

For now I am checking the first five and last five rows for the excluded columns. Screen Shot 2023-09-28 at 7 15 39 PM

shankari commented 10 months ago

Most of the above columns have NaN values. I am still checking the column values. Also, there are already similar columns in the table:

That is so bizarre. Let me see what I can find from the raw data

shankari commented 10 months ago

so the stage database has a lot of locations

>>> import emission.core.get_database as edb

>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/recreated_location"})
322546

and most of them have data.local_dt.year

>>> edb.get_analysis_timeseries_db().count_documents({"data.local_dt.year": {"$exists": True}})
322535

there are only a handful of entries for data.local_dt_year

>>> edb.get_analysis_timeseries_db().count_documents({"data.local_dt_year": {"$exists": True}})
11
shankari commented 10 months ago

Looking more closely at those entries, most of them are from Apr of this year, although there are some from as late as June

>>> import pandas as pd
>>> all_broken_data = pd.json_normalize(list(edb.get_analysis_timeseries_db().find({"data.local_dt_year": {"$exists": True}})))
>>> all_broken_data[["metadata.write_fmt_time"]]
             metadata.write_fmt_time
0   2023-04-06T12:35:19.462114-07:00
1   2023-04-13T14:19:07.463665-07:00
2   2023-04-13T14:19:40.625654-07:00
3   2023-04-13T14:19:25.698613-07:00
4   2023-04-13T14:26:30.323517-07:00
5   2023-04-13T14:26:35.589886-07:00
6   2023-04-13T14:26:52.780819-07:00
7   2023-04-13T14:30:46.099731-07:00
8   2023-05-04T18:24:49.723777-07:00
9   2023-05-19T10:18:11.300575-07:00
10  2023-06-13T13:19:53.900964-07:00

And all of them have data.local_dt_year

>>> all_broken_data[["metadata.write_fmt_time", "data.local_dt_year"]]
             metadata.write_fmt_time  data.local_dt_year
0   2023-04-06T12:35:19.462114-07:00                2022
1   2023-04-13T14:19:07.463665-07:00                2022
2   2023-04-13T14:19:40.625654-07:00                2022
3   2023-04-13T14:19:25.698613-07:00                2022
4   2023-04-13T14:26:30.323517-07:00                2022
5   2023-04-13T14:26:35.589886-07:00                2022
6   2023-04-13T14:26:52.780819-07:00                2022
7   2023-04-13T14:30:46.099731-07:00                2023
8   2023-05-04T18:24:49.723777-07:00                2016
9   2023-05-19T10:18:11.300575-07:00                2023
10  2023-06-13T13:19:53.900964-07:00                2023

But do not have data.local_dt.year

>>> "data.local_dt.year" in all_broken_data.columns
False
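A quick illustration (not from the dashboard code) of why both spellings can coexist: pd.json_normalize flattens nested dicts with "." separators, so a properly nested local_dt becomes data.local_dt.year, while a flat local_dt_year field stays as data.local_dt_year.

>>> good = {"data": {"local_dt": {"year": 2023}}}
>>> bad = {"data": {"local_dt_year": 2023}}
>>> pd.json_normalize([good, bad]).columns.tolist()
['data.local_dt.year', 'data.local_dt_year']
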
shankari commented 10 months ago

@achasmita So it's pretty clear what is going on. Can you summarize the problem, and outline a principled solution? Given that this is a data issue, we should change the code to be robust to the data. I would prefer to not have a bunch of hardcoded workarounds determined by trial and error

achasmita commented 10 months ago

I began by excluding the additional columns present in this table compared to the previous table, as most of them contained NaN values. Now I have removed them from the excluded list and the table is functioning well, with the exception of the data.user_id column. So, currently only the data.user_id column is excluded.

achasmita commented 10 months ago

Screen Shot 2023-09-28 at 8 23 17 PM

Screen Shot 2023-09-28 at 8 24 16 PM

Screen Shot 2023-09-28 at 8 24 51 PM

Screen Shot 2023-09-28 at 8 25 19 PM

Screen Shot 2023-09-28 at 8 26 53 PM

shankari commented 10 months ago

I began by excluding the additional columns present in this table compared to the previous table, as most of them contained NaN values. Now I have removed them from the excluded list and the table is functioning well, with the exception of the data.user_id column. So, currently only the data.user_id column is excluded.

@achasmita ok... that is not what I was expecting. But that doesn't explain why the table worked for Durham but not for stage. Does Durham not have data.user_id?

achasmita commented 10 months ago

Yes, there was no data.user_id column in Durham. First I tried replacing NaN values with " ", but NaN values were not the problem; all the other columns are working, and the issue is that data.user_id includes some binary data.

achasmita commented 10 months ago

There are no more excluded columns. Screen Shot 2023-09-28 at 8 39 49 PM Screen Shot 2023-09-28 at 8 40 45 PM Screen Shot 2023-09-28 at 8 41 18 PM Screen Shot 2023-09-28 at 8 41 55 PM Screen Shot 2023-09-28 at 8 42 18 PM

shankari commented 10 months ago

@achasmita Interesting - let me poke at the data again to see where those entries came from. I am glad to see that you were able to identify and work around it through manual inspection. But this indicates that dash is not robust to binary data in the tables. What if there is some other column in some other partner deployment that has one binary value? Our trajectory load will break.

Please look through the pandas documentation - let's see if there is a more principled fix that makes the system robust to bad data. If it takes too long to figure out a workaround, I am fine with filing an issue and putting it into the next release. However, if it is a simple fix, it would be great to include it now.

shankari commented 10 months ago

Re-running my previous commands, I see

>>> edb.get_analysis_timeseries_db().count_documents({"data.user_id": {"$exists": True}})
11
>>> edb.get_analysis_timeseries_db().count_documents({"data.user_id": {"$exists": False}})
1131337

It looks like these are basically the same 11 incorrect entries; every row in all_broken_data has data.user_id. @achasmita, as a hint, the dtype of data.user_id is object. Maybe that can give us a clue as to how to identify binary values...

>>> all_broken_data['data.user_id']
Name: data.user_id, dtype: object
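One way to see what is actually hiding in an object column is to look at the underlying Python types; a diagnostic sketch:

>>> all_broken_data["data.user_id"].map(type).value_counts()
>>> for col in all_broken_data.select_dtypes(include="object").columns:
...     print(col, all_broken_data[col].map(type).unique())
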
achasmita commented 10 months ago

Please look through the pandas documentation - let's see if there is a more principled fix that makes the system robust to bad data.

I will look into it.

shankari commented 10 months ago

@achasmita It looks like both strings and objects are marked with dtype object

>>> all_broken_data.dtypes
_id                                  object
user_id                              object
metadata.key                         object
metadata.platform                    object
metadata.write_ts                   float64
metadata.time_zone                   object
metadata.write_local_dt.year          int64
metadata.write_local_dt.month         int64
metadata.write_local_dt.day           int64
metadata.write_local_dt.hour          int64
metadata.write_local_dt.minute        int64
metadata.write_local_dt.second        int64
metadata.write_local_dt.weekday       int64
metadata.write_local_dt.timezone     object
metadata.write_fmt_time              object
data.accuracy                       float64
data.altitude                       float64
data.elapsedRealtimeNanos             int64
data.filter                          object
data.fmt_time                        object
data.latitude                       float64
data.longitude                      float64
data.sensed_speed                   float64
data.ts                             float64
data.loc.type                        object
data.loc.coordinates                 object
data.local_dt_year                    int64
data.local_dt_month                   int64
data.local_dt_day                     int64
data.local_dt_hour                    int64
data.local_dt_minute                  int64
data.local_dt_second                  int64
data.local_dt_weekday                 int64
data.local_dt_timezone               object
data.user_id                         object
data.metadata_write_ts              float64
data.distance                         int64
data.speed                            int64
data.heading                          int64
data.idx                              int64
data.mode                             int64
data.section                         object

It seems like a fairly naive fix is, instead of hardcoding the binary columns https://github.com/e-mission/op-admin-dashboard/blob/35fc91dd245e65f086b2a9af4f6a4711cfaec659/utils/db_utils.py#L129

we could just apply(str) to any column of type 'object'. Calling str() on a string won't hurt anything...

>>> str('abc')
'abc'
achasmita commented 10 months ago

I used your method, so now we don't need a separate list of binary columns.

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].apply(str)
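For reference, an equivalent vectorized form (a sketch; same behavior as the loop above, since astype(str) also calls str() on every element):

obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype(str)
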
shankari commented 10 months ago

Submit a PR; I will merge and deploy.

shankari commented 10 months ago

@achasmita while making the change to the PR, please also remember to filter the trajectory by the date range https://github.com/e-mission/op-admin-dashboard/issues/70#issuecomment-1740216079

shankari commented 10 months ago

@achasmita I just checked the trajectories, and I wonder if we could simplify the column list a bit by default.

  1. Since we have data.latitude and data.longitude, we don't need any of the data.loc columns
  2. Since we have the fmt_time, we don't need any of the data.local_dt columns
  3. We should convert the data.mode to a string based on the enum in emission.core.wrapper
  4. For all programs other than the following, we should exclude the user_id field from the table. Can you write a script that will read each JSON file to be modified, add the data_trajectories_columns_exclude: ['user_id'] field to each of them, and write the modified file? You can then do a giant commit that changes all of them. (A rough sketch of such a script follows this list.)
  5. The list of programs that should not exclude the user_id is: ca-ebike, stage-program, stage-study, uprm-civic, uprm-nicr, usaid-laos-ev, uue, washingtoncommons. You should add data_trajectories_columns_exclude: ['user_id'] to all other programs.
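A rough sketch of the requested script (the configs/ directory and the top-level placement of the new field are assumptions; the *.nrel-op.json naming follows the file names mentioned later in this thread):

import json
from pathlib import Path

# programs that should keep user_id, per item 5 above
KEEP_USER_ID = {
    "ca-ebike", "stage-program", "stage-study", "uprm-civic", "uprm-nicr",
    "usaid-laos-ev", "uue", "washingtoncommons",
}

for cfg_path in Path("configs").glob("*.nrel-op.json"):
    program = cfg_path.name.replace(".nrel-op.json", "")
    if program in KEEP_USER_ID:
        continue
    cfg = json.loads(cfg_path.read_text())
    cfg["data_trajectories_columns_exclude"] = ["user_id"]
    cfg_path.write_text(json.dumps(cfg, indent=2) + "\n")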

Please let me know when this is done so that I can move to production.

As an aside, I only see 5000 pages in the stage trajectory table. Does this actually represent the full set of entries? It almost seems like the list is being cut off; I wonder if it is limited by the max number of entries for each query. https://github.com/search?q=repo%3Ae-mission%2Fe-mission-server+result_limit&type=code

We should fix this in the scalability changes as well.

niccolopaganini commented 10 months ago

@shankari I'm kind of confused by the wording

5. The list of programs that should not exclude the user_id is: ca-ebike, stage-program, stage-study, uprm-civic, uprm-nicr, usaid-laos-ev, uue, washingtoncommons. You should add data_trajectories_columns_exclude: ['user_id'] to all other programs

Does this mean that all JSON files get the data_trajectories_columns_exclude: ['user_id'] field? Or just the JSON files mentioned above?

niccolopaganini commented 10 months ago

Or just the JSON files mentioned above?

I tried searching for those as keywords in all the files, but only the file names matched

(ca-ebike.nrel-op.json, stage-study.nrel-op.json, stage-program.nrel-op.json, etc.)

hence the confusion

shankari commented 10 months ago

Does this mean that all JSON files get the

There is one config per program. I have provided the list of configs that need to be modified. Why would all files be changed?

I tried searching as "keywords" in all those files but only the file names matched

I don't understand your question, why are you searching for keywords?

niccolopaganini commented 10 months ago

I was just confused by the wording, that's all. I understood that those JSON files were the point of concern, but then this part:

5. to all other programs

confused me

shankari commented 10 months ago

to all other programs. The programs that are not listed are the other programs.

shankari commented 10 months ago

Closing this since it has been resolved with a manual change from @niccolopaganini; the script is being tracked in a separate PR: https://github.com/e-mission/nrel-openpath-deploy-configs/pull/40