Sensed related charts not being generated for open-access-openpath

iantei commented 4 months ago

Currently, there is an issue with generating the sensed charts for open-access-openpath: https://open-access-openpath.nrel.gov/public/

The primary cause is that the cleaned_section_summary column is missing from the expanded_ct dataframe.

Error call stack:

AttributeError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 expanded_ct, file_suffix, quality_text, debug_df = scaffolding.load_viz_notebook_sensor_inference_data(year,
      2                                                                             month,
      3                                                                             program,
      4                                                                             include_test_users,
      5                                                                             sensed_algo_prefix)

File /usr/src/app/saved-notebooks/scaffolding.py:229, in load_viz_notebook_sensor_inference_data(year, month, program, include_test_users, sensed_algo_prefix)
    227 print(f"Expanded_ct columns: \n {expanded_ct.columns}")
    228 if len(expanded_ct) > 0:
--> 229     expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
    230     expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
    231     valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5895 if (
   5896     name not in self._internal_names_set
   5897     and name not in self._metadata
   5898     and name not in self._accessors
   5899     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900 ):
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'cleaned_section_summary'

For the dataset fc_*, which has the issue with creating sensed related charts, the expanded_ct columns are:

 Expanded_ct columns: 
 Index(['source', 'end_ts', 'end_fmt_time', 'end_loc', 'raw_trip', 'start_ts',
       'start_fmt_time', 'start_loc', 'duration', 'distance', 'start_place',
       'end_place', 'cleaned_trip', 'inferred_labels', 'inferred_trip',
       'expectation', 'confidence_threshold', 'expected_trip', 'user_input',
       'start_local_dt_year', 'start_local_dt_month', 'start_local_dt_day',
       'start_local_dt_hour', 'start_local_dt_minute', 'start_local_dt_second',
       'start_local_dt_weekday', 'start_local_dt_timezone',
       'end_local_dt_year', 'end_local_dt_month', 'end_local_dt_day',
       'end_local_dt_hour', 'end_local_dt_minute', 'end_local_dt_second',
       'end_local_dt_weekday', 'end_local_dt_timezone', '_id', 'user_id',
       'metadata_write_ts'],
      dtype='object')

For the dataset openpath_prod_cortezebikes, which doesn't have the issue with creating sensed related charts, the expanded_ct columns are:


Expanded_ct columns: 
 Index(['source', 'end_ts', 'end_fmt_time', 'end_loc', 'raw_trip', 'start_ts',
       'start_fmt_time', 'start_loc', 'duration', 'distance', 'start_place',
       'end_place', 'cleaned_trip', 'inferred_labels', 'inferred_trip',
       'expectation', 'confidence_threshold', 'expected_trip', 'user_input',
       'additions', 'inferred_section_summary', 'cleaned_section_summary',
       'start_local_dt_year', 'start_local_dt_month', 'start_local_dt_day',
       'start_local_dt_hour', 'start_local_dt_minute', 'start_local_dt_second',
       'start_local_dt_weekday', 'start_local_dt_timezone',
       'end_local_dt_year', 'end_local_dt_month', 'end_local_dt_day',
       'end_local_dt_hour', 'end_local_dt_minute', 'end_local_dt_second',
       'end_local_dt_weekday', 'end_local_dt_timezone', '_id', 'user_id',
       'metadata_write_ts'],
      dtype='object')

The columns present in expanded_ct for openpath_prod_cortezebikes but missing for fc_* are listed below:

- 'additions'
- 'inferred_section_summary'
- 'cleaned_section_summary'
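
A minimal sketch of how this comparison could be reproduced, assuming the two column lists are available as plain Python lists. The variable names are hypothetical, and the lists are abbreviated to the columns up to 'user_input' plus the extras, since the remaining columns are identical in both printouts.

```python
# Columns shared by both printouts (abbreviated; the start_local_dt_*,
# end_local_dt_*, _id, user_id and metadata_write_ts columns are the same
# in both datasets and are left out here).
shared_columns = [
    'source', 'end_ts', 'end_fmt_time', 'end_loc', 'raw_trip', 'start_ts',
    'start_fmt_time', 'start_loc', 'duration', 'distance', 'start_place',
    'end_place', 'cleaned_trip', 'inferred_labels', 'inferred_trip',
    'expectation', 'confidence_threshold', 'expected_trip', 'user_input',
]

# Hypothetical names for the two column lists.
fc_columns = list(shared_columns)
cortez_columns = shared_columns + [
    'additions', 'inferred_section_summary', 'cleaned_section_summary',
]

# Set difference: columns the working dataset has that the failing one lacks.
missing_in_fc = sorted(set(cortez_columns) - set(fc_columns))
print(missing_in_fc)
```

With real dataframes, the same set difference can be taken over the two `expanded_ct.columns` indexes.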

iantei commented 4 months ago

Upon further investigation:

def load_all_confirmed_trips(tq):
    agg = esta.TimeSeries.get_aggregate_time_series()
    all_ct = agg.get_data_df("analysis/confirmed_trip", tq)
    print("Loaded all confirmed trips of length %s" % len(all_ct))
    print(f"Columns of all_ct: {all_ct.columns} \n")
    disp.display(all_ct.head())
    return all_ct

The all_ct dataframe doesn't have the additions, inferred_section_summary and cleaned_section_summary columns.

We need to understand why these columns are missing from the data loaded from analysis/confirmed_trip.

iantei commented 4 months ago

Looking into the server-side code:

Inside emission/analysis/userinput/matcher.py

def create_confirmed_entry(ts, tce, confirmed_key, input_key_list):
    # Copy the entry and fill in the new values
    confirmed_object_data = copy.copy(tce["data"])
    # del confirmed_object_dict["_id"]
    # confirmed_object_dict["metadata"]["key"] = confirmed_key
    if (confirmed_key == esda.CONFIRMED_TRIP_KEY):
        confirmed_object_data["expected_trip"] = tce.get_id()
        logging.debug("creating confimed entry from %s" % tce)
        cleaned_trip = ts.get_entry_from_id(esda.CLEANED_TRIP_KEY,
            tce["data"]["cleaned_trip"])
        confirmed_object_data['inferred_section_summary'] = get_section_summary(ts, cleaned_trip, "analysis/inferred_section")
        confirmed_object_data['cleaned_section_summary'] = get_section_summary(ts, cleaned_trip, "analysis/cleaned_section")
    elif (confirmed_key == esda.CONFIRMED_PLACE_KEY):
        confirmed_object_data["cleaned_place"] = tce.get_id()
    confirmed_object_data["user_input"] = \
        get_user_input_dict(ts, tce, input_key_list)
    confirmed_object_data["additions"] = \
        esdt.get_additions_for_timeline_entry_object(ts, tce)
    return ecwe.Entry.create_entry(tce['user_id'], confirmed_key, confirmed_object_data)

We have "additions", "cleaned_section_summary" and "inferred_section_summary" missing. @shankari Could we have access to the server log so we can understand why this is happening?

Abby-Wheelis commented 4 months ago

When we do look at the server logs, I think it would help to look first for the log statements from 'get_section_summary'

Abby-Wheelis commented 4 months ago

Yesterday evening/this morning I had a problem with the sensed notebook on my survey additions branch (see here). This isn't the same error, as it happened later when making the 80% chart, but we should keep an eye out for that case once this error is resolved and when testing the stacked bar chart changes.

Abby-Wheelis commented 4 months ago

Just checked on open-access and the behavior there is different than the error I was working with. In my case, the number of trips (sensed) was ok, but the entire notebook errored out on the number of trips under 80% (sensed) chart. If it was the same error, I would expect to see the first chart, but all of the sensed charts are nulled out and none of them are showing.

iantei commented 4 months ago

Tried to load the dataset into Mongo using the script below:

bash viz_scripts/docker/load_mongodump.sh <mongodump_file>

for the snapshot of the open-access dataset from April 24. The dataset is considerably large, i.e. ~4.4 GB.

With the resources maxed out at 16 GB of container memory and 10 cores of container CPU, the entire dataset could not be loaded, resulting in the case below:

[Screenshots: terminal output (left) and Docker resource profile chart (right)]

As the resource usage chart on the right shows, the script exited early because it reached the container memory limit.

Next: trying with the container CPU allocation raised to 16 cores.

shankari commented 4 months ago

Please see the workaround for loading less data for testing the public dashboard

iantei commented 4 months ago

Error Stack:

Is there any cleaned_section_summary which has NaN values?: True
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 expanded_ct_sensed, file_suffix_sensed, quality_text_sensed, debug_df_sensed = scaffolding.load_viz_notebook_sensor_inference_data(year,
      2                                                                             month,
      3                                                                             program,
      4                                                                             include_test_users,
      5                                                                             sensed_algo_prefix)

File /usr/src/app/saved-notebooks/scaffolding.py:246, in load_viz_notebook_sensor_inference_data(year, month, program, include_test_users, sensed_algo_prefix)
    242 if len(expanded_ct) > 0:
    244     print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")
--> 246     expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
    247     expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
    248     valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/series.py:4771, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4661 def apply(
   4662     self,
   4663     func: AggFuncType,
   (...)
   4666     **kwargs,
   4667 ) -> DataFrame | Series:
   4668     """
   4669     Invoke function on values of Series.
   4670 
   (...)
   4769     dtype: float64
   4770     """
-> 4771     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/apply.py:1123, in SeriesApply.apply(self)
   1120     return self.apply_str()
   1122 # self.f is Callable
-> 1123 return self.apply_standard()

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/apply.py:1174, in SeriesApply.apply_standard(self)
   1172     else:
   1173         values = obj.astype(object)._values
-> 1174         mapped = lib.map_infer(
   1175             values,
   1176             f,
   1177             convert=self.convert_dtype,
   1178         )
   1180 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1181     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1182     #  See also GH#25959 regarding EA support
   1183     return obj._constructor_expanddim(list(mapped), index=obj.index)

File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()

File /usr/src/app/saved-notebooks/scaffolding.py:246, in load_viz_notebook_sensor_inference_data.<locals>.<lambda>(md)
    242 if len(expanded_ct) > 0:
    244     print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")
--> 246     expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
    247     expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
    248     valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]

TypeError: 'float' object is not subscriptable

Added the following line of code before the failing statement:

print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")

Result: Is there any cleaned_section_summary which has NaN values?: True. This shows that cleaned_section_summary contains NaN values, so the lambda that subscripts each value with md["distance"] fails on them with the error above.

This seems identical to the issue described here: Issue 93
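
A minimal sketch of why a NaN entry produces this TypeError, using the same lambda as in the traceback. It runs without pandas: a missing value in an object column is a float NaN, so applying the lambda to a plain float shows the same failure. The sample summaries are hypothetical values, not data from the deployment.

```python
# Same lambda as on line 246 of scaffolding.py in the traceback above.
primary_mode = lambda md: max(md["distance"], key=md["distance"].get)

# Hypothetical sample values: one populated section summary and one
# missing value, represented the way pandas stores it (float NaN).
summaries = [
    {"distance": {"WALKING": 120.0, "IN_VEHICLE": 4300.0}},
    float("nan"),
]

results = []
for md in summaries:
    try:
        results.append(primary_mode(md))
    except TypeError as e:
        # A float cannot be subscripted with ["distance"].
        results.append(f"TypeError: {e}")

print(results)
```

The first entry resolves to the mode with the largest distance, while the NaN entry fails at md["distance"] before max() is ever called.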

iantei commented 4 months ago

Proposal for solution:

expanded_ct = participant_ct_df.copy()
expanded_ct = expanded_ct.dropna(subset=['cleaned_section_summary'])

Create a copy of participant_ct_df such that the original df is not modified.
Drop the rows from the data frame wherever cleaned_section_summary is NaN.

shankari commented 4 months ago

@iantei dropna will just paper over the real issue. The cleaned_section_summary should always exist. You can:

- see if there are patterns around missing section summaries - maybe the backwards compat code was not executed on this deployment
- run the pipeline on the snapshot to see where it fails

iantei commented 4 months ago

There are 3878 records which have NaN for cleaned_section_summary.

Script to list the end_fmt_time of the rows where cleaned_section_summary is NaN, in sorted order:

```
# Select the rows where cleaned_section_summary is missing
nan_rows = participant_ct_df[participant_ct_df['cleaned_section_summary'].isna()]
print(len(nan_rows))

# Collect the end_fmt_time of each such row
end_fmt_times = []
for index, row in nan_rows.iterrows():
    end_fmt_times.append(row['end_fmt_time'])
end_fmt_times.sort()

# Print the sorted list
for timestamp in end_fmt_times:
    print(timestamp)
```

A pattern can be observed:

2022-07-07T20:52:04.129278-07:00
2022-07-07T21:46:43.999819-07:00
...
2023-08-04T15:10:53.000056-04:00
2023-08-04T15:15:06.000034-04:00
2023-08-04T17:28:29.755166-04:00
2023-08-04T18:45:00.000004-04:00

All of these entries have an end_fmt_time earlier than the delivery of #92, which was delivered on 11th September 2023. This strongly suggests the possibility you mentioned: the backwards compat code was not executed on this deployment.
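
A minimal sketch of how the cutoff comparison could be checked programmatically rather than by eye. It uses only the six sample timestamps shown above, not the full 3878 records, and assumes midnight UTC on 11 September 2023 as the cutoff, since only the date is given.

```python
from datetime import datetime, timezone

# Sample timestamps copied from the listing above (the elided ones are omitted).
end_fmt_times = [
    "2022-07-07T20:52:04.129278-07:00",
    "2022-07-07T21:46:43.999819-07:00",
    "2023-08-04T15:10:53.000056-04:00",
    "2023-08-04T15:15:06.000034-04:00",
    "2023-08-04T17:28:29.755166-04:00",
    "2023-08-04T18:45:00.000004-04:00",
]

# Assumption: the time of day and timezone of the cutoff are not stated,
# so midnight UTC on the delivery date is used here.
cutoff = datetime(2023, 9, 11, tzinfo=timezone.utc)

# Parse into timezone-aware datetimes so that entries with different
# UTC offsets compare correctly.
parsed = [datetime.fromisoformat(ts) for ts in end_fmt_times]
after_cutoff = [dt for dt in parsed if dt >= cutoff]

print(f"{len(parsed) - len(after_cutoff)} of {len(parsed)} entries are before the cutoff")
print("latest entry:", max(parsed).isoformat())
```

Parsing matters here because the strings carry different UTC offsets, so sorting them as text is not guaranteed to match chronological order.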

e-mission / em-public-dashboard

Sensed related charts not being generated for open-access-openpath #132