Open iantei opened 7 months ago
Upon further investigation:

    def load_all_confirmed_trips(tq):
        agg = esta.TimeSeries.get_aggregate_time_series()
        all_ct = agg.get_data_df("analysis/confirmed_trip", tq)
        print("Loaded all confirmed trips of length %s" % len(all_ct))
        print(f"Columns of all_ct: {all_ct.columns} \n")
        disp.display(all_ct.head())
        return all_ct

The `all_ct` data frame doesn't have the `additions`, `inferred_section_summary` and `cleaned_section_summary` columns.
We need to understand why these columns are missing from the data coming from `analysis/confirmed_trip`.
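
To check whether the fields are absent from every stored entry or only from some, a per-field count can help. The sketch below is illustrative only. It assumes the entries are available as plain dicts with a `data` sub-dict; `count_field_presence` and the sample entries are hypothetical and not part of the e-mission code.

```python
def count_field_presence(entries, fields):
    """Count how many entries carry each of the given fields in entry["data"]."""
    counts = {field: 0 for field in fields}
    for entry in entries:
        data = entry.get("data", {})
        for field in fields:
            if field in data:
                counts[field] += 1
    return counts

# Hypothetical sample entries, for illustration only
sample_entries = [
    {"data": {"user_input": {}, "additions": [], "cleaned_section_summary": {}}},
    {"data": {"user_input": {}}},
]
expected_fields = ["additions", "inferred_section_summary", "cleaned_section_summary"]
print(count_field_presence(sample_entries, expected_fields))
```

A count of zero for a field would match a column that is missing entirely, while a count below the number of entries would show up as NaN values in the data frame.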
Looking into the server-side code, inside `emission/analysis/userinput/matcher.py`:

    def create_confirmed_entry(ts, tce, confirmed_key, input_key_list):
        # Copy the entry and fill in the new values
        confirmed_object_data = copy.copy(tce["data"])
        # del confirmed_object_dict["_id"]
        # confirmed_object_dict["metadata"]["key"] = confirmed_key
        if (confirmed_key == esda.CONFIRMED_TRIP_KEY):
            confirmed_object_data["expected_trip"] = tce.get_id()
            logging.debug("creating confimed entry from %s" % tce)
            cleaned_trip = ts.get_entry_from_id(esda.CLEANED_TRIP_KEY,
                tce["data"]["cleaned_trip"])
            confirmed_object_data['inferred_section_summary'] = get_section_summary(ts, cleaned_trip, "analysis/inferred_section")
            confirmed_object_data['cleaned_section_summary'] = get_section_summary(ts, cleaned_trip, "analysis/cleaned_section")
        elif (confirmed_key == esda.CONFIRMED_PLACE_KEY):
            confirmed_object_data["cleaned_place"] = tce.get_id()
        confirmed_object_data["user_input"] = \
            get_user_input_dict(ts, tce, input_key_list)
        confirmed_object_data["additions"] = \
            esdt.get_additions_for_timeline_entry_object(ts, tce)
        return ecwe.Entry.create_entry(tce['user_id'], confirmed_key, confirmed_object_data)

We have "additions", "cleaned_section_summary" and "inferred_section_summary" missing. @shankari Could we have access to the server log so we can understand why this is happening?
When we do look at the server logs, I think it would help to look first for the log statements from 'get_section_summary'
Yesterday evening/this morning I had a problem with the sensed notebook on my survey additions branch (see here). This isn't the same error, as it happened later, when making the 80% chart, but we should keep an eye out for that case once this error is resolved and when testing the stacked bar chart changes.
Just checked on `open-access`, and the behavior there is different from the error I was working with. In my case, the `number of trips (sensed)` chart was OK, but the entire notebook errored out on the `number of trips under 80% (sensed)` chart. If it was the same error, I would expect to see the first chart, but all of the sensed charts are nulled out and none of them are showing.
Tried to load the April 24 snapshot of the `open-access` dataset into Mongo, using the script below:

    bash viz_scripts/docker/load_mongodump.sh <mongodump_file>

The dataset is considerably large, about 4.4 GB.
With the container resources maxed out at 16 GB of memory and 10 CPU cores, the entire dataset still could not be loaded, resulting in the case below:
[Screenshots: terminal output (left), Docker resource profile chart (right)]
As the resource usage on the right shows, the script exited early because it reached the container's memory limit.
Next: try again with 16 CPU cores allocated to the container.
Please see the workaround for loading less data for testing the public dashboard
Error Stack:
Is there any cleaned_section_summary which has NaN values?: True
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 expanded_ct_sensed, file_suffix_sensed, quality_text_sensed, debug_df_sensed = scaffolding.load_viz_notebook_sensor_inference_data(year,
2 month,
3 program,
4 include_test_users,
5 sensed_algo_prefix)
File /usr/src/app/saved-notebooks/scaffolding.py:246, in load_viz_notebook_sensor_inference_data(year, month, program, include_test_users, sensed_algo_prefix)
242 if len(expanded_ct) > 0:
244 print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")
--> 246 expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
247 expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
248 valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]
File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/series.py:4771, in Series.apply(self, func, convert_dtype, args, **kwargs)
4661 def apply(
4662 self,
4663 func: AggFuncType,
(...)
4666 **kwargs,
4667 ) -> DataFrame | Series:
4668 """
4669 Invoke function on values of Series.
4670
(...)
4769 dtype: float64
4770 """
-> 4771 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/apply.py:1123, in SeriesApply.apply(self)
1120 return self.apply_str()
1122 # self.f is Callable
-> 1123 return self.apply_standard()
File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/apply.py:1174, in SeriesApply.apply_standard(self)
1172 else:
1173 values = obj.astype(object)._values
-> 1174 mapped = lib.map_infer(
1175 values,
1176 f,
1177 convert=self.convert_dtype,
1178 )
1180 if len(mapped) and isinstance(mapped[0], ABCSeries):
1181 # GH#43986 Need to do list(mapped) in order to get treated as nested
1182 # See also GH#25959 regarding EA support
1183 return obj._constructor_expanddim(list(mapped), index=obj.index)
File ~/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()
File /usr/src/app/saved-notebooks/scaffolding.py:246, in load_viz_notebook_sensor_inference_data.<locals>.<lambda>(md)
242 if len(expanded_ct) > 0:
244 print(f"Is there any cleaned_section_summary which has NaN values?: {participant_ct_df['cleaned_section_summary'].isna().any()}")
--> 246 expanded_ct["primary_mode_non_other"] = participant_ct_df.cleaned_section_summary.apply(lambda md: max(md["distance"], key=md["distance"].get))
247 expanded_ct.primary_mode_non_other.replace({"ON_FOOT": "WALKING"}, inplace=True)
248 valid_sensed_modes = ["WALKING", "BICYCLING", "IN_VEHICLE", "AIR_OR_HSR", "UNKNOWN"]
TypeError: 'float' object is not subscriptable
Added the line below to the code:

    {participant_ct_df['cleaned_section_summary'].isna().any()}

Result: `Is there any cleaned_section_summary which has NaN values?: True`
This shows that there are NaN values in `cleaned_section_summary`. NaN is a float, so the `md["distance"]` lookup in the lambda fails on those rows with the `TypeError` shown above.
This seems identical to the issue described here: Issue 93
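
The failure can be reproduced without the dashboard code. The following is a minimal sketch, assuming a summary is a dict with a `distance` sub-dict keyed by mode, as the lambda in the traceback implies. `primary_mode` and the sample values are hypothetical.

```python
def primary_mode(summary):
    """Return the mode with the largest distance, as the lambda in scaffolding.py does."""
    return max(summary["distance"], key=summary["distance"].get)

# Hypothetical values: one populated summary and one missing one.
# pandas represents a missing dict-valued cell as float NaN.
good_summary = {"distance": {"WALKING": 120.0, "IN_VEHICLE": 4300.0}}
missing_summary = float("nan")

print(primary_mode(good_summary))

try:
    primary_mode(missing_summary)
except TypeError as e:
    # Same error class as in the traceback: a float cannot be subscripted
    print(f"TypeError: {e}")
```

Any row whose summary is missing will therefore abort the whole `apply` call, which is why a single NaN is enough to null out every sensed chart.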
Proposal for solution:

    expanded_ct = participant_ct_df.copy()
    expanded_ct = expanded_ct.dropna(subset=['cleaned_section_summary'])

This drops the rows where `cleaned_section_summary` is NaN.
@iantei `dropna` will just paper over the real issue. The `cleaned_section_summary` should always exist.
You can:
There are 3878 records that have NaN for `cleaned_section_summary`.
```
nan_rows = participant_ct_df[participant_ct_df['cleaned_section_summary'].isna()]
print(len(nan_rows))
end_fmt_times = []
for index, row in nan_rows.iterrows():
    end_fmt_times.append(row['end_fmt_time'])
end_fmt_times.sort()
# Print the sorted list
for timestamp in end_fmt_times:
    print(timestamp)
```
There is a pattern in the output:
2022-07-07T20:52:04.129278-07:00
2022-07-07T21:46:43.999819-07:00
...
2023-08-04T15:10:53.000056-04:00
2023-08-04T15:15:06.000034-04:00
2023-08-04T17:28:29.755166-04:00
2023-08-04T18:45:00.000004-04:00
All of these entries have an `end_fmt_time` earlier than the delivery of #92, which was delivered on 11th September 2023.
This strongly suggests that, as you mentioned, the backwards-compat code was not executed on this deployment.
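
The "all before the cutoff" observation can be checked in code rather than by eye. This is a rough sketch: it assumes the cutoff is midnight UTC on 11 September 2023 (the exact delivery time is not stated above), and `all_before_cutoff` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Assumption: treat the delivery of #92 as midnight UTC on 11 September 2023
CUTOFF = datetime(2023, 9, 11, tzinfo=timezone.utc)

def all_before_cutoff(fmt_times, cutoff=CUTOFF):
    """Return True if every ISO 8601 timestamp string is earlier than the cutoff."""
    return all(datetime.fromisoformat(t) < cutoff for t in fmt_times)

# First and last timestamps from the sorted output above
sample_times = [
    "2022-07-07T20:52:04.129278-07:00",
    "2023-08-04T18:45:00.000004-04:00",
]
print(all_before_cutoff(sample_times))
```

In the notebook, the sorted `end_fmt_times` list could be passed in place of `sample_times`.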
Currently, there is an issue with the generation of the open-access-openpath dashboard: https://open-access-openpath.nrel.gov/public/
The primary reason is that the `cleaned_section_summary` column is unavailable in the `expanded_ct` dataframe.
Error call stack:
For the `fc_*` dataset, which has issues creating the sensed-related charts, below are the `expanded_ct` columns:
For the `openpath_prod_cortezebikes` dataset, which doesn't have issues creating the sensed-related charts:
The differences in the `expanded_ct` columns between these two datasets are listed below:
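
A comparison like this can be produced with a set difference over the two column lists. The sketch below uses hypothetical column lists for illustration only; in the notebook, `expanded_ct.columns` from each dataset could be passed in instead.

```python
def column_diff(cols_a, cols_b):
    """Return (columns only in cols_a, columns only in cols_b), each sorted."""
    set_a, set_b = set(cols_a), set(cols_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)

# Hypothetical column lists, not the actual columns of either dataset
failing_cols = ["user_id", "end_fmt_time", "user_input"]
working_cols = [
    "user_id", "end_fmt_time", "user_input",
    "additions", "cleaned_section_summary", "inferred_section_summary",
]

only_failing, only_working = column_diff(failing_cols, working_cols)
print("Only in failing dataset:", only_failing)
print("Only in working dataset:", only_working)
```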