carlomazzaferro / scikit-hts-examples

Example usage of scikit-hts
MIT License

need some clarification with `from_geo_events` #9

Open Casyfill opened 3 years ago

Casyfill commented 3 years ago

I am trying to reuse the geo notebook, as I have a very similar problem.

However, I couldn't get it to run: I get the following error when I call `HierarchyTree.from_geo_events`:

>>> ht = HierarchyTree.from_geo_events(df=full2020.copy(),
...                                    lat_col='lat',
...                                    lon_col='lon',
...                                    nodes=('universe', 'part', 'town', 'subarea'),
...                                    resample_freq='1M',
...                                    min_count=.5
...                                   )
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-153-d57c8beffa60> in <module>
      4                                    nodes=('universe', 'part', 'town','subarea'),
      5                                    resample_freq='1M',
----> 6                                    min_count=.5
      7                                   )

~/anaconda3/envs/py37/lib/python3.7/site-packages/hts/hierarchy/__init__.py in from_geo_events(cls, df, lat_col, lon_col, nodes, levels, resample_freq, min_count, root_name, fillna)
     59                            freq=resample_freq,
     60                            min_count=min_count,
---> 61                            total=total)
     62         # TODO: more flexible strategy
     63         if fillna:

~/anaconda3/envs/py37/lib/python3.7/site-packages/hts/hierarchy/utils.py in groupify(root_node, df, freq, nodes, min_count, total)
     96             if len(sub_df) < allowance:
     97                 continue
---> 98             parent_name = sub_df[parent_group].value_counts().index[0]
     99             resampled = resample_count(sub_df, freq, child)
    100             for c in root_node.traversal_level():

~/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
   4093         if is_scalar(key):
   4094             key = com.cast_scalar_indexer(key, warn_float=True)
-> 4095             return getitem(key)
   4096 
   4097         if isinstance(key, slice):

IndexError: index 0 is out of bounds for axis 0 with size 0

Do you know what could be the reason for that?
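
For what it's worth, the failing expression from `groupify` (`sub_df[parent_group].value_counts().index[0]`) can be reproduced in isolation. The sketch below is only a guess at one way to trigger it: the toy frame and its values are made up, and I am assuming (not confirming) that something similar happens with my data, for example a parent column that is entirely NaN for some group.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `sub_df`: rows exist, but the parent column is all NaN.
sub_df = pd.DataFrame({"part": [np.nan, np.nan], "town": ["a", "a"]})

# value_counts() drops NaN by default, so the result is empty
# even though the frame itself has rows.
counts = sub_df["part"].value_counts()
print(len(sub_df), len(counts))

error_message = None
try:
    parent_name = counts.index[0]  # same access pattern as utils.py line 98
except IndexError as exc:
    error_message = str(exc)
    print(error_message)
```

If this is the cause, the `len(sub_df) < allowance` check would not catch it, since it looks at the number of rows rather than the number of non-null parent values.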

On top of that, I have a few questions:

1. I wonder if it is possible to create a tree without any of the lat/lon/h3 methods, e.g. if I am satisfied with my columnar hierarchy (universe -> part -> town -> subarea) on its own.
2. I also noticed that even though the function fails, my original dataframe is modified and gains a few h3-related columns. Do you mind if I open a PR with some tweaks so that the original dataframe is not affected unless `inplace=True`?
3. In the notebook you use `min_count=0.5`. What does it mean? That the count should be greater than zero for a given node?
4. It seems the documentation for the geo function is somewhat lacking. Shall we add docs on how `fillna` works in this context? Do you mind if I take a stab at that?
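
To make the copy-versus-inplace proposal concrete, here is a minimal sketch of the pattern I have in mind. The function name `add_hex_columns`, the `hex_index_{level}` column names, and the `inplace` parameter are all hypothetical and only illustrate the idea. This is not the library's current code.

```python
import pandas as pd


def add_hex_columns(df, levels=(6, 7), inplace=False):
    """Hypothetical helper: add derived columns, copying unless inplace=True."""
    target = df if inplace else df.copy()
    for level in levels:
        # Placeholder value; real code would compute a geo index per row here.
        target[f"hex_index_{level}"] = None
    return target


original = pd.DataFrame({"lat": [40.7], "lon": [-74.0]})
result = add_hex_columns(original)

print(list(original.columns))  # the caller's frame keeps only its own columns
print(list(result.columns))    # the returned frame carries the derived columns
```

With this shape, a failure partway through would leave the caller's dataframe untouched by default, and anyone who wants the current behaviour could still pass `inplace=True`.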