FAST-HEP / fast-carpenter

Helping turn your trees into tables (i.e. reads ROOT TTrees, writes summary Pandas DataFrames)
https://fast-hep.web.cern.ch

BUG: `explode` function treats strings as iterable objects and explodes them too #109

Closed asnaylor closed 4 years ago

asnaylor commented 4 years ago

When I use `explode` (from fast_carpenter.summary.binned_dataframe), it turns a vector of strings into numbers. It also does not explode properly: it duplicates entries.

For example, this is what the data looked like before using explode (extracted from the ROOT file with uproot.pandas.iterate and flatten=False):

pulsesTPC.pulseArea_phd   pulsesTPC.rmsWidth_ns   pulsesTPC.classification   pulsesTPC.nPulses
[847.71246, 2.4795532]    [960, 4974]             [b'S2', b'S1']             2
[1128.1444, 2.0983887]    [1106, 5420]            [b'S2', b'S1']             2

And when I use explode:

pulsesTPC.nPulses   pulsesTPC.pulseArea_phd   pulsesTPC.rmsWidth_ns   pulsesTPC.classification
2                   847.712463                960                     83
2                   847.712463                960                     50
2                   2.479553                  4974                    83
2                   2.479553                  4974                    49
2                   1128.144409               1106                    83
2                   1128.144409               1106                    50
2                   2.098389                  5420                    83
2                   2.098389                  5420                    49

benkrikler commented 4 years ago

Hmm, the behaviour for strings is probably because explode notices that the dtype of that column is object and that the object type is iterable. That should be easy to solve.

And then the duplication you see is a consequence of this. The explode function itself works to arbitrary jaggedness, provided all columns can be broadcast to that same jaggedness. That's the case here because each row of the input is a list of items, and each list has the same length as the others in the same row. However, one of your columns contains lists where each item is a string. Since the strings are identified as iterable objects, the other lists are "broadcast" to match the jaggedness of the strings. This broadcasting behaviour is identical to how NumPy handles broadcasting between a 1D array and a 2D array: it duplicates the values to match the dimensionality. As final confirmation of this, you can see that the pulsesTPC.classification column after explode gives the ASCII code for each character in the strings "S2" and then "S1": "S2" = 83 then 50, "S1" = 83 then 49.
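
To make the mechanism concrete, here is a minimal pure-Python sketch. It is not the fast_carpenter implementation: `naive_explode` is a hypothetical helper that only mimics a type-blind explode on a single row, using the first row of the data above. It shows that iterating over a bytes string yields integer character codes, and that the sibling value is repeated once per character.

```python
# Iterating over bytes yields integer byte values, not characters.
codes = [list(label) for label in (b"S2", b"S1")]


def naive_explode(areas, labels):
    """Hypothetical type-blind explode of one row (illustration only)."""
    rows = []
    for area, label in zip(areas, labels):
        # bytes is iterable, so the explode descends one level too far...
        for code in label:
            # ...and the area value is repeated for every byte in the label.
            rows.append((area, code))
    return rows


rows = naive_explode([847.71246, 2.4795532], [b"S2", b"S1"])
for row in rows:
    print(row)
```

Each area appears twice, paired with 83 and 50 for b"S2" and with 83 and 49 for b"S1", which is the pattern in the exploded table above.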

So really this is just one issue: strings are being interpreted as an iterable object. I'm going to change the name of this issue to reflect it. Thanks for reporting!
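
One way such a check could look, as a sketch only: the helper name `is_explodable` and the choice of excluded types are assumptions for illustration, not the code that was actually merged into fast_carpenter.

```python
from collections.abc import Iterable

# Assumption: string-like types should be treated as scalars, never exploded.
STRING_LIKE = (str, bytes, bytearray)


def is_explodable(value):
    """Return True for iterable containers, False for strings and scalars."""
    if isinstance(value, STRING_LIKE):
        return False
    return isinstance(value, Iterable)


print(is_explodable([b"S2", b"S1"]))  # a list of labels should be exploded
print(is_explodable(b"S2"))           # a single label should not
```

Excluding the string-like types explicitly keeps the check open to other containers (tuples, arrays), at the cost of also accepting iterables such as dicts, which a real implementation may want to reject.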

benkrikler commented 4 years ago

This should be fixed by PR #110 now, which is part of the new release on PyPI, ~v0.16.1. I'm going to close this then, but let me know if you see anything like this again!

asnaylor commented 4 years ago

I've updated fast_carpenter to the latest version on pip (0.17.1), but now I'm getting a KeyError when I run explode on the dataframe:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-11-15dcacf08c67> in <module>
     23 #get the events
     24 for df in uproot.pandas.iterate(fnames, b'Events', branch_names,namedecode="utf-8", flatten=False):
---> 25     df_list.append(explode(df))
     26

~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/fast_carpenter/summary/binned_dataframe.py in explode(df)
    275     lst_cols = [col for col, dtype in df.dtypes.items() if is_object_dtype(dtype)]
    276     # Be more specific about which objects are ok
--> 277     lst_cols = [col for col in lst_cols if isinstance(df[col][0], _explodable_types)]
    278     if not lst_cols:
    279         return df

~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/fast_carpenter/summary/binned_dataframe.py in <listcomp>(.0)
    275     lst_cols = [col for col, dtype in df.dtypes.items() if is_object_dtype(dtype)]
    276     # Be more specific about which objects are ok
--> 277     lst_cols = [col for col in lst_cols if isinstance(df[col][0], _explodable_types)]
    278     if not lst_cols:
    279         return df

~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    869         key = com.apply_if_callable(key, self)
    870         try:
--> 871             result = self.index.get_value(self, key)
    872 
    873             if not is_scalar(result):

~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4403         k = self._convert_scalar_indexer(k, kind="getitem")
   4404         try:
-> 4405             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4406         except KeyError as e1:
   4407             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0
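
One plausible reading of this traceback, not confirmed in the thread: `df[col][0]` on line 277 is a label-based lookup, so it fails whenever the DataFrame's index has no row labelled 0. The sketch below assumes a chunk whose index does not start at 0 (for example a later chunk from an iterator, or a filtered DataFrame). The data and names are made up for illustration.

```python
import pandas as pd

# Hypothetical chunk whose index starts at 2 rather than 0.
chunk = pd.DataFrame(
    {"classification": [[b"S2", b"S1"], [b"S1"]]},
    index=[2, 3],
)
col = chunk["classification"]

try:
    col[0]  # label-based: looks for the index label 0, which does not exist
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

first = col.iloc[0]  # position-based: first row regardless of its label

print(label_lookup_failed, first)
```

If this is the cause, taking the first element by position (`.iloc[0]`) instead of by label would avoid the error. An empty chunk would still need separate handling.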