Closed asnaylor closed 4 years ago
Hmm, probably the behaviour for strings is because it notices that the dtype of that column is object and that the object type is iterable. That should be easy to solve.
And then the duplication you see is a consequence of this. The explode function itself works to arbitrary jaggedness, provided all columns can be broadcast to take that same jaggedness. That's the case here because each row on the input is a list of items where each list is the same length as the others in the same row. However, one of your columns contains lists where each item is a string. Since the strings are identified as iterable objects, the other lists are "broadcast" to match the jaggedness of the strings. This broadcasting behaviour is identical to how NumPy handles broadcasting between a 1D array and 2D array: it duplicates the values to match the dimensionality. As final confirmation of this, you can see that the pulsesTPC.classification column after explode is giving you the ASCII character code for each character in the strings "S2" and then "S1": "S2" = 83 then 50, "S1" = 83 then 49.
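As a small illustration of the explanation above (a sketch, not the fast_carpenter code itself): iterating over a Python bytes string yields integer character codes, which is where 83, 50 and 49 come from, and np.broadcast_to shows the kind of value duplication described for broadcasting. That the classification values arrive as bytes is an assumption here, and the sample values and variable names are made up.

```python
import numpy as np

# Assumption: the strings arrive as bytes objects, which iterate as integers.
codes_s2 = list(b"S2")
codes_s1 = list(b"S1")

# The same codes obtained from a regular str via ord().
codes_from_str = [ord(ch) for ch in "S2"]

# Broadcasting a (2, 1) array to a (2, 2) shape repeats each value
# along the new axis, which is the duplication described above.
per_item = np.array([[10], [20]])  # made-up values
broadcasted = np.broadcast_to(per_item, (2, 2))

print(codes_s2, codes_s1)    # [83, 50] [83, 49]
print(broadcasted.tolist())  # [[10, 10], [20, 20]]
```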
So really this is just one issue: strings are being interpreted as an iterable object. I'm going to change the name of this issue to reflect it. Thanks for reporting!
This should be fixed by PR #110 now, and is part of the new release on PyPI, ~v0.16.1. I'm going to close this then, but let me know if you see anything like this again!
I've updated fast_carpenter to the latest version on pip, 0.17.1, but now I'm getting a KeyError when I run explode on the dataframe:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-11-15dcacf08c67> in <module>
23 #get the events
24 for df in uproot.pandas.iterate(fnames, b'Events', branch_names,namedecode="utf-8", flatten=False):
---> 25 df_list.append(explode(df))
26
~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/fast_carpenter/summary/binned_dataframe.py in explode(df)
275 lst_cols = [col for col, dtype in df.dtypes.items() if is_object_dtype(dtype)]
276 # Be more specific about which objects are ok
--> 277 lst_cols = [col for col in lst_cols if isinstance(df[col][0], _explodable_types)]
278 if not lst_cols:
279 return df
~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/fast_carpenter/summary/binned_dataframe.py in <listcomp>(.0)
275 lst_cols = [col for col, dtype in df.dtypes.items() if is_object_dtype(dtype)]
276 # Be more specific about which objects are ok
--> 277 lst_cols = [col for col in lst_cols if isinstance(df[col][0], _explodable_types)]
278 if not lst_cols:
279 return df
~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
869 key = com.apply_if_callable(key, self)
870 try:
--> 871 result = self.index.get_value(self, key)
872
873 if not is_scalar(result):
~/.conda/envs/lz-ml-env/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4403 k = self._convert_scalar_indexer(k, kind="getitem")
4404 try:
-> 4405 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4406 except KeyError as e1:
4407 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 0
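One possible reading of this traceback, which is an assumption and not something confirmed in the thread: df[col][0] on a Series with an integer index is a label lookup, so it raises KeyError: 0 whenever the index has no label 0 (for example, a chunk whose index does not start at 0). The sketch below reproduces that behaviour with a made-up DataFrame; the column name and index values are hypothetical.

```python
import pandas as pd

# Made-up chunk whose integer index does not contain the label 0.
df = pd.DataFrame({"x": [[1, 2], [3]]}, index=[100, 101])

# Label-based lookup: there is no row labelled 0, so this raises KeyError.
try:
    df["x"][0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based lookup: always returns the first row, whatever its label.
first = df["x"].iloc[0]

print(label_lookup_failed)  # True
print(first)                # [1, 2]
```

If that is indeed the cause, a positional lookup such as .iloc[0] in the isinstance check would not depend on the index labels.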
When I use explode (from fast_carpenter.summary.binned_dataframe), it changes a vector of string type to numbers. It also doesn't explode properly: it duplicates entries. For example, this is what the data looked like before using explode [using uproot.pandas.iterate to extract ROOT data with flatten=False]:
And when I use explode: