CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Groupby+sum aggregate turns TensorArray into array of arrays #124

Closed by frreiss 3 years ago

frreiss commented 4 years ago

Applying a sum aggregate inside a groupby operation produces a TensorArray whose backing array is an object array of arrays instead of a single n-dimensional array.

Code to reproduce:

    import text_extensions_for_pandas as tp
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "a": ["foo", "bar"],
        "b": tp.TensorArray(np.array([[1, 2], [3, 4]]))
    })
    result = df.groupby("a").aggregate({"b": "sum"})
    print(result["b"].array)

Output:

array([array([3, 4]), array([1, 2])], dtype=object)

Expected output:

array([[3, 4], [1, 2]], dtype=int)
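
To make the difference between the two outputs concrete, here is a minimal NumPy-only sketch. It does not use TensorArray itself, and the names `nested` and `stacked` are made up for illustration. It builds a 1-D object array whose elements are arrays, like the reported output, and then stacks those elements into a single 2-D array, like the expected output.

```python
import numpy as np

# 1-D object array whose two elements are themselves arrays,
# mirroring the shape of the reported output.
nested = np.empty(2, dtype=object)
nested[0] = np.array([3, 4])
nested[1] = np.array([1, 2])

# A single n-dimensional array built from the same rows,
# mirroring the expected output.
stacked = np.stack(list(nested))

print(nested.shape, nested.dtype)  # one dimension, object dtype
print(stacked.shape)               # two dimensions
```

The object array has only one dimension that NumPy knows about, so shape-aware operations cannot see the inner rows, while the stacked array has a real second dimension.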

BryanCutler commented 4 years ago

@frreiss I couldn't reproduce. The output I get is a TensorArray, and that's what it should be, right? There is also a similar test for this already.

In [7]: df = pd.DataFrame({
   ...:     "a": ["foo", "bar"],
   ...:     "b": tp.TensorArray(np.array([[1, 2], [3, 4]]))
   ...: })
   ...: result = df.groupby("a").aggregate({"b": "sum"})
   ...: result["b"]
Out[7]: 
a
bar    [3 4]
foo    [1 2]
Name: b, dtype: TensorDtype

BryanCutler commented 4 years ago

I wonder if it had something to do with some recent changes, would you mind trying it again?

frreiss commented 3 years ago

Tried again. The bug is still there. One note though: The last line of the repro script should read:

print(repr(result["b"].array))

(calling __repr__ instead of __str__ to get output that lists the dtype of the array).
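
As a rough illustration of why the distinction matters, the sketch below uses a plain NumPy object array rather than a TensorArray (an assumption made so the example stays self-contained; the name `nested` is hypothetical). For NumPy arrays, the `str` form leaves out the dtype, so an object-dtype array of arrays is easy to overlook, while the `repr` form spells it out.

```python
import numpy as np

# Plain NumPy stand-in for the problematic backing array:
# a 1-D object array whose elements are arrays.
nested = np.empty(2, dtype=object)
nested[0] = np.array([3, 4])
nested[1] = np.array([1, 2])

as_str = str(nested)    # informal form, no dtype shown
as_repr = repr(nested)  # detailed form, includes dtype=object

print(as_str)
print(as_repr)
```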

BryanCutler commented 3 years ago

Ok, I'll take another look

BryanCutler commented 3 years ago

Oh yeah, I can reproduce now. I should have looked at it more closely. Thanks for the note, fixing it now.