askap-vast / vast-pipeline

This repository holds the code of the Radio Transient detection pipeline for the VAST project.
https://vast-survey.org/vast-pipeline/
MIT License

pandas=1.3.0 groupby sum regression #566

Closed marxide closed 3 years ago

marxide commented 3 years ago

Changes in pandas version 1.3.0 appear to drop non-numeric columns after a .groupby(...).agg('sum') operation. This affects the sky region ideal source coverage calculations. That is all I've found so far, but other areas of the pipeline may also be affected.

For example, consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({
    "source": [1, 1, 1, 2, 2, 3],
    "foo": ["a", "a", "b", "c", "c", "d"],
})
df["foo"] = df["foo"].apply(lambda x: [x,])
print(df)
   source  foo
0       1  [a]
1       1  [a]
2       1  [b]
3       2  [c]
4       2  [c]
5       3  [d]

If we then group by source and sum the columns, the result differs between pandas versions.

print(df.groupby("source").agg('sum'))

The pandas=1.2.4 output (as expected):

              foo
source           
1       [a, a, b]
2          [c, c]
3             [d]

The pandas=1.3.0 output is empty:

Empty DataFrame
Columns: []
Index: [1, 2, 3]

Changing the aggregation to the following appears to fix the issue, and it works in both versions:

print(df.groupby("source").sum(numeric_only=False))
              foo
source           
1       [a, a, b]
2          [c, c]
3             [d]
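
If relying on how sum treats object columns feels fragile, one possible alternative is to concatenate the lists explicitly. The following is only a sketch and is not what the pipeline currently does: the helper name concat_lists_per_group is hypothetical, and it assumes every value in the column is a list. Note that it returns a Series rather than a DataFrame.

```python
from itertools import chain

import pandas as pd


def concat_lists_per_group(df, key, column):
    """Concatenate the list values of `column` within each `key` group.

    Hypothetical helper for illustration; assumes every value in
    `column` is a list.
    """
    return df.groupby(key)[column].apply(
        lambda lists: list(chain.from_iterable(lists))
    )


# Same example data as above.
df = pd.DataFrame({
    "source": [1, 1, 1, 2, 2, 3],
    "foo": ["a", "a", "b", "c", "c", "d"],
})
df["foo"] = df["foo"].apply(lambda x: [x])

result = concat_lists_per_group(df, "source", "foo")
print(result)
```

Because this does not go through the sum aggregation at all, it should not depend on the numeric_only behaviour, though it has not been tested against the pipeline itself.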