Closed slsongge closed 4 years ago
df_gb_cut_sum = df_raw.groupby('cut')['n'].agg(['sum']).reset_index() df_gb_cut_color_sum = df_raw.groupby(['cut','color'])['n'].agg(['sum']).reset_index() df_join = df_gb_cut_color_sum.merge(df_gb_cut_sum, how='left', on='cut') df_join['prob'] = df_join['sum_x'] / df_join['sum_y']
我看了下代码,其实你要做的是
cut
每一个分类里面,求聚合。cut
里面,对color
的算n
的占比。想要上面两步想要一次性完成。 @slsongge
@slsongge
list(untracked = "jinxiaosong/analysis/some_join.ipynb",
untracked = "jinxiaosong/data/pivot_help.xlsx")
# add @slsongge data
import pandas as pd
df_raw = pd.read_excel("../data/pivot_help.xlsx")
# df_raw.head()
df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276 entries, 0 to 275
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cut 276 non-null object
1 color 276 non-null object
2 clarity 276 non-null object
3 n 276 non-null int64
dtypes: int64(1), object(3)
memory usage: 8.8+ KB
#### 衍生占比【这段代码用了两次聚合,一次merge,太麻烦了,我想要进行优化】
df_gb_cut_sum = df_raw.groupby('cut')['n'].agg(['sum']).reset_index()
df_gb_cut_color_sum = df_raw.groupby(['cut','color'])['n'].agg(['sum']).reset_index()
df_join = df_gb_cut_color_sum.merge(df_gb_cut_sum, how='left', on='cut')
df_join['prob'] = df_join['sum_x'] / df_join['sum_y']
df_join.head(10)
cut | color | sum\_x | sum\_y | prob | |
---|---|---|---|---|---|
0 | Fair | D | 163 | 1610 | 0.101242 |
1 | Fair | E | 224 | 1610 | 0.139130 |
2 | Fair | F | 312 | 1610 | 0.193789 |
3 | Fair | G | 314 | 1610 | 0.195031 |
4 | Fair | H | 303 | 1610 | 0.188199 |
5 | Fair | I | 175 | 1610 | 0.108696 |
6 | Fair | J | 119 | 1610 | 0.073913 |
7 | Good | D | 662 | 4906 | 0.134937 |
8 | Good | E | 933 | 4906 | 0.190175 |
9 | Good | F | 909 | 4906 | 0.185283 |
df_raw.groupby('cut').apply(lambda x: x.groupby('color')['n'].sum()/x['n'].sum())
color | D | E | F | G | H | I | J |
---|---|---|---|---|---|---|---|
cut | |||||||
Fair | 0.101242 | 0.139130 | 0.193789 | 0.195031 | 0.188199 | 0.108696 | 0.073913 |
Good | 0.134937 | 0.190175 | 0.185283 | 0.177538 | 0.143090 | 0.106400 | 0.062576 |
Ideal | 0.131502 | 0.181105 | 0.177532 | 0.226625 | 0.144541 | 0.097118 | 0.041576 |
Premium | 0.116235 | 0.169458 | 0.169023 | 0.212022 | 0.171126 | 0.103546 | 0.058589 |
Very Good | 0.125228 | 0.198643 | 0.179109 | 0.190283 | 0.150968 | 0.099652 | 0.056117 |
@slsongge
df <- read_excel("jinxiaosong/data/pivot_help.xlsx")
df %>%
group_by(cut,color) %>%
summarise(pctg = sum(n)) %>%
ungroup() %>%
group_by(cut) %>%
mutate(pctg = pctg/sum(pctg)) %>%
spread(color,pctg)
## # A tibble: 5 x 8
## # Groups: cut [5]
## cut D E F G H I J
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Fair 0.101 0.139 0.194 0.195 0.188 0.109 0.0739
## 2 Good 0.135 0.190 0.185 0.178 0.143 0.106 0.0626
## 3 Ideal 0.132 0.181 0.178 0.227 0.145 0.0971 0.0416
## 4 Premium 0.116 0.169 0.169 0.212 0.171 0.104 0.0586
## 5 Very Good 0.125 0.199 0.179 0.190 0.151 0.0997 0.0561
我看了下你的代码,我发现你的思路是按照 dplyr 的方式去实现的。 就是说靠函数/pipeline,把每个 scalars 求出来,然后再进行汇总,这是 dplyr 的风格。 也就是是,你在每一次 group_by 后,都需要统筹全局的去想想。
pandas 更喜欢自定义函数,你分好块,然后只需要思考其中一块怎么 lambda 函数完成,然后向量化操作。
思维上会不太一样。 建议你看一下 https://github.com/JiaxiangBU/learn_credit_risk/pull/125
如果对两个变量进行分组,返回的结果就不是数据框了,从而不能进行转置了
df_raw.groupby(by=['cut','clarity']).apply(lambda x: x.groupby('color')['n'].sum()/x['n'].sum())
如果对两个变量进行分组,返回的结果就不是数据框了,从而不能进行转置了
df_raw.groupby(by=['cut','clarity']).apply(lambda x: x.groupby('color')['n'].sum()/x['n'].sum())
这个数据透视表,index 有两个 levels,column 是一个 level,所以这样返回也正常,你想要的反馈结果是什么样子的?回复下我看看,我来调整 apply 对 index 的设置。 @slsongge
@JiaxiangBU 下图是我现在面临的情况,目前我找到的最简单的方法就是写if判断就好。当然最直接的方法是在当前代码中调整参数。
@slsongge
for i,(j,x) in enumerate(df_raw.groupby(['cut','clarity'])):
if i <= 0:
print(i,j,x)
print(x.__class__)
x_series = x.groupby('color')['n'].sum()/x['n'].sum()
print(x_series.__class__)
x_df = x_series.to_frame().T
print(x_df.shape)
0 ('Fair', 'I1') cut color clarity n
0 Fair D I1 4
8 Fair E I1 9
15 Fair F I1 35
23 Fair G I1 53
31 Fair H I1 52
38 Fair I I1 34
45 Fair J I1 23
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(1, 7)
因此充分了解其中的结构,固定进行转置.T
,而不要借助 groupby
合并时,默认转置。
df_raw.groupby(['cut','clarity']).apply(lambda x: pd.DataFrame(x.groupby('color')['n'].sum()/x['n'].sum()).T)
D | E | F | G | H | I | J | |||
---|---|---|---|---|---|---|---|---|---|
cut | clarity | ||||||||
Fair | I1 | n | 0.019048 | 0.042857 | 0.166667 | 0.252381 | 0.247619 | 0.161905 | 0.109524 |
IF | n | 0.333333 | NaN | 0.444444 | 0.222222 | NaN | NaN | NaN | |
SI1 | n | 0.142157 | 0.159314 | 0.203431 | 0.169118 | 0.183824 | 0.073529 | 0.068627 | |
SI2 | n | 0.120172 | 0.167382 | 0.190987 | 0.171674 | 0.195279 | 0.096567 | 0.057940 | |
VS1 | n | 0.029412 | 0.082353 | 0.194118 | 0.264706 | 0.188235 | 0.147059 | 0.094118 | |
VS2 | n | 0.095785 | 0.160920 | 0.203065 | 0.172414 | 0.157088 | 0.122605 | 0.088123 | |
VVS1 | n | 0.176471 | 0.176471 | 0.294118 | 0.176471 | 0.058824 | 0.058824 | 0.058824 | |
VVS2 | n | 0.130435 | 0.188406 | 0.144928 | 0.246377 | 0.159420 | 0.115942 | 0.014493 | |
Good | I1 | n | 0.083333 | 0.239583 | 0.197917 | 0.197917 | 0.145833 | 0.093750 | 0.041667 |
IF | n | 0.126761 | 0.126761 | 0.211268 | 0.309859 | 0.056338 | 0.084507 | 0.084507 | |
SI1 | n | 0.151923 | 0.227564 | 0.175000 | 0.132692 | 0.150641 | 0.105769 | 0.056410 | |
SI2 | n | 0.206290 | 0.186864 | 0.185939 | 0.150786 | 0.146161 | 0.074931 | 0.049029 | |
VS1 | n | 0.066358 | 0.137346 | 0.203704 | 0.234568 | 0.118827 | 0.158951 | 0.080247 | |
VS2 | n | 0.106339 | 0.163599 | 0.188139 | 0.196319 | 0.141104 | 0.112474 | 0.092025 | |
VVS1 | n | 0.069892 | 0.231183 | 0.188172 | 0.220430 | 0.166667 | 0.118280 | 0.005376 | |
VVS2 | n | 0.087413 | 0.181818 | 0.174825 | 0.262238 | 0.157343 | 0.090909 | 0.045455 | |
Ideal | I1 | n | 0.089041 | 0.123288 | 0.287671 | 0.109589 | 0.260274 | 0.116438 | 0.013699 |
IF | n | 0.023102 | 0.065182 | 0.221122 | 0.405116 | 0.186469 | 0.078383 | 0.020627 | |
SI1 | n | 0.172349 | 0.178888 | 0.141990 | 0.154134 | 0.178188 | 0.117702 | 0.056749 | |
SI2 | n | 0.137028 | 0.180523 | 0.174365 | 0.187067 | 0.173210 | 0.105466 | 0.042340 | |
VS1 | n | 0.097799 | 0.165227 | 0.171636 | 0.265534 | 0.130120 | 0.113681 | 0.056004 | |
VS2 | n | 0.181424 | 0.224019 | 0.173339 | 0.179452 | 0.109643 | 0.086373 | 0.045750 | |
VVS1 | n | 0.070347 | 0.163654 | 0.214949 | 0.290181 | 0.159257 | 0.087445 | 0.014167 | |
VVS2 | n | 0.108979 | 0.194551 | 0.199540 | 0.297007 | 0.110898 | 0.068304 | 0.020721 | |
Premium | I1 | n | 0.058537 | 0.146341 | 0.165854 | 0.224390 | 0.224390 | 0.117073 | 0.063415 |
IF | n | 0.043478 | 0.117391 | 0.134783 | 0.378261 | 0.173913 | 0.100000 | 0.052174 | |
SI1 | n | 0.155524 | 0.171748 | 0.170070 | 0.158322 | 0.183217 | 0.102657 | 0.058462 | |
SI2 | n | 0.142760 | 0.175992 | 0.177348 | 0.166836 | 0.176670 | 0.105799 | 0.054595 | |
VS1 | n | 0.065862 | 0.146807 | 0.145802 | 0.284565 | 0.168929 | 0.111111 | 0.076923 | |
VS2 | n | 0.100983 | 0.187370 | 0.184391 | 0.214775 | 0.158475 | 0.093834 | 0.060173 | |
VVS1 | n | 0.064935 | 0.170455 | 0.129870 | 0.277597 | 0.181818 | 0.136364 | 0.038961 | |
VVS2 | n | 0.108046 | 0.139080 | 0.167816 | 0.316092 | 0.135632 | 0.094253 | 0.039080 | |
Very Good | I1 | n | 0.059524 | 0.261905 | 0.154762 | 0.190476 | 0.142857 | 0.095238 | 0.095238 |
IF | n | 0.085821 | 0.160448 | 0.250000 | 0.294776 | 0.108209 | 0.070896 | 0.029851 | |
SI1 | n | 0.152469 | 0.193210 | 0.172531 | 0.146296 | 0.168827 | 0.110494 | 0.056173 | |
SI2 | n | 0.149524 | 0.211905 | 0.163333 | 0.155714 | 0.163333 | 0.095238 | 0.060952 | |
VS1 | n | 0.098592 | 0.165070 | 0.165070 | 0.243380 | 0.144789 | 0.115493 | 0.067606 | |
VS2 | n | 0.119259 | 0.194134 | 0.179853 | 0.184871 | 0.145118 | 0.105751 | 0.071015 | |
VVS1 | n | 0.065906 | 0.215463 | 0.220532 | 0.240811 | 0.145754 | 0.087452 | 0.024081 | |
VVS2 | n | 0.114170 | 0.241296 | 0.201619 | 0.244534 | 0.117409 | 0.057490 | 0.023482 |
df_raw.groupby(['cut']).apply(lambda x: pd.DataFrame(x.groupby('color')['n'].sum()/x['n'].sum()).T)
color | D | E | F | G | H | I | J | |
---|---|---|---|---|---|---|---|---|
cut | ||||||||
Fair | n | 0.101242 | 0.139130 | 0.193789 | 0.195031 | 0.188199 | 0.108696 | 0.073913 |
Good | n | 0.134937 | 0.190175 | 0.185283 | 0.177538 | 0.143090 | 0.106400 | 0.062576 |
Ideal | n | 0.131502 | 0.181105 | 0.177532 | 0.226625 | 0.144541 | 0.097118 | 0.041576 |
Premium | n | 0.116235 | 0.169458 | 0.169023 | 0.212022 | 0.171126 | 0.103546 | 0.058589 |
Very Good | n | 0.125228 | 0.198643 | 0.179109 | 0.190283 | 0.150968 | 0.099652 | 0.056117 |
我有一个数据框,四列:A、B、C、D D:是按照前三列分组进行的agg('count')
现在我要只对A、B分组,求D的count,然后对A分组求D的count,接下来join,最后求D的占比
数据
pivot_help.xlsx
我的实现代码
@JiaxiangBU