python：实现透视表中占比功能

slsongge commented 4 years ago

我有一个数据框，四列：A、B、C、D D：是按照前三列分组进行的agg('count')

现在我要只对A、B分组，求D的count，然后对A分组求D的count，接下来join，最后求D的占比

数据

我的实现代码

import pandas as pd

df_raw = pd.read_excel("D:/pivot_help.xlsx")
df_raw.head()

#### 衍生占比【这段代码用了两次聚合，一次merge，太麻烦了，我想要进行优化】
df_gb_cut_sum = df_raw.groupby('cut')['n'].agg(['sum']).reset_index()
df_gb_cut_color_sum = df_raw.groupby(['cut','color'])['n'].agg(['sum']).reset_index()

df_join = df_gb_cut_color_sum.merge(df_gb_cut_sum, how='left', on='cut')
df_join['prob'] = df_join['sum_x'] / df_join['sum_y']

#### 调整为支持可视化的数据
df_join_plot = pd.DataFrame(
    df_join \
      .drop(columns=['sum_x','sum_y'], axis=1) \
      .pivot(index='cut', columns='color', values='prob') \
      .to_records()
)

df_join_plot

@JiaxiangBU

JiaxiangBU commented 4 years ago

df_gb_cut_sum = df_raw.groupby('cut')['n'].agg(['sum']).reset_index()
df_gb_cut_color_sum = df_raw.groupby(['cut','color'])['n'].agg(['sum']).reset_index()

df_join = df_gb_cut_color_sum.merge(df_gb_cut_sum, how='left', on='cut')
df_join['prob'] = df_join['sum_x'] / df_join['sum_y']

我看了下代码，其实你要做的是

在对cut每一个分类里面，求聚合。
在某一个cut里面，对color的算n的占比。

想要上面两步想要一次性完成。 @slsongge

JiaxiangBU commented 4 years ago

@slsongge

list(untracked = "jinxiaosong/analysis/some_join.ipynb",
     untracked = "jinxiaosong/data/pivot_help.xlsx")
    # add @slsongge data

import pandas as pd

df_raw = pd.read_excel("../data/pivot_help.xlsx")
# df_raw.head()

df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276 entries, 0 to 275
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   cut      276 non-null    object
 1   color    276 non-null    object
 2   clarity  276 non-null    object
 3   n        276 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 8.8+ KB

#### 衍生占比【这段代码用了两次聚合，一次merge，太麻烦了，我想要进行优化】
df_gb_cut_sum = df_raw.groupby('cut')['n'].agg(['sum']).reset_index()
df_gb_cut_color_sum = df_raw.groupby(['cut','color'])['n'].agg(['sum']).reset_index()

df_join = df_gb_cut_color_sum.merge(df_gb_cut_sum, how='left', on='cut')
df_join['prob'] = df_join['sum_x'] / df_join['sum_y']

df_join.head(10)

	cut	color	sum\_x	sum\_y	prob
0	Fair	D	163	1610	0.101242
1	Fair	E	224	1610	0.139130
2	Fair	F	312	1610	0.193789
3	Fair	G	314	1610	0.195031
4	Fair	H	303	1610	0.188199
5	Fair	I	175	1610	0.108696
6	Fair	J	119	1610	0.073913
7	Good	D	662	4906	0.134937
8	Good	E	933	4906	0.190175
9	Good	F	909	4906	0.185283

df_raw.groupby('cut').apply(lambda x: x.groupby('color')['n'].sum()/x['n'].sum())

color	D	E	F	G	H	I	J
cut
Fair	0.101242	0.139130	0.193789	0.195031	0.188199	0.108696	0.073913
Good	0.134937	0.190175	0.185283	0.177538	0.143090	0.106400	0.062576
Ideal	0.131502	0.181105	0.177532	0.226625	0.144541	0.097118	0.041576
Premium	0.116235	0.169458	0.169023	0.212022	0.171126	0.103546	0.058589
Very Good	0.125228	0.198643	0.179109	0.190283	0.150968	0.099652	0.056117

JiaxiangBU commented 4 years ago

@slsongge

df <- read_excel("jinxiaosong/data/pivot_help.xlsx")

df %>%
    group_by(cut,color) %>%
    summarise(pctg = sum(n)) %>% 
    ungroup() %>% 
    group_by(cut) %>% 
    mutate(pctg = pctg/sum(pctg)) %>% 
    spread(color,pctg)

## # A tibble: 5 x 8
## # Groups:   cut [5]
##   cut           D     E     F     G     H      I      J
##   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 Fair      0.101 0.139 0.194 0.195 0.188 0.109  0.0739
## 2 Good      0.135 0.190 0.185 0.178 0.143 0.106  0.0626
## 3 Ideal     0.132 0.181 0.178 0.227 0.145 0.0971 0.0416
## 4 Premium   0.116 0.169 0.169 0.212 0.171 0.104  0.0586
## 5 Very Good 0.125 0.199 0.179 0.190 0.151 0.0997 0.0561

我看了下你的代码，我发现你的思路是按照 dplyr 的方式去实现的。就是说靠函数/pipeline，把每个 scalars 求出来，然后再进行汇总，这是 dplyr 的风格。也就是是，你在每一次 group_by 后，都需要统筹全局的去想想。

pandas 更喜欢自定义函数，你分好块，然后只需要思考其中一块怎么 lambda 函数完成，然后向量化操作。

思维上会不太一样。建议你看一下 https://github.com/JiaxiangBU/learn_credit_risk/pull/125

slsongge commented 4 years ago

如果对两个变量进行分组，返回的结果就不是数据框了，从而不能进行转置了

df_raw.groupby(by=['cut','clarity']).apply(lambda x: x.groupby('color')['n'].sum()/x['n'].sum())

JiaxiangBU commented 4 years ago

如果对两个变量进行分组，返回的结果就不是数据框了，从而不能进行转置了
df_raw.groupby(by=['cut','clarity']).apply(lambda x: x.groupby('color')['n'].sum()/x['n'].sum())

这个数据透视表，index 有两个 levels，column 是一个 level，所以这样返回也正常，你想要的反馈结果是什么样子的？回复下我看看，我来调整 apply 对 index 的设置。 @slsongge

slsongge commented 4 years ago

@JiaxiangBU 下图是我现在面临的情况，目前我找到的最简单的方法就是写if判断就好。当然最直接的方法是在当前代码中调整参数。

JiaxiangBU commented 4 years ago

@slsongge

for i,(j,x) in enumerate(df_raw.groupby(['cut','clarity'])):
    if i <= 0:
        print(i,j,x)
        print(x.__class__)
        x_series = x.groupby('color')['n'].sum()/x['n'].sum()
        print(x_series.__class__)
        x_df = x_series.to_frame().T
        print(x_df.shape)

0 ('Fair', 'I1')      cut color clarity   n
0   Fair     D      I1   4
8   Fair     E      I1   9
15  Fair     F      I1  35
23  Fair     G      I1  53
31  Fair     H      I1  52
38  Fair     I      I1  34
45  Fair     J      I1  23
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(1, 7)

因此充分了解其中的结构，固定进行转置.T，而不要借助 groupby 合并时，默认转置。

df_raw.groupby(['cut','clarity']).apply(lambda x: pd.DataFrame(x.groupby('color')['n'].sum()/x['n'].sum()).T)

结果太长。

			D	E	F	G	H	I	J
cut	clarity
Fair	I1	n	0.019048	0.042857	0.166667	0.252381	0.247619	0.161905	0.109524
	IF	n	0.333333	NaN	0.444444	0.222222	NaN	NaN	NaN
	SI1	n	0.142157	0.159314	0.203431	0.169118	0.183824	0.073529	0.068627
	SI2	n	0.120172	0.167382	0.190987	0.171674	0.195279	0.096567	0.057940
	VS1	n	0.029412	0.082353	0.194118	0.264706	0.188235	0.147059	0.094118
	VS2	n	0.095785	0.160920	0.203065	0.172414	0.157088	0.122605	0.088123
	VVS1	n	0.176471	0.176471	0.294118	0.176471	0.058824	0.058824	0.058824
	VVS2	n	0.130435	0.188406	0.144928	0.246377	0.159420	0.115942	0.014493
Good	I1	n	0.083333	0.239583	0.197917	0.197917	0.145833	0.093750	0.041667
	IF	n	0.126761	0.126761	0.211268	0.309859	0.056338	0.084507	0.084507
	SI1	n	0.151923	0.227564	0.175000	0.132692	0.150641	0.105769	0.056410
	SI2	n	0.206290	0.186864	0.185939	0.150786	0.146161	0.074931	0.049029
	VS1	n	0.066358	0.137346	0.203704	0.234568	0.118827	0.158951	0.080247
	VS2	n	0.106339	0.163599	0.188139	0.196319	0.141104	0.112474	0.092025
	VVS1	n	0.069892	0.231183	0.188172	0.220430	0.166667	0.118280	0.005376
	VVS2	n	0.087413	0.181818	0.174825	0.262238	0.157343	0.090909	0.045455
Ideal	I1	n	0.089041	0.123288	0.287671	0.109589	0.260274	0.116438	0.013699
	IF	n	0.023102	0.065182	0.221122	0.405116	0.186469	0.078383	0.020627
	SI1	n	0.172349	0.178888	0.141990	0.154134	0.178188	0.117702	0.056749
	SI2	n	0.137028	0.180523	0.174365	0.187067	0.173210	0.105466	0.042340
	VS1	n	0.097799	0.165227	0.171636	0.265534	0.130120	0.113681	0.056004
	VS2	n	0.181424	0.224019	0.173339	0.179452	0.109643	0.086373	0.045750
	VVS1	n	0.070347	0.163654	0.214949	0.290181	0.159257	0.087445	0.014167
	VVS2	n	0.108979	0.194551	0.199540	0.297007	0.110898	0.068304	0.020721
Premium	I1	n	0.058537	0.146341	0.165854	0.224390	0.224390	0.117073	0.063415
	IF	n	0.043478	0.117391	0.134783	0.378261	0.173913	0.100000	0.052174
	SI1	n	0.155524	0.171748	0.170070	0.158322	0.183217	0.102657	0.058462
	SI2	n	0.142760	0.175992	0.177348	0.166836	0.176670	0.105799	0.054595
	VS1	n	0.065862	0.146807	0.145802	0.284565	0.168929	0.111111	0.076923
	VS2	n	0.100983	0.187370	0.184391	0.214775	0.158475	0.093834	0.060173
	VVS1	n	0.064935	0.170455	0.129870	0.277597	0.181818	0.136364	0.038961
	VVS2	n	0.108046	0.139080	0.167816	0.316092	0.135632	0.094253	0.039080
Very Good	I1	n	0.059524	0.261905	0.154762	0.190476	0.142857	0.095238	0.095238
	IF	n	0.085821	0.160448	0.250000	0.294776	0.108209	0.070896	0.029851
	SI1	n	0.152469	0.193210	0.172531	0.146296	0.168827	0.110494	0.056173
	SI2	n	0.149524	0.211905	0.163333	0.155714	0.163333	0.095238	0.060952
	VS1	n	0.098592	0.165070	0.165070	0.243380	0.144789	0.115493	0.067606
	VS2	n	0.119259	0.194134	0.179853	0.184871	0.145118	0.105751	0.071015
	VVS1	n	0.065906	0.215463	0.220532	0.240811	0.145754	0.087452	0.024081
	VVS2	n	0.114170	0.241296	0.201619	0.244534	0.117409	0.057490	0.023482

df_raw.groupby(['cut']).apply(lambda x: pd.DataFrame(x.groupby('color')['n'].sum()/x['n'].sum()).T)

	color	D	E	F	G	H	I	J
cut
Fair	n	0.101242	0.139130	0.193789	0.195031	0.188199	0.108696	0.073913
Good	n	0.134937	0.190175	0.185283	0.177538	0.143090	0.106400	0.062576
Ideal	n	0.131502	0.181105	0.177532	0.226625	0.144541	0.097118	0.041576
Premium	n	0.116235	0.169458	0.169023	0.212022	0.171126	0.103546	0.058589
Very Good	n	0.125228	0.198643	0.179109	0.190283	0.150968	0.099652	0.056117

JiaxiangBU / tutoring2

python：实现透视表中占比功能 #61

数据

我的实现代码