bashtage / linearmodels

Additional linear models including instrumental variable and panel data models that are missing from statsmodels.
https://bashtage.github.io/linearmodels/
University of Illinois/NCSA Open Source License

Model fit error: ZeroDivisionError: float division by zero #247

Closed: yijunwang0805 closed this issue 4 years ago

yijunwang0805 commented 4 years ago

Hi,

It is me again. Thank you for reading this issue.

Following up on #246, we have this dataframe:

import numpy as np
import pandas as pd
from linearmodels import PanelOLS
data = {'y':[1,2,3,1,0,3],
        'x1': [0,1,2,3,0,2],
        'x2':[1,1,3,2,1,0],
        't': pd.to_datetime(['2020-02-18', '2020-02-18', '2020-02-17', '2020-02-18', '2020-02-18', '2020-02-17']),
        'province': ['A', 'A','A','B','B','B'],
        'city': ['a','b','a','a','c','a']}
dataframe = pd.DataFrame (data, columns = ['y','x1', 'x2', 't', 'province', 'city'])
dataframe["city-provence"] = [(c,p) for c,p in zip(dataframe.city, dataframe.province)]
dataframe = dataframe.set_index(["city-provence","t"])
dataframe

When I try to run the panel regression

mod = PanelOLS(dataframe.y, dataframe[['x1','x2']], entity_effects=True)
mod.fit(cov_type='clustered', cluster_entity=True)

There is an error message

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-143-c9fcf08c6567> in <module>
      1 mod = PanelOLS(dataframe.y, dataframe[['x1','x2']], entity_effects=True)
----> 2 mod.fit(cov_type='clustered', cluster_entity=True)

C:\ProgramData\Anaconda3\lib\site-packages\linearmodels\panel\model.py in fit(self, use_lsdv, use_lsmr, low_memory, cov_type, debiased, auto_df, count_effects, **cov_config)
   1753             num = (resid_ss_pooled - resid_ss) / df_num
   1754 
-> 1755             denom = resid_ss / df_denom
   1756             stat = num / denom
   1757             f_pooled = WaldTestStatistic(

ZeroDivisionError: float division by zero

Intuitively, this means df_denom is zero, so the division fails.

I wonder what I have done incorrectly. (With my actual dataframe, I have the same issue as illustrated in this example.)

Thank you for your time.

Yijun Wang

yijunwang0805 commented 4 years ago

I was thinking, given that the Wald test statistic is given by

If

is essentially a zero matrix, that is, the parameters of x1 and x2 are zero, then it is dividing by zero. Hence, it yields the error.

However, if this is true, ideally, we should have just failed to reject the null hypothesis that

instead of getting the error.

bashtage commented 4 years ago

What appears to be happening here is that your model has no degrees of freedom left and so df_denom is 0. This is a bug and should be caught so that an InvalidTestStatistic can be returned rather than a WaldTestStatistic.
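The guard described above can be sketched as follows. This is only an illustration and not the actual linearmodels fix: `pooled_f_statistic` is a hypothetical helper, the input numbers are made up, and it returns `None` where the library would return an InvalidTestStatistic. Treating a perfect fit (`resid_ss` of 0) as undefined is an extra assumption of this sketch.

```python
from typing import Optional


def pooled_f_statistic(
    resid_ss_pooled: float, resid_ss: float, df_num: int, df_denom: int
) -> Optional[float]:
    """Hypothetical helper: F statistic for poolability, or None if undefined."""
    # With no residual degrees of freedom (or a perfect fit) the ratio
    # below is undefined, so report that instead of dividing by zero.
    if df_num <= 0 or df_denom <= 0 or resid_ss <= 0.0:
        return None
    num = (resid_ss_pooled - resid_ss) / df_num
    denom = resid_ss / df_denom
    return num / denom


# Saturated model, as in this issue (resid_ss_pooled is a made-up value).
print(pooled_f_statistic(resid_ss_pooled=5.0, resid_ss=0.0, df_num=3, df_denom=0))  # None

# A well-posed case with made-up numbers: (4 / 2) / (8 / 16).
print(pooled_f_statistic(resid_ss_pooled=12.0, resid_ss=8.0, df_num=2, df_denom=16))  # 4.0
```

Returning a sentinel keeps the rest of the fit usable: the parameter estimates are still computed, and only the test that cannot be formed is marked invalid.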

bashtage commented 4 years ago

If you had a richer dataset you would not see this issue.
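To see why the toy data leaves nothing over, here is a rough count of residual degrees of freedom. It is a simplified accounting that assumes one effect per entity, no separate constant, and regressors that are not collinear with the effects. `residual_df` is a hypothetical helper, and the "richer" panel is invented for comparison.

```python
from collections import Counter


def residual_df(entities, n_regressors):
    """Hypothetical helper: observations minus entity effects minus slopes."""
    counts = Counter(entities)
    n_obs = sum(counts.values())
    return n_obs - len(counts) - n_regressors


# The (city, province) labels of the six rows in the example dataframe.
toy = [("a", "A"), ("b", "A"), ("a", "A"), ("a", "B"), ("c", "B"), ("a", "B")]
print(residual_df(toy, n_regressors=2))  # 6 - 4 - 2 = 0

# A made-up richer panel: the same 4 entities observed 5 times each.
richer = [entity for entity in set(toy) for _ in range(5)]
print(residual_df(richer, n_regressors=2))  # 20 - 4 - 2 = 14
```

Under these assumptions each entity with T observations contributes T - 1 to the total, so entities observed only once add nothing.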

yijunwang0805 commented 4 years ago

Hi Kevin,

Thank you for the suggestion.

Since the degrees of freedom are calculated as df = N - 1, where N is the sample size, I looked up my sample size.

The dataset is 7276 rows × 5 columns, and below are the counts for each group

city-province
(akesu, xinjiang)               11
(ankang, shaanxi)               30
(anqing, anhui)                 30
(anshan, liaoning)              19
(haibei, qinghai)                9
(haikou, hainan)                32
(handan, hebei)                 23
(hangzhou, zhejiang)            31
(hanzhong, shaanxi)             29
(hebi, henan)                   23
(hechi, guangxi)                30
(hefei, anhui)                  30
(hegang, heilongjiang)          20
(heihe, heilongjiang)           18
(hengshui, hebei)               22
(hengyang, hunan)               27
(heyuan, guangdong)             23
(heze, shandong)                34
(hezhou, guangxi)               21
(honghe, yunnan)                 8
(huaibei, anhui)                27
(huaihua, hunan)                29
(huainan, anhui)                26
(huanggang, hubei)              35
(huangshan, anhui)              24
(huangshi, hubei)               32
(huizhou, guangdong)            38
(huludao, liaoning)             31
(huzhou, zhejiang)              25
(jiamusi, heilongjiang)         25
(jiangmen, guangdong)           29
(jiaozuo, henan)                23
(jiaxing, zhejiang)             30
(jieyang, guangdong)            25
(jilin, jilin)                  15
(jinan, shandong)               34
(jinchang, gansu)               19
(jincheng, shanxi)              18
(jingdezhen, jiangxi)           26
(jingmen, hubei)                35
(jingzhou, hubei)               35
(jinhua, zhejiang)              32
(jining, shandong)              34
(jinzhong, shanxi)              26
(jinzhou, liaoning)             28
(jiujiang, jiangxi)             31
(jixi, heilongjiang)            24
(kaifeng, henan)                23
(kunming, yunnan)               39
(laibin, guangxi)               13
(langfang, hebei)               24
(lanzhou, gansu)                32
(leshan, sichuan)               24
(liangshan, sichuan)             4
(ningde, fujian)                32
(panjin, liaoning)              31
etc

Ideally the sample size should be enough.

But just in case, I removed (liangshan, sichuan), since it only has 4 observations, using df = df.drop(df[(df.city == 'liangshan') & (df.province == 'sichuan')].index). It worked!

I am curious, what about (tacheng, xinjiang)? It only has 2 observations.

I re-ran the entire Jupyter notebook from the start. The same things that did not work one and two days ago started to work.

I guess sometimes I just need to restart the laptop.

Thank you!

Yijun