UDST / synthpop

Synthetic populations from census data
BSD 3-Clause "New" or "Revised" License
98 stars 47 forks source link

scipy.stats.chisquare throws error in compare_constraints.py due to unequal sum of frequencies #75

Open doorleyr opened 2 years ago

doorleyr commented 2 years ago

Hi, thanks for the great library. I'm trying to use this to generate some populations and I'm consistently getting the same error, even when I run the examples in this repository.

"ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:" ...

This error is coming from scipy.stats.chisquare which is called by the compare_to_constraints function. It appears that the sum of constraints.values is not always equal to the sum of counts.values (within the required tolerance) and this causes the error.

timothywhiteley commented 2 years ago

Hi doorleyr, I came across the same issue. I believe that there has been a change to scipy (no error given on scipy 1.4.1). I bodged my code by changing synthpop/draw.py Line 167 to:
return chisquare(counts.values, constraints.values*np.mean(counts.values)/np.mean(constraints.values)) so that the frequencies is the same but I think there must be a better solution.

PyMap commented 2 years ago

Hi all, As both are pointing out, the scipy.stats.chisquare used here to compare synthetic population against target constraints has been updated. A new test was included in the chisquare class to check if observed vs expected frecuencies have the same sum or not. And if they don't, then an error is raised. Following the discussions in the hiperlinked PR above, it seems that chisquare test is valid only in the case that both frequency vectors sum to the same total.

The workaround proporsed by @timothywhiteley seems to be a good solution to achieve that requirement. I also tried by normalizing both vectors to still have distributions that do not match but that sum up to 1 (the same total).

Below the results:

/home/federico/federico/urbansim/spop/synthpop/synthpop/ipu/ipu.py:192: RuntimeWarning: divide by zero encountered in double_scalars
  adj = constraint / float((column * weights).sum())
Drawing 765 households
> /home/federico/federico/urbansim/spop/synthpop/synthpop/draw.py(171)compare_to_constraints()
-> return chisquare(counts.values, constraints.values)
(Pdb) chi # original chisquare test ran with previous scipy version (brefore the new test)
Power_divergenceResult(statistic=116.96711825518415, pvalue=1.5273460185773555e-06)
(Pdb) norm_chi # chisquare test with normalized vectors
Power_divergenceResult(statistic=0.04170304232710738, pvalue=1.0)
(Pdb) chisquare(counts.values, constraints.values*np.mean(counts.values)/np.mean(constraints.values)) # timothy workaround
Power_divergenceResult(statistic=118.01960978571391, pvalue=1.1317109362713438e-06)

As expected, normalizing by the constraints mean is closer to the result obtained before the new scipy test. I think we can still compare results by checking the resultant amount of persons by category trowed by the synthesizer. (cc @msoltadeo @alephcero @janowicz )

bhaveshneekhra commented 1 year ago

Is it resolved yet? @PyMap and @timothywhiteley the workaround does not seem to work as on Sat, 14 Jan, 2023. (I remember it worked for me on 18th Oct, 2022). Did something change in between?

werdnabae commented 1 year ago

Is it resolved yet? @PyMap and @timothywhiteley the workaround does not seem to work as on Sat, 14 Jan, 2023. (I remember it worked for me on 18th Oct, 2022). Did something change in between?

The workaround given by @timothywhiteley is still working for me

zaturalma2 commented 1 year ago

I do not understand why is it required to have the expected and observed counts to be the same. The test statistic can be calculated without this restriction ( sum{(observed_i-expected_i)^2/expected_i} ), the distribution table is given, I can not see any reason for this.

bbarrios6 commented 6 months ago

Work around not working Dec 21 2023, any solutions?

bhaveshneekhra commented 4 months ago

Hi doorleyr, I came across the same issue. I believe that there has been a change to scipy (no error given on scipy 1.4.1). I bodged my code by changing synthpop/draw.py Line 167 to: return chisquare(counts.values, constraints.values*np.mean(counts.values)/np.mean(constraints.values)) so that the frequencies is the same but I think there must be a better solution.

Worked for me on macos 14.3.1 with python 3.11