UDST / synthpop

Synthetic populations from census data
BSD 3-Clause "New" or "Revised" License
99 stars 47 forks source link

raise if ipf marginals do not #29

Open Eh2406 opened 8 years ago

coveralls commented 8 years ago

Coverage Status

Coverage increased (+0.2%) to 81.474% when pulling 9c710a0f890f53dc3fa5f6f01457eff9ae7f8a49 on Eh2406:master into 99f8b83dd31b3b395618ac915d3b7a76d95fde43 on UDST:master.

Eh2406 commented 7 years ago

Just notest that this is still around.

@janowicz what do you think?

janowicz commented 7 years ago

Thanks for the reminder @Eh2406. Prior to the new RuntimeError that this PR introduces, would the existing RuntimeError ("'Maximum number of iterations reached during IPF") already be encountered in situations that this PR is trying to catch? In the ipf.py in this PR, is it strictly true that IPF will not converge if sums.max() - sums.min() is greater than .01, or is it possible that IPF could converge if sums.max() - sums.min() is greater than .01 by a modest amount?

Eh2406 commented 7 years ago

Thanks for the thoughtful response. I will try and explain so we can improve on my pr. :-)

So this pr is testing for the situation where marginal sum to different values. This easily can occur with rounding. In our uses this led to approximately half the bg being off by one in the number of households. When this form of bad input is given to ipf there is no "correct" output.

So for example in the new test: cat_owner is (40 + 60) == 100 but car_color is (50 + 31 + 20) == 101 so constraints.sum() is either 100 or 101 but not both. So with every loop it adjusts to 100 when adjusting to cat_owner then back to 101 when adjusting car_color. The values of constraints oscillate up and down depending on what part of the loop we are in. This infinite oscillation is what I meant by "will not converge."

Prior to the new RuntimeError that this PR introduces, would the existing RuntimeError ("'Maximum number of iterations reached during IPF") already be encountered in situations that this PR is trying to catch?

No, the value of constraints oscillate but the loop only checks at one spot in the cycle so max_iterations will not be hit. But the "correctness" of this output is somewhat arbitrary; checking at a different part of the cycle will give you a different answer.

In the ipf.py in this PR, is it strictly true that IPF will not converge if sums.max() - sums.min() is greater than .01, or is it possible that IPF could converge if sums.max() - sums.min() is greater than .01 by a modest amount?

It will not converge by my definition if sums.max() - sums.min() > 0. .01 is an arbitrary small number, as testing for 0 is generally a bad idea. I have sean no mathematical analysis of what ipf dose with bad input. So I do not know what cut off is best.

Does this make sense? how can I improve this pr?