UDST / synthpop

Synthetic populations from census data
BSD 3-Clause "New" or "Revised" License
99 stars 47 forks source link

BUG: ipu not using all constraints #33

Closed Eh2406 closed 7 years ago

Eh2406 commented 7 years ago

Some constraints given to ipu are ignored. I just noted that the number of columns in _FrequencyAndConstraints.ncols is not equal to len(person_freq.columns) + len(household_freq.columns) This means that some of the constraints are not being used.

Now trying the 5 whys method:

Why 1?

Because the keys of OrderedDict must be unique, and the column names of person_freq and household_freq are sequential numbers. So the person constraints over write the household constraints.

Why 2?

Why are they sequential numbers, they are not in the tests? In the unit test the columns have unique meaningful names. Because we set them to sequential numbers. That is how it has been since @fscottfoti added it to git.

And that is as far as I go.

How do we fix?

Do we stop using a Dict as a backing for _FrequencyAndConstraints, what consequences does that have? Do we stop changing the index we send, what consequences does that have?

We noted this while we are finalizing our synthesized population so we could use a prompt fix.

janowicz commented 7 years ago

Thanks for catching this @Eh2406! Since you guys are probably most recently immersed in this code, do you have a sense for which of the two solution routes you mention seems most promising? If you think you're in a position to issue a pull request for this issue, go for it. I'll probably have capacity to dig into this too on Friday or early next week

Eh2406 commented 7 years ago

I'd lean toward removing h_constraint.index = h_jd.cat_id and p_constraint.index = p_jd.cat_id I could not figure out why it is necessary, but have not yet tried. @fscottfoti added it and probably had a good reason. alsw we need to check whether this is a user facing change.

I'd definitely add a assert self.ncols == len(person_freq.columns) + len(household_freq.columns) to and a test to ensure it does not sneak back.

fscottfoti commented 7 years ago

I'd guess it is necessary but I sure can't remember at this point. Also, why do you say it requires the index to be sequential integers? Aren't these lines doing the opposite - i.e. using the index that's passed in so the index doesn't have to be sequential integers?

Eh2406 commented 7 years ago

*_constraint.index is a descriptive name per row. *_jd.cat_id is sequential integers. So the code replaces names with integers. The bug is that names work, but integers don't. I will try removing those lines and see if it works this afternoon.

fscottfoti commented 7 years ago

Definitely seems like it's worth a try to remove and see if it still works for you.

Eh2406 commented 7 years ago

I have started work on the bug fix branch https://github.com/SEMCOG/synthpop/tree/bug_fix

Eh2406 commented 7 years ago

So It is not so simple. Removing that h_constraint.index = h_jd.cat_id and p_constraint.index = p_jd.cat_id did not cause the the tests to pass, the sequential integers come from https://github.com/SEMCOG/synthpop/blob/semcog/synthpop/synthesizer.py#L85 i.e. https://github.com/SEMCOG/synthpop/blob/semcog/synthpop/categorizer.py#L88

Eh2406 commented 7 years ago

@janowicz How is it going? Have you made progres?

janowicz commented 7 years ago

I have not gotten to this yet. Wasn't sure if you were already taking it on with the bug_fix branch above. I have time tomorrow to do some work on this.

Eh2406 commented 7 years ago

I added an assert to demonstrate the bug. But was unable to get the test to pass with the assert in place. and that is as far as I got.

janowicz commented 7 years ago

Cool, thanks! I'll work on this today

janowicz commented 7 years ago

Closing this issue for now thanks to #34. We can re-open as needed.