brezniczky / drdigit

A digit doctoring detection package
GNU General Public License v3.0

Effort to eliminate ward count/vote count bias from scoring #10

Open brezniczky opened 5 years ago

brezniczky commented 5 years ago

Scores will very likely have different expected values and variances depending on the vote counts and the number of wards. This bias should be reduced.
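As a quick standalone illustration of that claim (this does not use the package's own scoring; a plain last-digit entropy stands in for the score), both the mean and the spread of the statistic under a uniform null shift with the number of wards:

```python
# Standalone illustration (not drdigit code): a plain last-digit entropy
# stands in for the score; its mean and spread under a uniform null are
# simulated for several ward counts.
import numpy as np

def last_digit_entropy(digits):
    """Empirical Shannon entropy (nats) of the last-digit distribution."""
    counts = np.bincount(digits, minlength=10)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log(probs)).sum())

rng = np.random.default_rng(1234)
n_iterations = 10000

for n_wards in (5, 10, 20, 50, 100):
    entropies = [last_digit_entropy(rng.integers(0, 10, size=n_wards))
                 for _ in range(n_iterations)]
    print("n_wards=%3d  mean=%.3f  sd=%.3f"
          % (n_wards, np.mean(entropies), np.std(entropies)))
```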

Plan (note to self)

I. for vote count only

Initial plan (a plan to scrap and rework iteratively ...):

1) model the distribution of vote counts (fit a power curve? not a panacea, see registered voters)
2) determine the simulated mean as a function of the vote count
3) examine/explore how the mean varies (e.g. how big are the differences ...)
4) compensate for it (a rough sketch of steps 2 and 4 follows below the plan)

--- etc. ---

II. largely repeat with ward count

III. go for the joint distro
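A rough, hypothetical sketch of what steps 2) and 4) of the initial plan could look like (score_fn is a placeholder for whichever per-municipality score is being debiased; none of this is drdigit API). It is shown for the ward count because a uniform-digit null is easy to simulate there; for the vote count only the null simulator would differ:

```python
# Hypothetical sketch (not the package's implementation): estimate the
# score's null mean by simulation as a function of the ward count, then
# centre the observed scores with it.
import numpy as np

def simulated_null_mean(score_fn, n_wards, n_iterations=5000, seed=1234):
    """Mean of score_fn over uniform random last digits for one ward count."""
    rng = np.random.default_rng(seed)
    scores = [score_fn(rng.integers(0, 10, size=n_wards))
              for _ in range(n_iterations)]
    return float(np.mean(scores))

def compensate(observed_scores, ward_counts, score_fn):
    """Subtract from each score the simulated null mean of its ward count."""
    null_means = {n: simulated_null_mean(score_fn, n) for n in set(ward_counts)}
    return np.array([score - null_means[n]
                     for score, n in zip(observed_scores, ward_counts)])
```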

brezniczky commented 5 years ago

After sleeping on it: possibly not an issue.

brezniczky commented 5 years ago

Okay, I think I know what I'm confused by (at the very least). It's a different aspect and does not address the previous point, but this one can at least be addressed.

I am treating the entropy probabilities as if they were all equally certain.

Now that is, however, a little rough-cut: the larger the sample (the higher the ward count), the more reliable the estimates are likely to be, at least on the grounds that the more wards there are,

1) the larger the sample is (so something CLT-like could be applicable?), and
2) the smaller the change one more observation (ward) could induce in the probability estimate, i.e. the smaller the impact an outlier value could have.

I guess that, if thought through, this would result in some GLS-like downweighting of the small-municipality results.
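A minimal sketch of such a downweighting (hypothetical, not drdigit code; weighting by the raw ward count is just one of several defensible choices): weight each municipality's log-probability by its ward count when aggregating, so the small municipalities contribute less.

```python
# Hypothetical sketch (not drdigit code): combine per-municipality
# probabilities into one score, weighting the log-probabilities by the
# ward counts so that the less reliable small municipalities count less.
import numpy as np

def weighted_log_score(probabilities, ward_counts):
    """Ward-count-weighted mean of the log-probabilities."""
    log_probs = np.log(np.asarray(probabilities, dtype=float))
    weights = np.asarray(ward_counts, dtype=float)
    weights = weights / weights.sum()
    return float(np.sum(weights * log_probs))

# Example: the tiny municipality's very low probability is dampened.
print(weighted_log_score([0.01, 0.40, 0.55], ward_counts=[3, 40, 60]))
```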

Fortunately (at least for the Hungarian data) the large wards tend to be at the top, and they are also likely to be at the top on theoretical grounds: the smallest probabilities are derivable for the largest ward counts, as their possible minimum probability is the lowest with avoid_inf=True.
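To make the theoretical part of that concrete (an illustration of the lower bound only, not of how avoid_inf is implemented internally): under a uniform null, the most extreme last-digit outcome for a municipality, all n wards sharing one last digit, has probability 10 * (1/10)^n, so the smallest probability that can in principle arise shrinks rapidly with the ward count.

```python
# Illustration only (not how avoid_inf works internally): probability of
# the most extreme outcome, all wards sharing the same last digit, under
# a uniform null, for a few ward counts.
for n_wards in (2, 5, 10, 20):
    min_prob = 10 * (1 / 10) ** n_wards
    print("n_wards=%2d  P(all last digits equal)=%.3g" % (n_wards, min_prob))
```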