gazetteerhk / census_explorer

Explore Hong Kong's neighborhoods through visualizations of census data
http://gazetteer.hk
MIT License

Aggregated values from original xls #16

Open hupili opened 10 years ago

hupili commented 10 years ago

Redirection from:

https://github.com/hxu/hk_census_explorer/pull/14#discussion-diff-9333960

clacanzo commented 10 years ago

@hxu @hupili I am not sure whether you decided in the end to keep the original "median" aggregation from the census website for use in ours as well. Personally, the numbers looked "too round" to me, and I wonder if they are reliable. Are there formulas on the original website, or were these median values provided as raw data? If we decide to calculate the medians ourselves, I think this is actually possible, contrary to what @hxu was saying in the other thread. It does require us to make some assumptions, since we don't have access to the original person-by-person data, but by applying a weighted mean and mode to the different frequency blocks we can create formulas that I think could be representative. It also gives us the advantage of showing aggregated values the way we want them and the way we judge most meaningful for our purposes. If you want, I can help with this: I can write the formulas in text format, or show them in a Google spreadsheet.

Let me know. Of course, if we decide to take the originally provided median because it is faster and more accurate, that is fine by me. After all, the census department should be expert in statistics, so they can't be too far off.
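A minimal sketch of what such formulas could look like, assuming each frequency block is given as a lower bound, an upper bound and a head count. The bin edges and counts below are made up for illustration and are not taken from the census tables; the function names are hypothetical. The weighted mean treats everyone in a block as sitting at the block midpoint, which is itself an assumption.

```python
from typing import List, Tuple

# (lower bound, upper bound, number of people); hypothetical example data.
Bin = Tuple[float, float, float]

def weighted_mean(bins: List[Bin]) -> float:
    """Mean estimate that places every person at the midpoint of their block."""
    total = sum(count for _, _, count in bins)
    if total <= 0:
        raise ValueError("no observations")
    weighted_sum = sum((lower + upper) / 2.0 * count for lower, upper, count in bins)
    return weighted_sum / total

def modal_bin(bins: List[Bin]) -> Bin:
    """Block with the largest head count (only comparable for equal-width blocks)."""
    return max(bins, key=lambda b: b[2])

example_bins = [
    (0, 5000, 10),
    (5000, 10000, 20),
    (10000, 15000, 30),
    (15000, 20000, 40),
]

print(weighted_mean(example_bins))  # 12500.0
print(modal_bin(example_bins))      # (15000, 20000, 40)
```

An open-ended top block (for example "60,000 and over") has no midpoint, so it would need an explicit assumption about its upper bound before a formula like this can be applied.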

hupili commented 10 years ago

@clacanzo, the design decision is still open. I prefer to calculate from raw data wherever possible.

Median, unlike mean, can be accurately located within an interval, e.g. 10,000 - 14,999. As to the exact value within that interval, I don't think even the census department knows. Maybe we can ask someone who took part in the 2011 census. I suppose the questionnaire asks for income in terms of intervals rather than exact numbers (how awkward it would be to put down exact numbers). If so, the raw median in the original table is also an estimate.
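To illustrate the point about intervals, here is a rough sketch that finds the interval containing the median from grouped counts. The bins and counts are invented for illustration, and the function names are hypothetical. The second function goes one step further and guesses a point value by assuming people are spread evenly inside the median interval; that is only one possible assumption, and not necessarily what the census department does.

```python
from typing import List, Tuple

# (lower bound, upper bound, number of people); hypothetical example data.
Bin = Tuple[float, float, float]

def median_interval(bins: List[Bin]) -> Bin:
    """Return the bin in which the cumulative count first reaches half the total."""
    total = sum(count for _, _, count in bins)
    if total <= 0:
        raise ValueError("no observations")
    half = total / 2.0
    cumulative = 0.0
    for b in bins:
        cumulative += b[2]
        if cumulative >= half:
            return b
    raise AssertionError("unreachable for a positive total")

def interpolated_median(bins: List[Bin]) -> float:
    """Point estimate, ASSUMING values are uniformly spread within the median bin."""
    total = sum(count for _, _, count in bins)
    if total <= 0:
        raise ValueError("no observations")
    half = total / 2.0
    cumulative = 0.0
    for lower, upper, count in bins:
        if count > 0 and cumulative + count >= half:
            fraction = (half - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    raise AssertionError("unreachable for a positive total")

income_bins = [
    (0, 5000, 10),
    (5000, 10000, 20),
    (10000, 15000, 30),
    (15000, 20000, 40),
]

print(median_interval(income_bins))  # (10000, 15000, 30)
print(round(interpolated_median(income_bins), 2))  # 13333.33
```

The interval result depends only on the counts, while the interpolated value changes with whatever within-interval distribution is assumed.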

Besides, we don't need exact numbers for visualization purposes. Even if we had exact numbers, we might still want to quantise them into 10 intervals and pick 10 high-contrast colours for the map. A real-valued gradient plot does not do much for the human eye.
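A minimal sketch of the quantisation idea, assuming equal-width intervals between the smallest and largest value. The palette is a placeholder list of hex colours and the function name is hypothetical; other schemes, such as quantile-based intervals, would work just as well.

```python
from typing import List, Sequence

# Placeholder palette of 10 colours; any 10 distinguishable colours would do.
PALETTE = [
    "#313695", "#4575b4", "#74add1", "#abd9e9", "#e0f3f8",
    "#fee090", "#fdae61", "#f46d43", "#d73027", "#a50026",
]

def quantise(values: Sequence[float], n_bins: int = 10) -> List[int]:
    """Map each value to an interval index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: put everything in the first interval.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

district_values = [0, 5, 10, 95, 100]  # made-up numbers
indices = quantise(district_values)
colours = [PALETTE[i] for i in indices]

print(indices)  # [0, 0, 1, 9, 9]
```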

We don't know the formula used by the census department, but it looks like their numbers are finer-grained than what we can calculate from the table. I suggest we plot something on the map first. If the contrast level is not enough, we can make another table to incorporate the aggregated data.

@clacanzo you can write down the formulas you have in mind, with examples in a Google spreadsheet. It does not hurt to have a discussion if you have time.

I just scanned all the aggregate fields; there are only four types:

clacanzo commented 10 years ago

I am on it, @hupili. I will do this today and send you something.

clacanzo commented 10 years ago

Working on it. I have done quite a few already; you can follow what I do in this spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AnMgWbxp_0cVdG1BR1VnMEVBY2ZaX3dnZHNHUXNOVmc&usp=sharing