dsfsi / covid19za

Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa
https://dsfsi.github.io/covid19za-dash/
MIT License

[DATA] population statistics error #436

Open lethabo24 opened 4 years ago

lethabo24 commented 4 years ago

Which Dataset

The za_province_pop

Error Description

The Gauteng and NorthWest populations do not correspond to the National Statistics PDF document


shaze commented 4 years ago

Thanks -- I see digits transposed in the Northwest figures, but can't see the problem in Gauteng: 15176115 corresponds to Figure 1 on page vi. Could you elaborate please.

lethabo24 commented 4 years ago

Good Day

The SA Stats document in the appendix notes the Gauteng Population as 15176116.

Kind Regards Lethabo Maluleke


vukosim commented 4 years ago

Thanks @18306063

@shaze we also have the Stats SA mid-year estimates now in the staging area folder. We might want to make a choice on where to put them, maybe

data/official_statistics/

shaze commented 4 years ago

OK -- the one in data/district_data has been there longer, so there may be scripts dependent on it. But it is easy to change, and it is more important to have it in the right logical place, so I have no objection to moving or replacing it.

But if we use the new file, I think it needs to be made program friendly -- if you read it in with Pandas, it seems the columns are read as text by default, and it is even harder to handle if not using Pandas.
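A minimal sketch of the text-column problem, assuming (the thread does not confirm this) that the figures load as text because they are written with spaces as thousands separators. The column names and the second row are hypothetical placeholders; the Gauteng figure is the one quoted earlier in this thread.

```python
import io

import pandas as pd

# Hypothetical sample: the real file layout is not shown in this thread.
# "Placeholder Province" and its value are made up for illustration.
raw = "province,population\nGauteng,15 176 115\nPlaceholder Province,1 234 567\n"

# Read naively: the embedded spaces make pandas keep the column as text.
as_text = pd.read_csv(io.StringIO(raw))


def to_int(series):
    """Strip all whitespace from a text column and convert it to integers."""
    return series.str.replace(r"\s+", "", regex=True).astype("int64")


clean = as_text.assign(population=to_int(as_text["population"]))
print(clean.dtypes)
```

If the cause turns out to be something else (footnote markers, for example), the same pattern applies with a different cleaning step, but storing plain digits in the file avoids the conversion altogether.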

vukosim commented 4 years ago

@elolelo Can you comment.

shaze commented 4 years ago

Hi Lethabo

Thanks -- it seems that they have slightly contradictory figures in the same document. Fortunately they are only off by 1, so way below any error margin (also, adding the provincial figures does not give the total figure, so we can't check that way to find which is correct).
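To put the size of that discrepancy in numbers, here is a short sketch using only the two Gauteng figures quoted in this thread; the variable names are illustrative.

```python
# Both figures are quoted earlier in this thread.
figure_1 = 15176115  # Gauteng, Figure 1 on page vi
appendix = 15176116  # Gauteng, appendix of the same document

diff = abs(appendix - figure_1)
relative = diff / figure_1

print(f"absolute difference: {diff}")
print(f"relative difference: {relative:.1e}")
```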

The NW figure is definitely wrong. Will fix and push with today's figures in a few minutes.

elolelo commented 4 years ago

So, the Gauteng value in question can be found in this file; a breakdown of that figure can be found in this one.

elolelo commented 4 years ago

@elolelo Can you comment.

I am not sure to what extent these new files are program friendly. They may be changed if necessary.

shaze commented 4 years ago

Thanks. Ideally they must be computer-readable -- Pandas is the most flexible so readable by Pandas is essential.

Also, for the age breakdown file, I think having 5 provinces followed by 4 provinces is very difficult for a computer to follow.

Two possible formats are below. My preference would be for 1 though 2 is what we're doing in other places and may be more human friendly.

  1. Column-wise

Have columns: province, age group, male, female, total

Province is repeated

  2. Row-wise

Using the same format that we're using for keys. Have 27 columns, 3 for each province:

Eastern Cape\tMales,Eastern Cape\tFemales,Eastern Cape\tTotal,Free State\tMales,......

Note that this uses the same convention as we do for districts -- spaces separating words in names of provinces and tabs separating the name of the province from the category. This approach is very readable in GitHub, and programs can parse it easily; using the convention of tabs separating the province name from the category means that
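The two layouts described above can be sketched with pandas as follows. This is only an illustration: the age-group labels and every number are made-up placeholders rather than Stats SA data, the lower-case column names for format 1 are an assumption, and only two provinces are shown instead of nine.

```python
import io

import pandas as pd

# Format 1 (column-wise): one row per province and age group.
# Every number below is a made-up placeholder, not a Stats SA figure.
long_df = pd.DataFrame(
    {
        "province": ["Eastern Cape", "Eastern Cape", "Free State", "Free State"],
        "age group": ["0-4", "5-9", "0-4", "5-9"],
        "male": [10, 11, 20, 21],
        "female": [12, 13, 22, 23],
        "total": [22, 24, 42, 44],
    }
)

# Format 2 (row-wise): three columns per province, named "<province>\t<category>".
labels = [("male", "Males"), ("female", "Females"), ("total", "Total")]
wide = {}
for province, group in long_df.groupby("province", sort=False):
    by_age = group.set_index("age group")
    for source, label in labels:
        wide[f"{province}\t{label}"] = by_age[source]
wide_df = pd.DataFrame(wide)

# Round-trip through CSV text, then split each header on the tab to recover
# province and category as two levels of a column MultiIndex.
csv_text = wide_df.to_csv(index_label="age group")
parsed = pd.read_csv(io.StringIO(csv_text), index_col="age group")
parsed.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split("\t")) for name in parsed.columns],
    names=["province", "category"],
)

print(parsed["Free State"])  # Males, Females and Total for one province
```

Format 1 needs no header parsing at all, while format 2 needs the one extra step of splitting each header on the tab.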

Final point -- I note that in several places the total is not equal to the sum of males and females. I doubt that these figures were produced at a time when non-binary categories were allowed, so they are likely to be errors (in the source document). It might be worth pointing this out in the README. The discrepancy is so small as to be inconsequential for any work being done.
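A check along these lines could list the affected rows for the README. It is a sketch that assumes the column-wise layout with columns named province, age group, male, female and total; all figures are made-up placeholders, with one mismatch planted deliberately.

```python
import pandas as pd

# Placeholder data only; the second row is deliberately off by 1.
df = pd.DataFrame(
    {
        "province": ["Eastern Cape", "Eastern Cape", "Free State"],
        "age group": ["0-4", "5-9", "0-4"],
        "male": [10, 11, 20],
        "female": [12, 13, 22],
        "total": [22, 25, 42],
    }
)

# Positive values mean the stated total exceeds male + female.
df["discrepancy"] = df["total"] - (df["male"] + df["female"])
mismatches = df[df["discrepancy"] != 0]

print(mismatches[["province", "age group", "discrepancy"]])
```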

Many thanks for all this work -- it is very helpful