dtandev / coronavirus

2020 Poland coronavirus data (COVID-19 / 2019-nCoV)
MIT License

SIMC designators for cities #2

Closed not7cd closed 5 months ago

not7cd commented 4 years ago

Using only the province and city name will result in misleading data. There are cases where one city name occurs multiple times within a single province. For example, Sobolewo appears three times in province 20:

        WOJ  POW  GMI  RODZ_GMI  RM  MZ     NAZWA     SYM  SYMPOD     STAN_NA
68609    20    2    9         5   1   1  Sobolewo   41619   41619  2020-01-01
69059    20   13    4         2   1   1  Sobolewo  398735  398735  2020-01-01
69876    20   12    7         2   1   1  Sobolewo  769143  769143  2020-01-01
72498    22    1    3         2   0   1  Sobolewo  742374  742339  2020-01-01
96850    30    2    2         2   0   1  Sobolewo  524884  524878  2020-01-01
100188   32    2    4         2   2   1  Sobolewo  182484  182484  2020-01-01

http://eteryt.stat.gov.pl/eTeryt/rejestr_teryt/informacje_podstawowe/informacje_podstawowe.aspx
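The ambiguity can be checked mechanically. The following is a minimal sketch, assuming the SIMC excerpt above is loaded into a pandas DataFrame; only the columns needed here are copied, and the variable names are illustrative.

```python
import pandas as pd

# Rows copied from the SIMC excerpt above (subset of columns).
simc = pd.DataFrame(
    {
        "WOJ": [20, 20, 20, 22, 30, 32],
        "POW": [2, 13, 12, 1, 2, 2],
        "GMI": [9, 4, 7, 3, 2, 4],
        "NAZWA": ["Sobolewo"] * 6,
        "SYM": [41619, 398735, 769143, 742374, 524884, 182484],
    }
)

# Count how many localities share each (province, name) pair.
per_province = (
    simc.groupby(["WOJ", "NAZWA"]).size().rename("count").reset_index()
)

# Pairs that occur more than once cannot be resolved by name alone.
ambiguous = per_province[per_province["count"] > 1]
print(ambiguous)

# The SYM designator, in contrast, is unique per row in this excerpt.
print(simc["SYM"].is_unique)
```

In this excerpt, only the (province 20, Sobolewo) pair is flagged, while every row has a distinct SYM value.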

dtandev commented 4 years ago

True, but if you don't have the full address of the infected patient, you cannot link them to the correct Sobolewo anyway. In the news, you only get the name Sobolewo (if you are lucky). If you have a better idea for cases like that, we will consider it.

not7cd commented 4 years ago

I could start by adding the WOJ code for provinces. I have some code that deals with that in a notebook here: https://github.com/not7cd/covid19-poland-data This will add a new column to your work, but will simplify further analysis.

dtandev commented 4 years ago

Identifying the correct village is not a problem for the geopy library if you have the full address (village name and postal code could be enough). The problem is: "how can you find out which Sobolewo the infected patient lives in?" The code is not the problem. The problem is getting detailed data within a very short period of time (24-36h).

Can you give us the correct SYMPOD code based on the newspaper article below? https://www.se.pl/bialystok/koronawirus-dotarl-na-podlasie-zarazony-mezczyzna-to-lider-zespolu-black-metalowego-aa-CkDj-cz3K-F6Np.html

not7cd commented 4 years ago

Here we have a clue in the text: Białystok. So this will count toward pow. białostocki (Białystok county). But I guess there will be more duplicates. I wonder if we can throw some NLP at this. Maybe just detecting a bigger city nearby will improve this. I don't follow the news closely, but the article mentions an institution that does the testing. That could also be leveraged. I need to think about it.
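A county clue like this can narrow the candidates without any NLP. The following is a minimal sketch, not a tested pipeline: the `resolve` helper is hypothetical, and it is an assumption here that pow. białostocki corresponds to POW code 2 within WOJ 20.

```python
import pandas as pd

# Province-20 rows copied from the SIMC excerpt in the first comment.
simc = pd.DataFrame(
    {
        "WOJ": [20, 20, 20],
        "POW": [2, 13, 12],
        "NAZWA": ["Sobolewo", "Sobolewo", "Sobolewo"],
        "SYM": [41619, 398735, 769143],
    }
)


def resolve(simc, name, woj, pow_hint=None):
    """Hypothetical helper: return the SYM of the single matching
    locality, or None if the match is missing or still ambiguous."""
    candidates = simc[(simc["NAZWA"] == name) & (simc["WOJ"] == woj)]
    if pow_hint is not None:
        # Narrow by county code when the article gives a county clue.
        candidates = candidates[candidates["POW"] == pow_hint]
    if len(candidates) == 1:
        return int(candidates["SYM"].iloc[0])
    return None


# Name and province alone are not enough for Sobolewo in province 20.
without_hint = resolve(simc, "Sobolewo", 20)

# Assumption: the Białystok clue maps to POW == 2.
with_hint = resolve(simc, "Sobolewo", 20, pow_hint=2)
print(without_hint, with_hint)
```

Returning None instead of guessing keeps ambiguous records visible so they can be resolved by hand.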

not7cd commented 4 years ago

I will prepare a script that adds eTeryt columns to unambiguous records. Then we can resolve conflicting ones.
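A minimal sketch of what such a script might look like, assuming a case table with hypothetical columns `case_id`, `WOJ`, and `city` (the real database layout may differ). SIMC rows are copied from the excerpt in the first comment.

```python
import pandas as pd

# SIMC rows copied from the excerpt in the first comment.
simc = pd.DataFrame(
    {
        "WOJ": [20, 20, 20, 22, 30, 32],
        "POW": [2, 13, 12, 1, 2, 2],
        "GMI": [9, 4, 7, 3, 2, 4],
        "NAZWA": ["Sobolewo"] * 6,
        "SYM": [41619, 398735, 769143, 742374, 524884, 182484],
    }
)

# Hypothetical case records; the column names are assumptions.
cases = pd.DataFrame(
    {"case_id": [1, 2], "WOJ": [20, 22], "city": ["Sobolewo", "Sobolewo"]}
)

# Keep only SIMC rows whose (WOJ, NAZWA) pair occurs exactly once.
unambiguous = simc.drop_duplicates(subset=["WOJ", "NAZWA"], keep=False)

# A left join keeps every case; ambiguous ones get NaN in the new columns.
enriched = cases.merge(
    unambiguous,
    how="left",
    left_on=["WOJ", "city"],
    right_on=["WOJ", "NAZWA"],
).drop(columns="NAZWA")

# Records that still need manual resolution.
needs_review = enriched[enriched["SYM"].isna()]
print(enriched)
print(needs_review)
```

Because the join is a left join, no case is dropped: records with an ambiguous name simply carry empty eTeryt columns until someone resolves them.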

dtandev commented 4 years ago

Yes, we have, but... you know... that was the first patient from podlaskie province. It was very easy, because there was a lot of information :) We have 2 patients from Sobolewo in our database. The second case was identified based on this: "Jak poinformowało Polskie Radio Białystok, druga ofiara koronawirusa na Podlasiu to dziecko „pacjenta zero”" (As reported by Polskie Radio Białystok, the second coronavirus victim in Podlasie is the child of "patient zero"). So your NLP app has to understand who "pacjent zero" is and what a child is, and has to know that "children and parents live together if the parents are not too old". :-) For you, this sentence is easy to understand. But for NLP...

But... before that happened, we had something like this: https://www.radio.bialystok.pl/wiadomosci/index/id/180784 That was the second patient in podlaskie. Do you see the difficult problems for NLP? We have two "second patients" from 2 different cities. Which information is correct?*

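Full information about the third patient: "Trzecia osoba zakażona na Podlasiu to kobieta w sile wieku z powiatu białostockiego z kwarantanny domowej, która wróciła z Wielkiej Brytanii." (The third person infected in Podlasie is a woman in her prime from Białystok county, in home quarantine, who returned from Great Britain.) That's all we know...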

*One day after this publication, a second medical test ruled out coronavirus. The woman was healthy. So your NLP has to detect that and remove the patient from the database.

dtandev commented 4 years ago

> I will prepare a script that adds eTeryt columns to unambiguous records. Then we can resolve conflicting ones.

I'm afraid not. We won't. You can if you want.

Currently, the database is maintained by 5-6 people. They do it in their free time and are not paid. They have to find specific information about 200-300 patients per day right now. Next week, it could be 500 cases. Do you really think they have time to look up the eTeryt codes of the cities they find? They don't. You have access to their database. You can use it for your own projects. Try to be more grateful and don't give them extra work. Thanks :)

not7cd commented 4 years ago

> I'm afraid not. We won't. You can if you want.

> [...] You have access to their database. You can use it for your own projects. Try to be more grateful and don't give them extra work. Thanks :)

Sorry for the misunderstanding. Your current work on it is outstanding. By "we" I meant your collaboration and comments. If I can come up with anything, it shouldn't add more friction.

dtandev commented 4 years ago

> Sorry for the misunderstanding. Your current work on it is outstanding. By "we" I meant your collaboration and comments. If I can come up with anything, it shouldn't add more friction.

I saw your map. :+1: Could you generate lists of counties in mazowieckie province with the number of confirmed infections for every day since 4.03? I found good data sources, but it is hard to compare them with our database without summary statistics for counties.

not7cd commented 4 years ago

It's here https://github.com/not7cd/covid19-poland-data/tree/master/data-daily-pow

Just df[df["WOJ"] == 14]

I will publish my notebooks once I clean them up and cite the data sources used.
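For reference, the filter above can be extended into a per-county daily summary. This is a minimal sketch on toy rows: the `date`, `POW`, and `confirmed` column names are assumptions about the daily files, and the values are placeholders, not real case counts.

```python
import pandas as pd

# Toy rows with placeholder values; NOT real case counts.
# Column names other than WOJ are assumptions about the file layout.
df = pd.DataFrame(
    {
        "date": ["2020-03-04", "2020-03-04", "2020-03-05", "2020-03-05"],
        "WOJ": [14, 20, 14, 14],
        "POW": [65, 2, 65, 21],
        "confirmed": [1, 1, 2, 1],
    }
)

# Same filter as in the comment above: WOJ 14 is mazowieckie.
mazowieckie = df[df["WOJ"] == 14]

# One row per day, one column per county code, summed confirmed cases.
daily = mazowieckie.pivot_table(
    index="date",
    columns="POW",
    values="confirmed",
    aggfunc="sum",
    fill_value=0,
)
print(daily)
```

The wide table makes it straightforward to compare against an external source day by day, with missing county and day combinations shown as 0.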