ThisIsIsaac / Data-Science-for-COVID-19

COVID-19 Korea Dataset & Comprehensive Medical Dataset & visualizer
278 stars, 37 forks

Clarification about some observations in PatientRoute #18

Closed rickypinci closed 4 years ago

rickypinci commented 4 years ago

Hi, thanks for collecting, organizing, and sharing this data. I am opening this issue to ask for clarification about some observations in the PatientRoute data set.

1) Why are some location types in the PatientRoute data set labeled as "etc"? How should those values be interpreted?
2) Checking some coordinates with Google Maps, it seems that they do not point to a specific building/location. Some of them point to the middle of a street or to a crossing. For example, coordinates "35.235299, 128.670257" (patient_id = 6100000088, on March 16, 2020) point to a crossing/road. Is that right? How were these coordinates obtained?
3) Is it possible that some locations are labeled incorrectly? For example, coordinates "37.4562557, 126.7052062" are labeled as "airport", "etc", "restaurant", and "public_transportation". However, checking with Kakao Map or Naver Map, that location seems to be a "City Hall" (and it is far from any airport).
4) There is at least one patient (3001000003) in the PatientRoute data set who seems to visit different hospitals over one week. However, when checking the coordinates on map services, only one of those locations is a hospital (37.818481, 128.857753), while the other locations are far from hospitals.
5) Why are there only 1472 patients in the PatientRoute data set, while there are 4004 patients in the PatientInfo one? These counts come from the latest update.

Thanks for your help and your time.

ThisIsIsaac commented 4 years ago

Hey @rickypinci,

  1. Why are some location types in the PatientRoute data set labeled as "etc"? How should those values be interpreted?

We use automated code and rule-based methods to assign classes (e.g. restaurant, beauty salon, ...). Sometimes there are outliers that our rules do not catch; those are labeled "etc".
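To illustrate the idea of a rule-based classifier with an "etc" fallback, here is a minimal sketch. It is not the project's actual code: the `RULES` table, its trigger words, and the function name `classify_location` are all hypothetical.

```python
# Hypothetical keyword rules; the project's real rules are not shown in this thread.
RULES = {
    "restaurant": ("restaurant", "mcdonalds", "cafe"),
    "hospital": ("hospital", "clinic"),
    "airport": ("airport",),
}


def classify_location(keyword: str) -> str:
    """Return the first class whose trigger word appears in the keyword, else "etc"."""
    text = keyword.lower()
    for label, triggers in RULES.items():
        if any(trigger in text for trigger in triggers):
            return label
    return "etc"  # no rule matched, so the location falls into the catch-all class


print(classify_location("McDonalds near Gangnam station exit 3"))  # restaurant
print(classify_location("City Hall"))  # etc
```

Under rules like these, any place type that nobody wrote a rule for ends up as "etc", so that label is best read as "unclassified" rather than as a category of its own.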

  2. Checking some coordinates with Google Maps, it seems that they do not point to a specific building/location. Some of them point to the middle of a street or to a crossing. For example, coordinates "35.235299, 128.670257" (patient_id = 6100000088, on March 16, 2020) point to a crossing/road. Is that right? How were these coordinates obtained?

The data is initially collected in natural language, for example: "Patient 1000 visited McDonalds near Gangnam station exit 3, then visited a nearby barber shop before going back to his home by 3pm." We use rule-based methods to parse location keywords from this text. Because the sentences are written by many different people with no unifying rule, they are difficult to parse, and the parsed keywords (e.g. "McDonalds near Gangnam station exit 3") are often imperfect. We then look the keywords up with the Google Maps API; since its search capabilities are pretty robust, most of the messy keywords yield acceptable results. However, as you may have observed, the process is error-prone.
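As a rough illustration of rule-based keyword parsing, the sketch below pulls the phrase after each "visited" out of the example sentence. The regular expression and the name `extract_location_keywords` are hypothetical, not the project's parser, and the geocoding step is left out entirely.

```python
import re

# Hypothetical rule: capture the text after "visited" up to a comma,
# "then", "before", a period, or the end of the sentence.
VISIT_PATTERN = re.compile(
    r"visited\s+(.+?)(?=,|\s+then\b|\s+before\b|\.|$)",
    re.IGNORECASE,
)


def extract_location_keywords(sentence: str) -> list[str]:
    """Return the raw location phrases that follow each occurrence of "visited"."""
    return [phrase.strip() for phrase in VISIT_PATTERN.findall(sentence)]


sentence = (
    "Patient 1000 visited McDonalds near Gangnam station exit 3, "
    "then visited a nearby barber shop before going back to his home by 3pm."
)
print(extract_location_keywords(sentence))
# ['McDonalds near Gangnam station exit 3', 'a nearby barber shop']
```

A single rule like this already shows the fragility: a phrase such as "a nearby barber shop" names no specific place, and a sentence that says "went to" instead of "visited" yields nothing, so whatever a map search returns for such keywords can easily be a street or a crossing.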

  3. Is it possible that some locations are labeled incorrectly? For example, coordinates "37.4562557, 126.7052062" are labeled as "airport", "etc", "restaurant", and "public_transportation". However, checking with Kakao Map or Naver Map, that location seems to be a "City Hall" (and it is far from any airport).

Definitely.
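One way to surface such cases is to group rows by coordinate and list the coordinates that carry more than one label. This is a minimal sketch, assuming rows shaped like `csv.DictReader` output with `latitude`, `longitude`, and `type` columns. The column names are an assumption, and the sample rows are made up for illustration, reusing coordinates mentioned in this thread.

```python
from collections import defaultdict


def find_conflicting_labels(rows):
    """Map each (latitude, longitude) pair to its labels; keep pairs with more than one."""
    labels_by_coord = defaultdict(set)
    for row in rows:
        coord = (row["latitude"], row["longitude"])
        labels_by_coord[coord].add(row["type"])
    return {
        coord: labels
        for coord, labels in labels_by_coord.items()
        if len(labels) > 1
    }


# Illustrative rows only; real rows would come from csv.DictReader on the route file.
rows = [
    {"latitude": "37.4562557", "longitude": "126.7052062", "type": "airport"},
    {"latitude": "37.4562557", "longitude": "126.7052062", "type": "restaurant"},
    {"latitude": "37.818481", "longitude": "128.857753", "type": "hospital"},
]

for coord, labels in find_conflicting_labels(rows).items():
    print(coord, sorted(labels))
```

Coordinates are compared as exact strings here, so two geocoding results for the same building that differ in the last decimal place would not be grouped together.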

  4. There is at least one patient (3001000003) in the PatientRoute data set who seems to visit different hospitals over one week. However, when checking the coordinates on map services, only one of those locations is a hospital (37.818481, 128.857753), while the other locations are far from hospitals.

I'll check patient 3001000003 and get back to you.

  5. Why are there only 1472 patients in the PatientRoute data set, while there are 4004 patients in the PatientInfo one? These counts come from the latest update.

This is due to a shortage of manpower and limited data availability.
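For anyone who wants to reproduce the coverage gap, a small sketch along these lines compares the patient IDs in the two tables. The function name `route_coverage` is hypothetical and the IDs are toy values, not taken from the data set.

```python
def route_coverage(info_ids, route_ids):
    """Return (patients with a route, total patients, sorted IDs lacking a route)."""
    info, route = set(info_ids), set(route_ids)
    return len(info & route), len(info), sorted(info - route)


# Toy IDs for illustration only; real IDs would be read from the two CSV files.
info_ids = ["1000000001", "1000000002", "1000000003"]
route_ids = ["1000000001", "1000000001", "1000000003"]  # one patient, several stops

covered, total, missing = route_coverage(info_ids, route_ids)
print(f"{covered} of {total} patients have route data; missing: {missing}")
# 2 of 3 patients have route data; missing: ['1000000002']
```

Using sets matters because PatientRoute has one row per visited location, so counting rows instead of distinct IDs would overstate coverage.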

That said, we're hiring AMTs to go through the data, catch errors, and generate data (instead of relying on the old, error-prone rule-based method). We're hoping to fix the majority of the errors you've pointed out.

Your observations and insights are very keen and extremely helpful. We'd love to hear more feedback from you, if possible. Next time, you can email me directly, since I am more responsive to email :)

ThisIsIsaac commented 4 years ago

Closing due to inactivity. Reopen if you have further questions.