drennings / CareFinder

Project for the course IN4325 Information Retrieval (Web Information Systems @ Delft University of Technology), assignment by myTomorrows

P1 Data cleaning: mapping scraped specializations to our set of specializations #28

Open MatGar0 opened 7 years ago

MatGar0 commented 7 years ago

Describe how the result data of P1 was processed to map the specializations scraped for doctors to our set of specializations from #21.

MatGar0 commented 7 years ago

Here I describe a quick processing step that we apply to map the specializations scraped in P1 to the ones that we identified. The first step is splitting the specializations on commas, since the comma is the separator in the file when a doctor has multiple specializations. Then we further split any specialization that contains "\" as follows: "Child\Adult Dentistry" -> "Child Dentistry", "Adult Dentistry". Note that we want to turn each doctor row that lists all of the doctor's specializations into multiple rows for the same doctor, each with a single specialization, so that we can easily query the system later.
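The splitting step could look roughly like the sketch below. It is plain Python rather than our actual script, and the names `expand_backslash`, `explode_doctor` and the column name `specializations` are made up for illustration. It also assumes that the alternatives around "\" are single words sharing the trailing words as a suffix, which is what the "Child\Adult Dentistry" example suggests.

```python
from typing import Dict, List


def expand_backslash(spec: str) -> List[str]:
    """Expand 'Child\\Adult Dentistry' into one entry per alternative.

    Assumption: every alternative is a single word, and the words after
    the last alternative form a suffix shared by all of them.
    """
    if "\\" not in spec:
        return [spec.strip()]
    parts = [p.strip() for p in spec.split("\\")]
    # The last part holds the final alternative plus the shared suffix.
    head, _, suffix = parts[-1].partition(" ")
    alternatives = parts[:-1] + [head]
    return [f"{alt} {suffix}".strip() for alt in alternatives]


def explode_doctor(row: Dict[str, str], column: str = "specializations") -> List[Dict[str, str]]:
    """Turn one doctor row into one row per single specialization."""
    other_attributes = {k: v for k, v in row.items() if k != column}
    rows = []
    for chunk in row[column].split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty entries such as a trailing comma
        for spec in expand_backslash(chunk):
            rows.append({**other_attributes, "specialization": spec})
    return rows


doctor = {"name": "Dr. A", "specializations": "Cardiology, Child\\Adult Dentistry"}
exploded = explode_doctor(doctor)
```

For the example doctor this gives three rows, one each for "Cardiology", "Child Dentistry" and "Adult Dentistry", all keeping the other attributes of the original row.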

Then, for each specialization, we check whether there is a direct match to any of our specializations; if so, we add that row to the final dataset. If there is no direct match, we use the search engine built in #23 to get a probability distribution over our specializations. We add a row for the most probable specialization; in addition, if any other specialization has a probability of at least 25%, we add it as well (this indicates a very close specialization, mostly a parent specialization; we don't get more specific because the more specific ones are less probable).
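A minimal sketch of this matching rule is below. The search engine from #23 is represented by a hypothetical callable `search` that returns a dict from specialization name to probability; its real interface may differ. The function name `map_specialization`, the constant `THRESHOLD`, the exact (case-sensitive) string comparison for a "direct match", and the example numbers are also assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set

THRESHOLD = 0.25  # keep any additional specialization with probability >= 25%


def map_specialization(
    scraped: str,
    known: Set[str],
    search: Callable[[str], Dict[str, float]],
) -> List[str]:
    """Map one scraped specialization to one or more of our specializations."""
    # Direct match: assumed here to be an exact string comparison.
    if scraped in known:
        return [scraped]
    # No direct match: ask the (hypothetical) search engine for a distribution.
    distribution = search(scraped)
    if not distribution:
        return []
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    best = ranked[0][0]
    # The most probable one is always kept; others only above the threshold.
    extras = [name for name, prob in ranked[1:] if prob >= THRESHOLD]
    return [best] + extras


def fake_search(query: str) -> Dict[str, float]:
    # Stand-in for the engine from #23, with made-up probabilities.
    return {"Dentistry": 0.5, "Pediatric Dentistry": 0.3, "Orthodontics": 0.2}


known_specializations = {"Cardiology", "Dentistry", "Pediatric Dentistry", "Orthodontics"}
direct = map_specialization("Cardiology", known_specializations, fake_search)
fuzzy = map_specialization("Child Dentistry", known_specializations, fake_search)
```

With the stand-in engine, "Cardiology" maps to itself, while "Child Dentistry" maps to "Dentistry" (most probable) and "Pediatric Dentistry" (0.3, above the threshold); "Orthodontics" at 0.2 is dropped.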

The output file contains the dataframe in which each row holds a single doctor with a single specialization, plus all the other input attributes.