P1 Data cleaning: mapping scraped specializations to our set of specializations #28

Here I describe a quick processing step that maps the specializations scraped in P1 to the ones we identified. The first step is to split the specializations on commas, since the comma is the separator in the file when a doctor has more than one. We then further split any entry that contains "\" as follows: "Child\Adult Dentistry" -> "Child Dentistry", "Adult Dentistry". The goal is to turn each doctor row that lists all of their specializations into multiple rows for the same doctor, each with a single specialization, so that we can easily query the system later.
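The splitting step could be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function names, the `specializations` input column and the `specialization` output column are hypothetical, rows are plain dicts rather than a dataframe, and the rule that the shared suffix is everything after the first word of the last alternative is an assumption inferred from the single "Child\Adult Dentistry" example.

```python
def split_specializations(raw: str) -> list[str]:
    """Split a raw scraped string into single specializations."""
    result = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue
        if "\\" in part:
            # Assumption: the alternatives share the words that follow the
            # first word of the last alternative ("Adult Dentistry" -> "Dentistry").
            *heads, last = [p.strip() for p in part.split("\\")]
            _, _, tail = last.partition(" ")
            result.extend(f"{head} {tail}".strip() for head in heads)
            result.append(last)
        else:
            result.append(part)
    return result


def explode_rows(rows: list[dict], column: str = "specializations") -> list[dict]:
    """Turn one row per doctor into one row per (doctor, specialization)."""
    exploded = []
    for row in rows:
        for spec in split_specializations(row[column]):
            new_row = {k: v for k, v in row.items() if k != column}
            new_row["specialization"] = spec
            exploded.append(new_row)
    return exploded
```

For example, a row whose `specializations` value is `"Child\Adult Dentistry, Cardiology"` would become three rows for the same doctor.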

Then, for each specialization, we check whether it directly matches one of our specializations; if it does, we add that row to the final dataset. If there is no direct match, we use the search engine built in #23 to get a probability distribution over our specializations. We add a row for the most probable specialization, and we also add a row for any other specialization with a probability of at least 25%. Such a runner-up is usually a very close specialization, most often a parent specialization; we do not go down to more specific ones because they tend to be less probable.
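The selection rule above might look roughly like this. It is a sketch under stated assumptions: the search engine from #23 is represented by a hypothetical `search` callable that returns a mapping from specialization to probability (its real interface is not described here), `map_specialization` is an illustrative name, the direct match is taken to be an exact string comparison, and returning nothing for an empty distribution is a guess.

```python
from typing import Callable, Mapping

THRESHOLD = 0.25  # runner-up specializations at or above this are kept too


def map_specialization(
    scraped: str,
    known: set[str],
    search: Callable[[str], Mapping[str, float]],  # hypothetical stand-in for #23
    threshold: float = THRESHOLD,
) -> list[str]:
    """Return the specializations from `known` that `scraped` maps to."""
    # Direct match: keep the specialization as it is.
    if scraped in known:
        return [scraped]

    # No direct match: ask the search engine for a probability distribution.
    distribution = search(scraped)
    if not distribution:
        return []  # assumption: nothing to add when the search returns nothing

    # Always keep the most probable specialization...
    best = max(distribution, key=distribution.get)
    # ...and any other one that reaches the threshold.
    extras = [
        spec for spec, prob in distribution.items()
        if spec != best and prob >= threshold
    ]
    return [best] + extras
```

Each returned specialization then becomes its own row for the doctor, so a distribution such as 0.5 / 0.3 / 0.2 yields two rows and drops the third candidate.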

The output file contains a dataframe in which each row holds a single doctor with a single specialization, along with all other input attributes.
