P7 Build the final pipeline used by the user to query doctors

Below is a description of how the final pipeline is constructed, i.e. how the separate functions, scripts, tools and data are combined to get from a user query to a suggested list of doctors, each with hospital and paper rank, within the user-defined range.

User inputs query: The user loads the library "doctor_finder", which includes all functions needed for the final pipeline. The function "find_doctor()" asks the user to define: 1.1 a problem, 1.2 a location, 1.3 a range in km.
Mapping query to needed specialization: The problem defined in 1.1 is passed to the function "spec_search()", which measures the word similarity between the query and the corpus of each of the 90 predefined specializations. The corpus is built up through 3 layers of Wikipedia content. Layer 1 consists of the full, lemmatized Wikipedia page of the specialization. Layer 2 consists of the first section of the Wikipedia page of each hyperlinked word in layer 1. Layer 3 consists of the hyperlinked words of each Wikipedia page of a hyperlinked word in layer 2. Each layer is written to its own index, which is saved as a pickle file and loaded when "spec_search()" is called. The specialization with the highest similarity is chosen as the specialization the user needs.
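The matching step can be sketched as follows. The issue does not say which similarity measure "spec_search()" uses, so this sketch assumes a plain bag-of-words cosine similarity; the function name `spec_search_sketch`, the `corpora` dictionary and the toy texts are hypothetical and only stand in for the pickled Wikipedia indexes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def spec_search_sketch(query, corpora):
    """Return (specialization, score) for the corpus most similar to the query.

    `corpora` maps a specialization name to its corpus text (assumed format).
    """
    query_vec = Counter(tokenize(query))
    scores = {
        name: cosine_similarity(query_vec, Counter(tokenize(text)))
        for name, text in corpora.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy corpora standing in for the three Wikipedia layers (hypothetical data).
toy_corpora = {
    "cardiology": "heart artery blood pressure heart attack chest pain",
    "dermatology": "skin rash acne eczema mole skin",
}
best, score = spec_search_sketch("pain in my chest and heart", toy_corpora)
print(best, round(score, 3))
```

In the real pipeline the query would presumably be lemmatized the same way as the corpus before comparison, and the three layers could be weighted differently; neither detail is specified here.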
Mapping needed specialization to doctors & hospitals (+lon & lat): Doctors and their hospitals are acquired by crawling metadata websites and hospital websites, which are found through Google Places and crowdsourcing. Doctor names are extracted with named entity recognition and predefined titles such as "md." or "doctor". By looking at the surrounding text (2 sentences before and after) and matching it to specializations (with Levenshtein distance), we obtain the doctor + specialization + hospital. When the specialization is unknown, the content of papers the doctor wrote is passed to "spec_search()" to find a specialization for the doctor. Hospital names from metadata websites are mapped to hospitals from Google Places (including longitude and latitude) by Levenshtein distance. The doctors whose specialization corresponds to the specialization the user needs are then selected.
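A minimal sketch of the Levenshtein-based name mapping, using only the standard library. The helper names `levenshtein` and `closest_match` and the hospital names are hypothetical; the issue does not state which Levenshtein implementation or which normalization the project uses.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def closest_match(name, candidates):
    """Return the candidate with the smallest edit distance to `name`.

    Both sides are lowercased and stripped first (an assumed normalization).
    """
    normalized = name.lower().strip()
    return min(candidates, key=lambda c: levenshtein(normalized, c.lower().strip()))


# Hypothetical hospital names: one crawled spelling, two Google Places entries.
places_names = ["St. Mary's Hospital", "General City Hospital"]
match = closest_match("St. Marys Hospital", places_names)
print(match)
```

Because `closest_match` always returns some candidate, a real mapping would likely also need a maximum allowed distance so that unrelated names are left unmatched; that threshold is not given in the issue.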
Filtering doctors for location and ranking on #papers written: For each doctor in the list with the right specialization, we calculate the geographical distance from the user-defined location to the hospital where the doctor works. The distance is calculated by using GeoPy to get the lon & lat of the user location and applying the Haversine formula to these and the hospital's lon & lat. For all doctors, we counted the number of papers they wrote by using the PubMed API (Biopython). To increase the probability of finding the right doctor, we search papers for the author name (surname + initials) AND affiliation city. We then rank the doctors on paper count and go through the selection (with the right specialization) from the top down to find doctors within the predefined range. The 5 doctors with the highest paper count, within the range and with the right specialization, are shown to the user.
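The distance filter and ranking could look roughly like this. The Haversine formula itself is standard; the record layout (`name`, `lat`, `lon`, `paper_count`), the function `top_doctors` and the sample doctors are assumptions for illustration. The user coordinates are passed in directly here, whereas the pipeline obtains them through GeoPy.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def top_doctors(doctors, user_lat, user_lon, range_km, limit=5):
    """Keep doctors whose hospital is within range_km, ranked by paper count."""
    in_range = [
        d for d in doctors
        if haversine_km(user_lat, user_lon, d["lat"], d["lon"]) <= range_km
    ]
    return sorted(in_range, key=lambda d: d["paper_count"], reverse=True)[:limit]


# Hypothetical doctors with approximate city coordinates for their hospitals.
doctors = [
    {"name": "Doctor A", "lat": 52.37, "lon": 4.90, "paper_count": 12},  # Amsterdam
    {"name": "Doctor B", "lat": 52.09, "lon": 5.12, "paper_count": 30},  # Utrecht
    {"name": "Doctor C", "lat": 50.85, "lon": 5.69, "paper_count": 50},  # Maastricht
]
result = top_doctors(doctors, user_lat=52.37, user_lon=4.90, range_km=50)
print([d["name"] for d in result])
```

Filtering first and sorting afterwards gives the same top 5 as ranking first and walking the list from the top down, as described above; it only avoids sorting doctors that are out of range.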

drennings / CareFinder

P7 build the final pipeline used by user to query doctors #29