drennings / CareFinder

Project for the course IN4325 Information Retrieval (Web Information Systems @ Delft University of Technology), assignment by myTomorrows

P1: Hospital and Doctor Collection - Scrape doctor names and their specializations given a hospital URL #26

Open drennings opened 7 years ago

drennings commented 7 years ago

Use e.g. Scrapy and the Named Entity Recognition Tool to detect names. Please note that your implementation should be extensible, so that it can later also scrape the text around doctor names and use that data to determine their specialization.

Also write accompanying documentation.

mjuchli commented 7 years ago

As it turned out, according to @Krymnos, finding doctors on websites is a hard task. Various difficulties, such as content loaded dynamically with JavaScript, resulted in an unsatisfactory ratio of doctor names found per URL.

Therefore, we decided to try another approach. Northwell Health (https://www.northwell.edu/find-care/find-a-doctor) provides a listing of doctors that can be searched by location. For the search term "New York", 6,523 doctors are listed. Every doctor's detail page follows the same site structure, so we were able to use XPath to retrieve the required content: doctor name, specialty and residencies of the doctor. The task (https://github.com/drennings/CareFinder/blob/master/P1/northwell.py) involved first working through the pagination of the doctor listing and then crawling the specific content from every doctor's detail page. As a result, we are able to collect all of the six thousand plus doctors from New York.
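To illustrate the two steps (follow the pagination, then extract fields from each detail page with XPath), here is a minimal sketch. It is not the code in northwell.py: the page markup, the class names (`doctor`, `next`, `name`, `specialty`, `residency`), the URLs and the `fetch` helper are all hypothetical stand-ins, and it uses the limited XPath subset of the standard library's `xml.etree.ElementTree` on well-formed markup. Against a real site you would likely need a proper HTML parser such as Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-ins for the listing and detail pages.
PAGES = {
    "/doctors?page=1": """<html><body>
        <a class="doctor" href="/doctor/1">Jane Doe</a>
        <a class="doctor" href="/doctor/2">John Roe</a>
        <a class="next" href="/doctors?page=2">Next</a>
    </body></html>""",
    "/doctors?page=2": """<html><body>
        <a class="doctor" href="/doctor/3">Ann Poe</a>
    </body></html>""",
    "/doctor/1": """<html><body>
        <h1 class="name">Jane Doe</h1>
        <span class="specialty">Cardiology</span>
        <ul><li class="residency">Example Hospital A</li></ul>
    </body></html>""",
    "/doctor/2": """<html><body>
        <h1 class="name">John Roe</h1>
        <span class="specialty">Neurology</span>
        <ul><li class="residency">Example Hospital B</li></ul>
    </body></html>""",
    "/doctor/3": """<html><body>
        <h1 class="name">Ann Poe</h1>
        <span class="specialty">Pediatrics</span>
        <ul><li class="residency">Example Hospital A</li>
            <li class="residency">Example Hospital C</li></ul>
    </body></html>""",
}


def fetch(url):
    """Hypothetical stand-in for an HTTP request; returns the parsed page."""
    return ET.fromstring(PAGES[url].strip())


def iter_detail_urls(start_url):
    """Walk the paginated listing and yield every doctor detail URL."""
    url = start_url
    while url is not None:
        page = fetch(url)
        for link in page.findall(".//a[@class='doctor']"):
            yield link.get("href")
        # Follow the "next" link until the last page has none.
        next_link = page.find(".//a[@class='next']")
        url = next_link.get("href") if next_link is not None else None


def scrape_doctor(url):
    """Extract name, specialty and residencies from one detail page."""
    page = fetch(url)
    return {
        "name": page.findtext(".//h1[@class='name']"),
        "specialty": page.findtext(".//span[@class='specialty']"),
        "residencies": [li.text for li in page.findall(".//li[@class='residency']")],
    }


doctors = [scrape_doctor(url) for url in iter_detail_urls("/doctors?page=1")]
```

Separating the pagination generator from the per-page extraction keeps each part easy to swap out, for example to add the surrounding text as an extra field later.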

mjuchli commented 7 years ago

We were still missing the locations of the doctors (we only had the hospital name), and there are several approaches to retrieve latitude and longitude. If the address is known, we can do a Google Places API call (e.g. using geopy). The same could be done when only the hospital name is known, of course, but in practice we reach the API limit pretty quickly with classical place search lookups. Therefore I came up with an approach that reuses our previously collected data from Foursquare and Google Places (see #24): given those hospital names and their locations, we can build up an "index" onto which we try to map other hospital names whose location is unknown. Specifically, the similarity can be measured with Levenshtein or Ratcliff/Obershelp, and the most similar known hospital (highest ratio) is taken as the mapped hospital.
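A minimal sketch of this name-to-index mapping, under some assumptions: it uses `difflib.SequenceMatcher` from the Python standard library, which the Python documentation describes as based on the Ratcliff/Obershelp "gestalt pattern matching" approach. The hospital names, the coordinates, the function names and the `threshold` cutoff are hypothetical and only for illustration; a Levenshtein-based ratio from a third-party package could be swapped in for `similarity`.

```python
from difflib import SequenceMatcher

# Hypothetical index: known hospital name -> (latitude, longitude).
# Names and coordinates are placeholders, not real collected data.
HOSPITAL_INDEX = {
    "Example General Hospital": (40.0, -74.0),
    "Sample Children's Clinic": (41.0, -73.0),
}


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_to_known_hospital(name, index, threshold=0.8):
    """Return (known_name, location, ratio) for the most similar known
    hospital, or None if no candidate reaches the threshold.

    The threshold is an assumption added here to avoid mapping
    unrelated names; the comment above only asks for the highest ratio.
    """
    best_name, best_ratio = None, 0.0
    for known_name in index:
        ratio = similarity(name, known_name)
        if ratio > best_ratio:
            best_name, best_ratio = known_name, ratio
    if best_name is None or best_ratio < threshold:
        return None
    return best_name, index[best_name], best_ratio


match = map_to_known_hospital("Example Genral Hospital", HOSPITAL_INDEX)
no_match = map_to_known_hospital("Riverside Dental Practice", HOSPITAL_INDEX)
```

Here `match` resolves the misspelled name to "Example General Hospital" and its stored location, while `no_match` is `None` because no known name is similar enough. The linear scan is fine for a small index; a larger one would probably call for blocking or a dedicated fuzzy-matching library.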