eonum / medtextcollector

Scripts for the collection of online medical texts and definitions
MIT License

web crawler with medical text classifier #3

Closed tschimbr closed 7 years ago

tschimbr commented 7 years ago

Use an existing generic web crawler or implement a new one. Use the probability that a text is medical to decide which links should be followed. This requires a binary medical text classifier, which can be trained on the sample texts obtained in #1.

tschimbr commented 7 years ago

Step 1: Add the initial page to the managed list, which keeps pages ordered by p(page is medical) in descending order. At each step: fetch all pages/texts linked from the first page in the managed list, assign a probability to each of them with the medical text classifier, and add them to the list. Store pages/texts with p(this is a medical text) > 0.8 in persistent storage.
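The steps above could be sketched roughly as follows. This is only an illustration under assumptions: `fetch_page` and `p_medical` are hypothetical callables (a downloader returning `(text, links)` and the classifier's probability function), a dict stands in for the persistent storage, and the first page is removed from the list once its links have been expanded, which the description does not state explicitly.

```python
import heapq

def best_first_crawl(start_url, fetch_page, p_medical, threshold=0.8, max_steps=10):
    """Expand the most 'medical' page first.

    fetch_page(url) -> (text, links) and p_medical(text) -> float in [0, 1]
    are hypothetical helpers supplied by the caller.
    """
    stored = {}    # stand-in for persistent storage: url -> text
    frontier = []  # min-heap of (-p, url, links), so the highest p pops first

    def score_and_add(url):
        text, links = fetch_page(url)
        p = p_medical(text)
        heapq.heappush(frontier, (-p, url, links))
        if p > threshold:
            stored[url] = text

    score_and_add(start_url)
    for _ in range(max_steps):
        if not frontier:
            break
        # Take the first page of the managed list and score everything it links to.
        _, _, links = heapq.heappop(frontier)
        for link in links:
            score_and_add(link)
    return stored

# Toy in-memory "web" and a fake classifier, purely for illustration.
toy_web = {
    "a": ("medical text", ["b", "c"]),
    "b": ("medical medical", ["a"]),
    "c": ("sports", []),
}
stored = best_first_crawl(
    "a",
    fetch_page=lambda url: toy_web[url],
    p_medical=lambda text: 0.9 if "medical" in text else 0.1,
    max_steps=3,
)
```

Without a record of visited links this loop re-fetches pages that link to each other, so `max_steps` is the only thing bounding it here.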

tschimbr commented 7 years ago

Crop the managed list to the top N entries after each step.
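A minimal sketch of the cropping step, assuming the managed list is kept as a heap of `(-p, url)` tuples (that representation and the name `crop_frontier` are assumptions for illustration only):

```python
import heapq

def crop_frontier(frontier, n):
    """Keep only the n entries with the highest probability.

    Entries are (-p, url) tuples, so the smallest tuples are the most
    probable pages. The sorted list returned by nsmallest is itself a
    valid min-heap and can be used with heappush/heappop afterwards.
    """
    return heapq.nsmallest(n, frontier)

frontier = [(-0.2, "x"), (-0.9, "y"), (-0.5, "z")]
frontier = crop_frontier(frontier, 2)
```

Cropping bounds memory, at the cost of permanently dropping low-probability pages that might have led to medical content further on.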

tschimbr commented 7 years ago

But of course maintain a list of visited links that should not be visited again, and a list of hashes of the obtained texts in order to detect duplicate entries.
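One possible way to keep both lists, sketched with in-memory sets. The class name `SeenFilter`, the choice of SHA-256, and the whitespace/case normalization before hashing are all assumptions, not something decided in this issue.

```python
import hashlib

class SeenFilter:
    """Tracks visited URLs and hashes of texts already collected."""

    def __init__(self):
        self.visited_urls = set()
        self.text_hashes = set()

    def is_new_url(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        if url in self.visited_urls:
            return False
        self.visited_urls.add(url)
        return True

    def is_new_text(self, text):
        """Return True if no text with the same normalized content was seen."""
        # Assumption: collapse whitespace and lowercase so trivially
        # reformatted copies of the same text hash identically.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self.text_hashes:
            return False
        self.text_hashes.add(digest)
        return True

seen = SeenFilter()
first_visit = seen.is_new_url("http://example.org/a")
second_visit = seen.is_new_url("http://example.org/a")
first_text = seen.is_new_text("Aspirin  reduces fever.")
same_text = seen.is_new_text("aspirin reduces\nfever.")
```

An exact hash only catches identical texts after normalization; near-duplicates with small edits would still pass.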

asittampalam commented 7 years ago

Partially supervised learning: https://www.cs.uic.edu/~liub/S-EM/unlabelled.pdf

asittampalam commented 7 years ago

nltk.classify.positivenaivebayes
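A small sketch of how that module's `PositiveNaiveBayesClassifier` might be used here: it is trained on positive (medical) examples plus unlabeled texts, which fits the situation where only medical samples are labeled. The sentences, the bag-of-words `features` function, and the 0.5 prior are made-up illustrations, not project data.

```python
from nltk.classify import PositiveNaiveBayesClassifier

# Hypothetical toy data: labeled medical sentences and unlabeled mixed ones.
medical_sentences = [
    "The patient was treated with antibiotics",
    "Symptoms include fever and chronic pain",
    "The tumor was removed by surgery",
]
unlabeled_sentences = [
    "The team won the game last night",
    "The patient recovered after surgery",
    "Stock prices fell sharply on Monday",
    "The weather will be sunny tomorrow",
]

def features(sentence):
    """Very simple bag-of-words features (an assumption for this sketch)."""
    return {"contains(%s)" % word: True for word in sentence.lower().split()}

classifier = PositiveNaiveBayesClassifier.train(
    [features(s) for s in medical_sentences],
    [features(s) for s in unlabeled_sentences],
    positive_prob_prior=0.5,  # assumed share of medical texts among unlabeled ones
)

query = features("The patient has a fever")
is_medical = classifier.classify(query)               # True or False
p_medical = classifier.prob_classify(query).prob(True)  # probability of the positive label
```

The value from `prob_classify(...).prob(True)` is the kind of score the crawler could use to order links, although with so little training data it would not be meaningful.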