Open asittampalam opened 7 years ago
To do this, an expert (medical doctor, coder, ...) will label top-level domains as scientific / pseudo-scientific / trivial.
Use these two sets to create a translation service from professional scientific texts to non-scientific texts that patients can easily understand.
Labeled as scientific (to be updated): springer.com pharmazeutische-zeitung.de springermedizin.at med2click.de pathologie-online.de clinicum.at
Labeled as non-scientific (to be updated): netdoktor.de diabetes-ratgeber.net planet-wissen.de focus.de spektrum.de gesundheitsinformation.de medizin-transparent.at haut-ratgeber.ch
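A minimal sketch of how the domain labels could be applied to crawled documents, assuming each document comes with its source URL. The function name `label_for_url` is hypothetical, and only the two lists given above are covered (no "pseudo-scientific" or "trivial" domains are listed yet).

```python
from urllib.parse import urlparse

# Domain lists copied from the issue text ("to be updated").
SCIENTIFIC = {
    "springer.com", "pharmazeutische-zeitung.de", "springermedizin.at",
    "med2click.de", "pathologie-online.de", "clinicum.at",
}
NON_SCIENTIFIC = {
    "netdoktor.de", "diabetes-ratgeber.net", "planet-wissen.de", "focus.de",
    "spektrum.de", "gesundheitsinformation.de", "medizin-transparent.at",
    "haut-ratgeber.ch",
}


def label_for_url(url):
    """Return 'scientific', 'non-scientific', or None for an unlabeled domain.

    The URL must include a scheme (e.g. https://), otherwise urlparse
    finds no hostname and the result is None.
    """
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host first, then each parent domain, so that
    # link.springer.com is matched by the entry springer.com.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in SCIENTIFIC:
            return "scientific"
        if candidate in NON_SCIENTIFIC:
            return "non-scientific"
    return None
```

Documents for which this returns `None` would form the unlabeled pool.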
Maybe we could use something like https://link.springer.com/article/10.1023%2FA%3A1007692713085?LI=true (Text Classification from Labeled and Unlabeled Documents using EM; I haven't read it yet): start with a small labeled set (e.g. part of "scientific") and use a large unlabeled set (e.g. "scientific" + "non-scientific") as leverage to learn a stable "scientific"/"non-scientific" classifier.
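To make the idea concrete, here is a rough sketch of the general technique (a multinomial naive Bayes classifier trained with EM on labeled plus unlabeled documents). It is not a faithful reimplementation of the linked paper: all function names are hypothetical, the toy tokens are made up for illustration, and unlabeled documents get the same weight as labeled ones.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab, n_classes=2, alpha=1.0):
    """M-step: fit naive Bayes from soft per-document class weights."""
    class_mass = [alpha] * n_classes
    word_counts = [Counter() for _ in range(n_classes)]
    for doc, w in zip(docs, weights):
        for c in range(n_classes):
            class_mass[c] += w[c]
            for tok in doc:
                word_counts[c][tok] += w[c]
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_like = []
    for c in range(n_classes):
        # Laplace smoothing over the shared vocabulary.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like.append(
            {t: math.log((word_counts[c][t] + alpha) / denom) for t in vocab}
        )
    return log_prior, log_like


def posterior(doc, model):
    """E-step for one document: P(class | doc) under the current model."""
    log_prior, log_like = model
    scores = [
        lp + sum(ll[t] for t in doc if t in ll)  # unknown tokens are skipped
        for lp, ll in zip(log_prior, log_like)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_naive_bayes(labeled, unlabeled, n_iter=10):
    """labeled: list of (tokens, class_index); unlabeled: list of token lists."""
    l_docs = [d for d, _ in labeled]
    l_weights = [[1.0 if c == y else 0.0 for c in range(2)] for _, y in labeled]
    vocab = {t for d in l_docs + unlabeled for t in d}
    # Start from the labeled documents only.
    model = train_nb(l_docs, l_weights, vocab)
    for _ in range(n_iter):
        # E-step: soft labels for the unlabeled documents.
        u_weights = [posterior(d, model) for d in unlabeled]
        # M-step: refit on labeled (hard) plus unlabeled (soft) documents.
        model = train_nb(l_docs + unlabeled, l_weights + u_weights, vocab)
    return model


# Toy data, invented for illustration: class 0 = "scientific", 1 = "non-scientific".
labeled = [
    (["randomized", "trial", "cohort"], 0),
    (["tipps", "hausmittel"], 1),
]
unlabeled = [
    ["cohort", "placebo", "trial"],
    ["hausmittel", "ratgeber"],
    ["placebo", "randomized"],
    ["ratgeber", "tipps"],
]
model = em_naive_bayes(labeled, unlabeled)
print(posterior(["placebo"], model))
print(posterior(["ratgeber"], model))
```

The tokens "placebo" and "ratgeber" never occur in a labeled document, so any class preference the model shows for them comes from the unlabeled pool. In practice one would probably use an existing library and down-weight the unlabeled documents.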
Create a data set with pairs of synonyms, one being scientific, the other being non-scientific:
Maybe test this out in https://github.com/eonum/medword
In a second step we could extract the "scientific" medical documents from our positive set.