Categorize "scientific"/"non scientific" medical documents

eonum / medtextcollector

Scripts for the collection of online medical texts and definitions

MIT License


Categorize "scientific"/"non scientific" medical documents #4

Open asittampalam opened 7 years ago

asittampalam commented 7 years ago

In a second step we could extract the "scientific" medical documents from our positive set.

tschimbr commented 7 years ago

To do this, we will have an expert (medical doctor, coder, ...) label top-level domains as scientific / pseudo-scientific / trivial.

tschimbr commented 7 years ago

Use these two sets to create a translation service from professional scientific texts to non-scientific texts that patients can easily understand.

asittampalam commented 6 years ago

Labeled as scientific (to be updated): springer.com, pharmazeutische-zeitung.de, springermedizin.at, med2click.de, pathologie-online.de, clinicum.at

Labeled as non-scientific (to be updated): netdoktor.de, diabetes-ratgeber.net, planet-wissen.de, focus.de, spektrum.de, gesundheitsinformation.de, medizin-transparent.at, haut-ratgeber.ch
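
A minimal sketch of how the two domain lists above could be applied to collected documents, assuming each document's source URL (with scheme) is known. The function name `label_for_url` and the rule of stripping leading subdomains are assumptions made for illustration, not something that exists in medtextcollector.

```python
from urllib.parse import urlparse

# Domain lists copied from this comment (both are marked "to be updated").
SCIENTIFIC = {
    "springer.com", "pharmazeutische-zeitung.de", "springermedizin.at",
    "med2click.de", "pathologie-online.de", "clinicum.at",
}
NON_SCIENTIFIC = {
    "netdoktor.de", "diabetes-ratgeber.net", "planet-wissen.de", "focus.de",
    "spektrum.de", "gesundheitsinformation.de", "medizin-transparent.at",
    "haut-ratgeber.ch",
}


def label_for_url(url):
    """Return 'scientific', 'non-scientific', or None for an unlabeled domain.

    Hypothetical helper: expects a full URL including the scheme.
    """
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host first, then drop leading subdomains one at a time,
    # so that e.g. link.springer.com matches springer.com.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in SCIENTIFIC:
            return "scientific"
        if candidate in NON_SCIENTIFIC:
            return "non-scientific"
    return None
```

Documents from domains in neither list come back as `None` and could form the unlabeled set.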

asittampalam commented 6 years ago

Maybe we could use something like https://link.springer.com/article/10.1023%2FA%3A1007692713085?LI=true (Text Classification from Labeled and Unlabeled Documents using EM; I haven't read it yet). The idea would be to start with a small labeled set (e.g. part of "scientific") and use a large unlabeled set (e.g. "scientific" + "non-scientific") as leverage to learn a stable "scientific"/"non-scientific" classifier.
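
A rough sketch of the general idea (Naive Bayes combined with EM over labeled and unlabeled documents), not the paper's exact procedure. It assumes a few labeled documents exist for both classes, that documents are already tokenized, and that label 1 means "scientific". All function names and the toy tokens are made up for illustration.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab, alpha=1.0):
    """M-step: fit a two-class multinomial Naive Bayes model.

    docs: list of token lists; weights: per-document P(class = 1).
    Each document contributes fractionally to both classes.
    """
    prior = [0.0, 0.0]
    counts = [Counter(), Counter()]
    for tokens, w in zip(docs, weights):
        for c, p in ((0, 1.0 - w), (1, w)):
            prior[c] += p
            for t in tokens:
                counts[c][t] += p
    total = sum(prior)
    log_prior = [math.log((prior[c] + alpha) / (total + 2 * alpha)) for c in (0, 1)]
    log_lik = []
    for c in (0, 1):
        # Laplace smoothing over the shared vocabulary.
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_lik.append({t: math.log((counts[c][t] + alpha) / denom) for t in vocab})
    return log_prior, log_lik


def posterior(tokens, model):
    """E-step for one document: P(class = 1 | tokens). Unknown tokens are skipped."""
    log_prior, log_lik = model
    scores = [
        log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
        for c in (0, 1)
    ]
    m = max(scores)  # subtract the max for numerical stability
    e0, e1 = math.exp(scores[0] - m), math.exp(scores[1] - m)
    return e1 / (e0 + e1)


def em_nb(labeled, labels, unlabeled, iterations=10):
    """Train on the labeled docs, then alternate E- and M-steps over all docs."""
    vocab = {t for doc in labeled + unlabeled for t in doc}
    hard = [float(y) for y in labels]
    model = train_nb(labeled, hard, vocab)
    for _ in range(iterations):
        soft = [posterior(doc, model) for doc in unlabeled]
        model = train_nb(labeled + unlabeled, hard + soft, vocab)
    return model


# Toy data (invented tokens, for illustration only).
labeled = [["myocardial", "infarction"], ["heart", "attack"]]
labels = [1, 0]  # 1 = scientific, 0 = non-scientific
unlabeled = [["myocardial", "infarction", "etiology"], ["heart", "attack", "advice"]]
model = em_nb(labeled, labels, unlabeled)
```

In this toy setup "etiology" never appears in a labeled document, so a classifier trained only on the labeled set cannot place it. After EM, `posterior(["etiology"], model)` leans towards "scientific" because the word co-occurs with labeled scientific vocabulary in the unlabeled set.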

tschimbr commented 5 years ago

Create a data set of synonym pairs, each with one scientific and one non-scientific term, then:

calculate the vector embedding difference between the two synonyms of each pair
check how much these differences vary across pairs
average the differences
translate new scientific words to non-scientific words using the averaged difference
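
The steps above could be sketched roughly as follows. This assumes word embeddings are available as a plain dict from word to vector. The helper names, the word pairs, and the 3-dimensional toy vectors are all invented for illustration; real embeddings (e.g. from medword) would replace them.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def average_offset(pairs, emb):
    """Return the mean (non-scientific minus scientific) difference and all differences."""
    diffs = [sub(emb[lay], emb[sci]) for sci, lay in pairs]
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    return mean, diffs


def translate(word, offset, emb):
    """Shift a scientific word by the averaged offset and return the nearest other word."""
    target = add(emb[word], offset)
    candidates = [w for w in emb if w != word]
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy embeddings (made-up numbers, not real vectors).
emb = {
    "hypertension": [1.0, 0.0, 1.0],
    "high_blood_pressure": [1.0, 0.1, 0.0],
    "pyrexia": [0.0, 1.0, 1.0],
    "fever": [0.1, 1.0, 0.0],
    "cephalalgia": [1.0, 1.0, 1.0],
    "headache": [1.0, 1.0, 0.0],
}
pairs = [("hypertension", "high_blood_pressure"), ("pyrexia", "fever")]

offset, diffs = average_offset(pairs, emb)
# "How different are the differences?": cosine of each difference to the mean.
consistency = [cosine(d, offset) for d in diffs]
suggestion = translate("cephalalgia", offset, emb)
```

If the values in `consistency` are far below 1 on real data, a single averaged offset is probably too crude and the translation step would need a different approach.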

Maybe test out in https://github.com/eonum/medword