PlanTL-GOB-ES / lm-biomedical-clinical-es

Official source for Spanish pretrained biomedical and clinical language models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
Apache License 2.0

Fine-tune it for multiclass or multilabel text classification #5

Open lpschaub opened 10 months ago

lpschaub commented 10 months ago

I have medical reports, and I am trying to predict the disease associated with each report.

1) Both the medical reports and the disease labels to predict are written by humans, so they contain mistakes and inconsistent label names (the same disease is written in different ways, and vice versa).
2) Should I use RobertaForSequenceClassification or AutoModelForSequenceClassification?
3) What is the best way to handle imperfect labels? Should I embed them in the same RoBERTa tokenized space and predict the mean of the vector, or predict the whole vector (which then becomes a multilabel task)?

Best
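One possible starting point for the inconsistent label names in point 1 is to normalize and merge near-duplicate labels before fine-tuning. The sketch below is only an illustration under stated assumptions: `normalize_label`, `build_label_map`, `load_classifier`, the example labels, and the 0.85 similarity cutoff are all hypothetical and are not something this repository provides. `load_classifier` assumes the third-party `transformers` package is installed and that `model_name` points to a RoBERTa checkpoint of your choice. For such a checkpoint, `AutoModelForSequenceClassification` resolves to `RobertaForSequenceClassification`, so either class should give the same model.

```python
import difflib
import unicodedata


def normalize_label(raw: str) -> str:
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(text.split())


def build_label_map(raw_labels, cutoff=0.85):
    """Map each raw label to a canonical one by greedy fuzzy matching.

    The cutoff is an assumed value; tune it and review the merges by hand.
    """
    canonical = []  # canonical label names, in order of first appearance
    mapping = {}    # raw label -> canonical label
    for raw in raw_labels:
        norm = normalize_label(raw)
        match = difflib.get_close_matches(norm, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[raw] = match[0]
        else:
            canonical.append(norm)
            mapping[raw] = norm
    return mapping, canonical


def load_classifier(model_name, canonical, multilabel=False):
    """Hypothetical helper: build a classification head over the merged labels."""
    # Imported lazily so the label clean-up above works without transformers.
    from transformers import AutoModelForSequenceClassification

    id2label = dict(enumerate(canonical))
    label2id = {name: idx for idx, name in id2label.items()}
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(canonical),
        id2label=id2label,
        label2id=label2id,
        problem_type=(
            "multi_label_classification" if multilabel
            else "single_label_classification"
        ),
    )


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    labels = ["Neumonía", "neumonia", "Diabetes melitus tipo 2",
              "Diabetes mellitus tipo 2"]
    mapping, canonical = build_label_map(labels)
    for raw, merged in mapping.items():
        print(f"{raw!r} -> {merged!r}")
```

Character-level fuzzy matching can also merge labels that are genuinely different (for example, two disease subtypes whose names differ by a single character), so the resulting mapping should be checked manually before training. With one disease per report, `multilabel=False` keeps this a multiclass problem; `multilabel=True` switches the loss to the multilabel setting and expects multi-hot float targets.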