equipe22 / pymedext_eds

PyMedExt annotators for the EDS pipeline
Apache License 2.0

Span in NER model error #4

Open patfol opened 3 years ago

patfol commented 3 years ago

The spans returned by the NER model are incorrect: slicing the text with an annotation's span does not yield the annotated mention.

Code to reproduce:


from glob import glob
import pkg_resources

from pymedextcore.document import Document
from pymedext_eds.annotators import Endlines, SentenceTokenizer, SectionSplitter
from pymedext_eds.utils import rawtext_loader
from pymedext_eds.med import MedicationAnnotator, MedicationNormalizer

endlines = Endlines(["raw_text"], "clean_text", ID="endlines")
sections = SectionSplitter(["clean_text"], "section", ID="sections")
sentenceSplitter = SentenceTokenizer(["section"], "sentence", ID="sentences")

models_param = [{'tagger_path': 'data/models/apmed5/entities/final-model.pt',
                 'tag_name': 'entity_pred'},
                {'tagger_path': 'data/models/apmed5/events/final-model.pt',
                 'tag_name': 'event_pred'},
                {'tagger_path': 'data/models/apmed5/drugblob/final-model.pt',
                 'tag_name': 'drugblob_pred'}]

med = MedicationAnnotator(['sentence'], 'med', ID='med:v2', models_param=models_param, device='cuda:1')

data_path = pkg_resources.resource_filename('pymedext_eds', 'data/romedi')
romedi_path = glob(data_path + '/*.p')[0]

norm = MedicationNormalizer(['ENT/DRUG', 'ENT/CLASS'], 'normalized_mention', ID='norm', romedi_path=romedi_path)

pipeline = [endlines, sections, sentenceSplitter, med, norm]

data_path = pkg_resources.resource_filename('pymedext_eds', 'data/demo')
file_list = glob(data_path + '/*.txt')

docs = [rawtext_loader(x) for x in file_list]

for doc in docs:
    doc.annotate(pipeline)

# list the extracted drug mentions
[t.value for t in docs[0].get_annotations('ENT/DRUG')]

# slice clean_text at a reported span: the extracted string
# does not match the annotation's value
docs[0].get_annotations('clean_text')[0].value[5687:5691]
marc-r-vincent commented 3 years ago

This is likely not limited to NER: it seems to also affect the QuickUMLS annotator's output, perhaps because pymedext references something other than the raw text (using QuickUMLS as a standalone library does not show the same behavior).

Code comparing the searched terms against the strings extracted from the raw text at the reported spans, checking both 0-based and 1-based indexing (to be placed at the end of demo_pymedext_eds):

import numpy as np

ck = chunk[0]
raw_text = ck.raw_text()
annots = [annot.to_dict() for annot in ck.get_annotations("umls")]

comps = []
is_equal_idx0 = []
is_equal_idx1 = []
for annot in annots:
    span = annot["span"]
    # interpret the span as 0-indexed
    is_equal_idx0.append(annot["value"] == raw_text[span[0]:span[1]])
    # interpret the span as 1-indexed
    is_equal_idx1.append(annot["value"] == raw_text[span[0] - 1:span[1] - 1])
    comps.append((annot["value"], raw_text[span[0]:span[1]]))

print("- all terms and extracts equal (idx0 and idx1):", (np.all(is_equal_idx0), np.all(is_equal_idx1)))
print("- some terms and extracts equal (idx0 and idx1):", (np.any(is_equal_idx0), np.any(is_equal_idx1)))
print("- comparisons (idx0):", comps)

with output:

aneuraz commented 3 years ago

@marc-r-vincent You're right. Currently, spans are aligned on the preprocessed text produced by Endlines. If you have any idea how to map spans between raw_text and the preprocessed text, that would be great!
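One possible approach, sketched below, is to build a character-offset map between the preprocessed text and the raw text using `difflib.SequenceMatcher` from the standard library, then translate each span back to raw-text coordinates. This is a minimal illustration, not pymedext code: the function names (`clean_to_raw_map`, `remap_span`) and the toy strings are hypothetical, and it assumes the preprocessing only deletes or normalizes characters (aligned regions stay identical), which may not hold for every Endlines transformation.

```python
# Hypothetical sketch: map spans computed on a cleaned/preprocessed text
# back onto the raw text via matching blocks from difflib.
from difflib import SequenceMatcher


def clean_to_raw_map(raw_text, clean_text):
    """Return a dict mapping each clean_text offset to its raw_text offset."""
    sm = SequenceMatcher(a=raw_text, b=clean_text, autojunk=False)
    mapping = {}
    for a, b, size in sm.get_matching_blocks():
        for k in range(size):
            mapping[b + k] = a + k  # character b+k in clean came from a+k in raw
    return mapping


def remap_span(span, mapping):
    """Translate a (start, end) span (end exclusive) from clean to raw offsets."""
    start, end = span
    # map the last covered character, then re-add the exclusive end
    return mapping[start], mapping[end - 1] + 1


# toy example: whitespace/endline normalization shifts offsets by 2
raw = "aspirine  100 mg\nmatin"
clean = "aspirine 100 mg matin"
mapping = clean_to_raw_map(raw, clean)
span = (clean.index("100"), clean.index("100") + 3)  # span in clean_text
print(remap_span(span, mapping))  # span in raw_text
```

Characters that were inserted by preprocessing have no raw counterpart and are absent from the map, so a real implementation would need a policy for spans starting or ending on such characters (e.g. snap to the nearest mapped offset).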