equipe22 / pymedext_eds

PyMedExt annotators for the EDS pipeline
Apache License 2.0

Span in NER model error #4

Open patfol opened 3 years ago

patfol commented 3 years ago

The spans returned by the NER model are incorrect: slicing the text with an annotation's span does not yield the annotated mention.

Code to reproduce:


from glob import glob
import pkg_resources

from pymedextcore.document import Document
from pymedext_eds.annotators import Endlines, SentenceTokenizer, SectionSplitter
from pymedext_eds.utils import rawtext_loader
from pymedext_eds.med import MedicationAnnotator, MedicationNormalizer

endlines = Endlines(["raw_text"], "clean_text", ID="endlines")
sections = SectionSplitter(["clean_text"], "section", ID="sections")
sentenceSplitter = SentenceTokenizer(["section"], "sentence", ID="sentences")

models_param = [{'tagger_path': 'data/models/apmed5/entities/final-model.pt',
                 'tag_name': 'entity_pred'},
                {'tagger_path': 'data/models/apmed5/events/final-model.pt',
                 'tag_name': 'event_pred'},
                {'tagger_path': 'data/models/apmed5/drugblob/final-model.pt',
                 'tag_name': 'drugblob_pred'}]

med = MedicationAnnotator(['sentence'], 'med', ID='med:v2', models_param=models_param, device='cuda:1')

data_path = pkg_resources.resource_filename('pymedext_eds', 'data/romedi')
romedi_path = glob(data_path + '/*.p')[0]

norm = MedicationNormalizer(['ENT/DRUG', 'ENT/CLASS'], 'normalized_mention', ID='norm', romedi_path=romedi_path)

pipeline = [endlines, sections, sentenceSplitter, med, norm]

data_path = pkg_resources.resource_filename('pymedext_eds', 'data/demo')
file_list = glob(data_path + '/*.txt')

docs = [rawtext_loader(x) for x in file_list]

for doc in docs:
    doc.annotate(pipeline)

# list the extracted drug mentions
[t.value for t in docs[0].get_annotations('ENT/DRUG')]

# slice clean_text at a reported span: the extracted string
# does not match the annotation's value
docs[0].get_annotations('clean_text')[0].value[5687:5691]
marc-r-vincent commented 3 years ago

This is likely not limited to NER: it seems to also affect the QuickUMLS annotator's output, perhaps because pymedext references something other than the raw text (using QuickUMLS as a standalone library does not show the same behavior).

Code comparing the searched terms against the strings extracted from the raw text at the reported spans, checking both 0-based and 1-based indexing (to be placed at the end of demo_pymedext_eds):

import numpy as np

ck = chunk[0]
raw_text = ck.raw_text()
annots = [annot.to_dict() for annot in ck.get_annotations("umls")]

comps = []
is_equal_idx0 = []
is_equal_idx1 = []
for annot in annots:
    span = annot["span"]
    # interpret the span as 0-indexed
    is_equal_idx0.append(annot["value"] == raw_text[span[0]:span[1]])
    # interpret the span as 1-indexed
    is_equal_idx1.append(annot["value"] == raw_text[span[0] - 1:span[1] - 1])
    comps.append((annot["value"], raw_text[span[0]:span[1]]))

print("- all terms and extracts equal (idx0 and idx1):", (np.all(is_equal_idx0), np.all(is_equal_idx1)))
print("- some terms and extracts equal (idx0 and idx1):", (np.any(is_equal_idx0), np.any(is_equal_idx1)))
print("- comparisons (idx0):", comps)

with output:

aneuraz commented 3 years ago

@marc-r-vincent You're right. Currently, spans are aligned on the preprocessed text produced by Endlines. If you have any idea how to map spans between raw_text and the preprocessed text, that would be great!
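One possible approach, sketched below, is to build a character-offset map between the preprocessed text and the raw text using `difflib.SequenceMatcher` from the standard library, then translate each span back to raw-text coordinates. This is a minimal illustration, not pymedext code: the function names (`clean_to_raw_map`, `remap_span`) and the toy strings are hypothetical, and it assumes the preprocessing only deletes or normalizes characters (aligned regions stay identical), which may not hold for every Endlines transformation.

```python
# Hypothetical sketch: map spans computed on a cleaned/preprocessed text
# back onto the raw text via matching blocks from difflib.
from difflib import SequenceMatcher


def clean_to_raw_map(raw_text, clean_text):
    """Return a dict mapping each clean_text offset to its raw_text offset."""
    sm = SequenceMatcher(a=raw_text, b=clean_text, autojunk=False)
    mapping = {}
    for a, b, size in sm.get_matching_blocks():
        for k in range(size):
            mapping[b + k] = a + k  # character b+k in clean came from a+k in raw
    return mapping


def remap_span(span, mapping):
    """Translate a (start, end) span (end exclusive) from clean to raw offsets."""
    start, end = span
    # map the last covered character, then re-add the exclusive end
    return mapping[start], mapping[end - 1] + 1


# toy example: whitespace/endline normalization shifts offsets by 2
raw = "aspirine  100 mg\nmatin"
clean = "aspirine 100 mg matin"
mapping = clean_to_raw_map(raw, clean)
span = (clean.index("100"), clean.index("100") + 3)  # span in clean_text
print(remap_span(span, mapping))  # span in raw_text
```

Characters that were inserted by preprocessing have no raw counterpart and are absent from the map, so a real implementation would need a policy for spans starting or ending on such characters (e.g. snap to the nearest mapped offset).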