Open patfol opened 3 years ago
This is likely not limited to NER; it also seems to affect the quickumls annotator's output, possibly because pymedext references something other than the raw text (using quickumls as a standalone library does not show the same behavior)?
Code comparing the searched terms with the strings extracted from the raw text at the reported spans, checking both 0-based and 1-based indexing (placed at the end of demo_pymedext_eds):
```python
import numpy as np

# Compare each annotated term against the raw-text slice at its reported span,
# testing both 0-based and 1-based indexing conventions.
ck = chunk[0]
raw_text = ck.raw_text()
annots = [annot.to_dict() for annot in ck.get_annotations("umls")]

comps = []
is_equal_idx0 = []
is_equal_idx1 = []
for annot in annots:
    span = annot["span"]
    is_equal_idx0.append(annot["value"] == raw_text[span[0]:span[1]])
    is_equal_idx1.append(annot["value"] == raw_text[span[0] - 1:span[1] - 1])
    comps.append((annot["value"], raw_text[span[0]:span[1]]))

print("- all terms and extracts equal (idx0 and idx1):", (np.all(is_equal_idx0), np.all(is_equal_idx1)))
print("- some terms and extracts equal (idx0 and idx1):", (np.any(is_equal_idx0), np.any(is_equal_idx1)))
print("- comparisons (idx0):", comps)
```
With output:
@marc-r-vincent It's true. Currently, spans are aligned to the preprocessed text produced by Endlines. If you have any idea how to map spans between raw_text and the preprocessed text, that would be great!
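One possible approach (a minimal sketch, not pymedext's API; the function names here are hypothetical) is to build a character-level alignment between the preprocessed text and the raw text with the standard library's `difflib.SequenceMatcher`, then translate each preprocessed-text offset back to a raw-text offset:

```python
import difflib

def map_span(preprocessed: str, raw: str, span: tuple) -> tuple:
    """Translate a (start, end) span on `preprocessed` to a span on `raw`.

    Builds a per-character offset map from the matching blocks found by
    difflib; characters that exist only in the preprocessed text fall back
    to the last aligned raw-text offset.
    """
    sm = difflib.SequenceMatcher(a=preprocessed, b=raw, autojunk=False)
    offset_map = [None] * (len(preprocessed) + 1)
    for a, b, size in sm.get_matching_blocks():
        for k in range(size + 1):
            offset_map[a + k] = b + k
    # Fill gaps left by characters the preprocessing inserted or rewrote.
    last = 0
    for i, v in enumerate(offset_map):
        if v is None:
            offset_map[i] = last
        else:
            last = v
    start, end = span
    return offset_map[start], offset_map[end]

# Example: preprocessing collapsed a double space into a single one,
# shifting everything after it by one character.
print(map_span("abc def", "abc  def", (4, 7)))  # → (5, 8)
```

This works character by character, so it tolerates insertions, deletions, and substitutions made by the preprocessing step, though it may be slow on very long documents; if Endlines only performs known, local edits, recording the offset deltas during preprocessing itself would be cheaper.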
The spans in the NER model are incorrect.
Code to reproduce: