bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Wrong entity offsets in the tmvar_v3 datasets #873

Closed by WangXII 1 year ago

WangXII commented 1 year ago

Describe the bug

Some entities in the tmvar_v3 dataset have wrong entity offsets.

Steps to reproduce the bug

Python Code

```python
import re

from datasets import load_dataset

dataset_name = "tmvar_v3"
dataset = load_dataset(f"bigbio/{dataset_name}", name=f"{dataset_name}_bigbio_kb")
doc_id = "21904390"
# `split` was not defined in the original snippet; look up the split that contains the document.
split = next(s for s in dataset if doc_id in set(dataset[s]["document_id"]))


def check_offsets(doc_id):
    doc = dataset[split].filter(lambda x: x["document_id"] == doc_id)[0]
    # The document text is the first two passages joined by a single space.
    text = doc["passages"][0]["text"][0] + " " + doc["passages"][1]["text"][0]
    sentences = text.split(". ")
    sentence_indexes = [m.start() + 2 for m in re.finditer(r"\. ", text)]  # because the suffix is ". "
    sentence_indexes = [0] + sentence_indexes
    doc_entities = doc["entities"]
    print(sentence_indexes)
    print(len(sentences))
    print(text)
    print(doc_entities)
    sentence_index = 0
    entity_index = 0
    current_offset = 0
    next_sentence_offset = 0
    # "Real" offsets are found by searching for each entity's text from the current position.
    next_entity_offset = text[current_offset:].find(doc_entities[entity_index]["text"][0])
    while True:
        if sentence_index >= len(sentence_indexes) and entity_index >= len(doc_entities):
            break
        if next_sentence_offset <= next_entity_offset:
            sentence_end = sentence_indexes[sentence_index] + len(sentences[sentence_index]) + 2
            print(f"Sentence {sentence_index} @ offsets {sentence_indexes[sentence_index]} to {sentence_end}")
            print(sentences[sentence_index] + ". \n")
            sentence_index += 1
            if sentence_index >= len(sentence_indexes):
                next_sentence_offset = len(text)
            else:
                next_sentence_offset = sentence_indexes[sentence_index]
            # print(f"DEBUG next_sentence_offset: {next_sentence_offset}")
        else:  # next_entity_offset < next_sentence_offset
            entity = doc_entities[entity_index]
            entity_name = entity["text"][0]
            given_offset_start = entity["offsets"][0][0]
            given_offset_end = entity["offsets"][0][1]
            print(f" {entity_name} @ offsets (real) {next_entity_offset} to {next_entity_offset + len(entity_name)}")
            print(f" {text[given_offset_start:given_offset_end]} @ offset (given) {given_offset_start} to {given_offset_end}")
            current_offset = next_entity_offset + len(entity_name)
            entity_index += 1
            if entity_index >= len(doc_entities):
                next_entity_offset = len(text)
            else:
                next_entity_offset = current_offset + text[current_offset:].find(doc_entities[entity_index]["text"][0])
            # print(f"DEBUG next_entity_offset: {next_entity_offset}")


check_offsets(doc_id)
```
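A shorter way to flag the same kind of problem is to slice the text at each annotated offset and compare the result with the entity's own text. The sketch below is only an illustration: `find_mismatched_entities` is a hypothetical helper, the toy document is invented rather than taken from tmvar_v3, and it assumes entities carry parallel `text` and `offsets` lists as in the script above.

```python
def find_mismatched_entities(text, entities):
    """Return entities whose annotated offsets do not slice out their own text."""
    mismatches = []
    for entity in entities:
        # Assumed layout: one text fragment per (start, end) offset pair.
        for fragment, (start, end) in zip(entity["text"], entity["offsets"]):
            if text[start:end] != fragment:
                mismatches.append((fragment, (start, end), text[start:end]))
    return mismatches


# Toy document, invented for illustration (not taken from tmvar_v3).
toy_text = "Mutations in PAX6 cause aniridia."
toy_entities = [
    {"text": ["PAX6"], "offsets": [[13, 17]]},      # offsets match the text
    {"text": ["aniridia"], "offsets": [[25, 33]]},  # deliberately shifted by one
]
mismatches = find_mismatched_entities(toy_text, toy_entities)
print(mismatches)
```

Unlike the search-based script, this check does not recover the correct offsets; it only reports which annotations fail to reproduce their text.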

Expected results

For PubMed ID 21904390, the expected offsets of these seven entities, written as entity name (offset_start, offset_end), are as follows:

PAX6 (342, 346)
PAX6 (751, 755)
PAX6 (1153, 1157)
PAX6 (1483, 1487)
PAX6 (1627, 1631)
DKFZ p686k1684 (1640, 1654)
PAX6 (2037, 2041)

Actual results

Offsets in the tmvar_v3 dataset for the seven entities are as follows:

PAX6 (343, 347)
PAX6 (753, 757)
PAX6 (1156, 1160)
PAX6 (1487, 1491)
PAX6 (1631, 1635)
DKFZ p686k1684 (1645, 1659)
PAX6 (2043, 2047)

All the remaining entity offsets in the tmvar_v3 dataset seem to be correct.
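To make the pattern in the two lists easier to see, the following sketch computes how far each reported start offset lies from the expected one. The numbers are copied from the lists above; the variable names are illustrative only.

```python
# (offset_start, offset_end) pairs copied from the expected and actual lists in this report.
expected = [(342, 346), (751, 755), (1153, 1157), (1483, 1487),
            (1627, 1631), (1640, 1654), (2037, 2041)]
actual = [(343, 347), (753, 757), (1156, 1160), (1487, 1491),
          (1631, 1635), (1645, 1659), (2043, 2047)]

# Difference between the reported and the expected start offset, per entity.
shifts = [a[0] - e[0] for e, a in zip(expected, actual)]

# Check whether every span kept its length, i.e. start and end moved together.
same_length = all(a[1] - a[0] == e[1] - e[0] for e, a in zip(expected, actual))

print(shifts)       # [1, 2, 3, 4, 4, 5, 6]
print(same_length)  # True
```

Each span keeps its length and the shift never decreases through the document. That would be consistent with a character miscount that accumulates earlier in the text, although this report does not identify the cause.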