ShannonAI / mrc-for-flat-nested-ner

Code for ACL 2020 paper `A Unified MRC Framework for Named Entity Recognition`

Question regarding NER dataset transfer #82

Closed smiles724 closed 3 years ago

smiles724 commented 3 years ago

Hi, there is something I do not understand about how you construct the MRC NER dataset. Specifically, I am confused about how the new start and end positions are obtained from the original ones:

```python
origin_offset2token_idx_start, origin_offset2token_idx_end = {}, {}
for token_idx in range(len(tokens)):
    # skip query tokens
    if type_ids[token_idx] == 0:
        continue

    token_start, token_end = offsets[token_idx]
    # skip special tokens such as [CLS]/[SEP]
    if token_start == token_end == 0:
        continue

    origin_offset2token_idx_start[token_start] = token_idx
    origin_offset2token_idx_end[token_end] = token_idx

new_start_positions = [origin_offset2token_idx_start[start] for start in start_positions]
new_end_positions = [origin_offset2token_idx_end[end] for end in end_positions]
```

In the above code, you directly map the old start and end positions to new ones using the dictionaries `origin_offset2token_idx_start` and `origin_offset2token_idx_end`. However, when the concatenation of query and context is fed into the tokenizer, some tokens may contain more than one character, so the indices of entities in the original context differ from their indices in the tokenized sequence. It therefore seems that the original offsets cannot be used directly to locate the entities. Could you explain why you designed it this way and whether your implementation already handles this?

Great thanks!
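For illustration, here is a minimal sketch of where such character offsets come from when using a HuggingFace fast tokenizer (the model name and example strings are placeholders, not taken from this repo): each context token is paired with the character span it covers in the original context string, so the dictionary keys above are character positions rather than token indices.

```python
# Minimal sketch, assuming a HuggingFace *fast* tokenizer; the model name and
# example strings below are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

query = "Find all person entities in the text."
context = "Barack Obama visited Paris."

enc = tokenizer(query, context,
                return_offsets_mapping=True,
                return_token_type_ids=True)

for token, type_id, (char_start, char_end) in zip(
    tokenizer.convert_ids_to_tokens(enc["input_ids"]),
    enc["token_type_ids"],
    enc["offset_mapping"],
):
    # type_id == 0 marks query/special tokens; each context token carries the
    # (char_start, char_end) span it covers in the original context string,
    # which is what the `offsets` variable in the snippet above holds.
    print(token, type_id, (char_start, char_end))
```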

YuxianMeng commented 3 years ago

@Wufang1997 Could you please provide an example where this preprocessing script would fail?

smiles724 commented 3 years ago

> @Wufang1997 Could you please provide an example where this preprocessing script would fail?

Hi, thanks for your reply. I think there is nothing wrong with your preprocessed dataset: you have already taken the effect of the tokenizer into account. However, this step does not appear in your code, and that is what confused me. If someone wishes to apply your approach to their own custom dataset, they need to write a small program to match the old positions with the new ones.
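For anyone facing the same situation with a custom dataset, a rough sketch of such a matching step might look like the following (this assumes a HuggingFace fast tokenizer and character-level entity spans with exclusive end offsets; `char_spans_to_token_spans` is just a hypothetical helper name, not something from this repo):

```python
# Rough sketch: map character-level entity spans in the context to token
# indices of the concatenated query + context sequence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)


def char_spans_to_token_spans(query, context, entity_char_spans):
    """Map (char_start, char_end) spans in `context` to token indices."""
    enc = tokenizer(query, context,
                    return_offsets_mapping=True,
                    return_token_type_ids=True)
    start_map, end_map = {}, {}
    for token_idx, (type_id, (tok_start, tok_end)) in enumerate(
        zip(enc["token_type_ids"], enc["offset_mapping"])
    ):
        # skip query tokens and special tokens such as [CLS]/[SEP] (offset (0, 0))
        if type_id == 0 or tok_start == tok_end == 0:
            continue
        start_map[tok_start] = token_idx
        end_map[tok_end] = token_idx
    return [(start_map[s], end_map[e]) for s, e in entity_char_spans]


# "Barack Obama" covers characters 0..12 (exclusive end) of the context string
print(char_spans_to_token_spans(
    "Find all person entities in the text.",
    "Barack Obama visited Paris.",
    [(0, 12)],
))
```

The token-level spans returned this way play the same role as `new_start_positions` and `new_end_positions` in the snippet above.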