@Wufang1997 Could you please provide an example where this preprocessing script would fail?
> @Wufang1997 Could you please provide an example where this preprocessing script would fail?

Hi, thanks for your reply. I think there is nothing wrong with your preprocessed dataset; you have already taken the impact of the tokenizer into consideration. However, this process does not appear in your code, and that is what confused me. If someone wishes to adapt your approach to their own custom dataset, they need to write a small program to match the old positions with the new token positions, for example along the lines of the sketch below.
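I imagine such a program could look roughly like this. This is only my own illustration, not code from this repo: it assumes a HuggingFace `BertTokenizerFast`, and the query, context, and the character span for "Apple" are made up for the example. The `offset_mapping` returned by the fast tokenizer is what connects each context token back to character offsets in the raw context.

```python
# A rough sketch of the kind of "small program" I mean; my own illustration
# with made-up data, not the repo's code.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

query = "Find all organization entities in the text."
context = "Apple was founded in Cupertino in 1976."
# Made-up gold span: "Apple" covers characters [0, 5) of the raw context.
char_start, char_end = 0, 5

enc = tokenizer(query, context, return_offsets_mapping=True, return_token_type_ids=True)
offsets = enc["offset_mapping"]   # per-token (char_start, char_end) within its own sequence
type_ids = enc["token_type_ids"]  # 0 = query side, 1 = context side

# Map character offsets of the original context to token indices,
# skipping query tokens and special tokens (whose offsets are (0, 0)).
origin_offset2token_idx_start, origin_offset2token_idx_end = {}, {}
for token_idx, (start, end) in enumerate(offsets):
    if type_ids[token_idx] == 0 or start == end:
        continue
    origin_offset2token_idx_start[start] = token_idx
    origin_offset2token_idx_end[end] = token_idx

new_start = origin_offset2token_idx_start[char_start]
new_end = origin_offset2token_idx_end[char_end]
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print(new_start, new_end, tokens[new_start:new_end + 1])  # token span covering "Apple"
```

With dictionaries like these, looking up an entity's original character offsets directly yields its token-level start and end positions, which I assume is what your preprocessed dataset already stores.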
Hi, there is something important that I do not understand about the way you construct the MRC-NER dataset. Specifically, I am confused about how the new start and end positions are obtained from the original ones:

```python
origin_offset2token_idx_start, origin_offset2token_idx_end = {}, {}
for token_idx in range(len(tokens)):
    if type_ids[token_idx] == 0:
        continue
```
In the above code, you directly map the old start and end positions to new ones using the dictionaries `origin_offset2token_idx_start` and `origin_offset2token_idx_end`. However, I believe that when the concatenation of the query and the context is fed into the tokenizer, a single token may cover more than one character. Therefore, the character indices of entities in the original context differ from their indices in the tokenized context, so the original offsets cannot be used directly to locate the entities. Could you please explain why you designed it this way and whether your implementation resolves this issue?
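To make the concern concrete, here is a toy example of my own (again just an illustration with a standard `BertTokenizerFast`; the sentence and the entity offsets are made up):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

context = "Johannesburg welcomed Nelson Mandela in 1990."
# In the raw string, "Nelson Mandela" starts at character 22, but that
# character position does not match the position of the corresponding tokens.
tokens = tokenizer.tokenize(context)
print(list(enumerate(tokens)))
# A WordPiece token can cover several characters, and a rare word can be
# split into several pieces, so character offsets and token indices diverge.
```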
Many thanks!