For each sample in our dataset (from the pickle file), the elements in sample["entity_spans"] indicate the consecutive token positions that constitute entities.
If the tokenization provided by BERT does not align with the tokenization inherent to the BioInfer dataset, then these entity_spans will be incorrect for encodings provided by BERT (they will point at the wrong encodings).
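A toy sketch of the misalignment (the tokenizer below is a made-up stand-in for BERT's actual wordpiece vocabulary, used only to illustrate the index shift):

```python
# Illustrative only: a fake "wordpiece" splitter that breaks hyphenated words,
# mimicking how BERT splits out-of-vocabulary tokens into subwords.
def toy_wordpiece(word):
    if "-" in word:
        head, tail = word.split("-", 1)
        return [head, "##-" + tail]
    return [word]

dataset_tokens = ["MAPK", "phosphorylates", "Elk-1", "in", "vivo"]
entity_spans = [(0, 1), (2, 3)]  # word-index spans for "MAPK" and "Elk-1"

subtokens = [st for w in dataset_tokens for st in toy_wordpiece(w)]
# subtokens: ['MAPK', 'phosphorylates', 'Elk', '##-1', 'in', 'vivo']

# The span (2, 3) now selects only 'Elk', dropping '##-1', and every span
# past the split point would be off by one in subtoken coordinates.
print(subtokens[entity_spans[1][0]:entity_spans[1][1]])  # ['Elk']
```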
We should first confirm that this is actually a problem; the pretrained BERT model may have conveniences built in that already handle this, but that is unverified.
If this is a problem, the proposed solution is to combine/edit the BERT encodings after obtaining them in the forward pass so that they align with what the from-scratch embeddings and BiLSTM layer would produce.
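One plausible version of that combine/edit step, sketched below under the assumption that we can recover a word_ids mapping from each subtoken back to its source dataset token (HuggingFace fast tokenizers expose this via word_ids(); None marks special tokens). Mean-pooling the subtoken vectors per word yields one encoding per original token, so the existing entity_spans index correctly again:

```python
# Sketch, not the finalized fix: mean-pool BERT subtoken encodings back to
# one vector per original dataset token, using a subtoken -> word-id mapping.
def pool_subtoken_encodings(encodings, word_ids):
    """Mean-pool subtoken vectors that share a word id (None = special token)."""
    pooled = []
    for wid in sorted({w for w in word_ids if w is not None}):
        group = [enc for enc, w in zip(encodings, word_ids) if w == wid]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled

# Toy example: four subtoken vectors for three original words
# (the third word was split into two subtokens).
encodings = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0], [4.0, 0.0]]
word_ids = [0, 1, 2, 2]
pooled = pool_subtoken_encodings(encodings, word_ids)
# pooled has one vector per original word; pooled[2] is the mean
# of the two subtoken vectors for the split word: [3.0, 1.0]
```

Max-pooling or taking the first subtoken's vector are common alternatives to the mean here; which works best for the downstream BiLSTM would need to be checked empirically.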