g-simmons / 289G_NLP_project_FQ2020


[HIGHEST] Check/fix BERT encodings so they align with entity_spans #34

Closed g-simmons closed 3 years ago

g-simmons commented 3 years ago

For each sample in our dataset (from the pickle file), the elements of sample["entity_spans"] give the consecutive token positions that make up each entity.

If BERT's tokenization does not align with the tokenization inherent to the BioInfer dataset, these entity_spans will be incorrect for the BERT encodings: they will index the wrong positions.

We should first confirm that this is a problem. The pretrained BERT model may already have some conveniences built in to take care of this; I'm not sure.
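One way to run that check, as a rough sketch rather than code from this repo: count how many subwords each dataset token was split into. It assumes we can get a per-subword list of source-token indices, with None for special tokens such as [CLS] and [SEP]. Hugging Face fast tokenizers expose this via `word_ids()` when called with `is_split_into_words=True`. The function name and the example values are hypothetical.

```python
from collections import Counter


def find_split_tokens(word_ids):
    """Return {token index: subword count} for tokens split into >1 subword.

    word_ids has one entry per subword position; None marks special tokens.
    """
    counts = Counter(w for w in word_ids if w is not None)
    return {w: n for w, n in counts.items() if n > 1}


# Hypothetical sample: 3 dataset tokens, the second split into 3 subwords.
# Subword layout: [CLS], tok0, tok1, tok1, tok1, tok2, [SEP]
example_word_ids = [None, 0, 1, 1, 1, 2, None]
split = find_split_tokens(example_word_ids)

# A non-empty result means entity_spans indices no longer line up
# with positions in the BERT output for this sample.
print(split)
```

Even when no token is split, the leading special token shifts every position by one, so an empty result alone does not prove that the spans are usable as-is.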

If this is a problem, the proposed solution is to combine/edit the BERT encodings after the forward pass so that they align, token for token, with what the from-scratch embeddings and BLSTM layer would produce.
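A minimal sketch of what "combining" could look like, assuming mean pooling over the subwords of each dataset token (taking only the first subword is another common choice; the issue does not say which is intended). It uses plain Python lists in place of tensors, and `pool_to_tokens`, `word_ids`, and the example vectors are hypothetical names and values for illustration.

```python
def pool_to_tokens(subword_vectors, word_ids):
    """Mean-pool subword vectors so there is one vector per dataset token.

    subword_vectors: one vector (list of floats) per subword position.
    word_ids: source-token index per subword, None for special tokens.
    """
    sums, counts = {}, {}
    for vec, w in zip(subword_vectors, word_ids):
        if w is None:  # drop [CLS], [SEP], padding
            continue
        if w not in sums:
            sums[w] = [0.0] * len(vec)
            counts[w] = 0
        sums[w] = [a + b for a, b in zip(sums[w], vec)]
        counts[w] += 1
    # Emit vectors in token order so entity_spans can index the result.
    return [[x / counts[w] for x in sums[w]] for w in sorted(sums)]


# Hypothetical encodings: token 1 was split into two subwords.
vectors = [[9.0, 9.0], [1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [5.0, 5.0], [9.0, 9.0]]
word_ids = [None, 0, 1, 1, 2, None]
pooled = pool_to_tokens(vectors, word_ids)
print(pooled)  # [[1.0, 2.0], [3.0, 6.0], [5.0, 5.0]]
```

The pooled output has one row per original token, so sample["entity_spans"] can index it directly. A token that produces no subwords at all would be missing from the result and would need separate handling.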

g-simmons commented 3 years ago

50