Sometimes the smart span matching matches a very long string along with all (or many) of its available substrings. This uses a lot of resources for mostly bad matches.
E.g. the list of genes in the 'figure 1' caption of the PMC4589645 article.
Can we switch to beam search over the predicted BIO labels, or some other approach? Most of these sub-spans already get picked up by the ExplosionNERStep when they are actually relevant.
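To make the beam search idea concrete, here is a minimal sketch in plain Python. It is not the current SmartSpan implementation: the label set, the function names (`beam_search_bio`, `bio_to_spans`) and the per-token log-probability input format are all assumptions for illustration. It keeps only the `beam_width` best label sequences, rejects invalid BIO transitions, and yields one set of non-overlapping spans instead of every available sub-span.

```python
import math
from typing import Dict, List, Tuple


def is_valid_transition(prev: str, curr: str) -> bool:
    """BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in (f"B-{entity}", f"I-{entity}")
    return True


def beam_search_bio(
    token_log_probs: List[Dict[str, float]], beam_width: int = 3
) -> Tuple[List[str], float]:
    """Return the best valid BIO label sequence and its total log-probability.

    token_log_probs holds one {label: log_prob} dict per token
    (hypothetical input format, not the real model output).
    """
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for scores in token_log_probs:
        candidates = []
        for labels, total in beams:
            prev = labels[-1] if labels else "O"  # sequence start behaves like "O"
            for label, score in scores.items():
                if is_valid_transition(prev, label):
                    candidates.append((labels + [label], total + score))
        # keep only the beam_width highest scoring partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Convert a valid BIO sequence to (start, end_exclusive) token spans."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "O" or label.startswith("B-"):
            if start is not None:
                spans.append((start, i))  # close the currently open span
                start = None
        if label.startswith("B-"):
            start = i
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Toy example with a hypothetical label set; probabilities are made up.
example = [
    {"O": math.log(0.1), "B-gene": math.log(0.7), "I-gene": math.log(0.2)},
    {"O": math.log(0.2), "B-gene": math.log(0.1), "I-gene": math.log(0.7)},
    {"O": math.log(0.8), "B-gene": math.log(0.1), "I-gene": math.log(0.1)},
    {"O": math.log(0.3), "B-gene": math.log(0.1), "I-gene": math.log(0.6)},
]
best_labels, best_score = beam_search_bio(example, beam_width=3)
best_spans = bio_to_spans(best_labels)
```

In the toy example a greedy per-token argmax would pick `I-gene` for the last token directly after `O`, which is not a valid BIO sequence. The constrained beam avoids that, and the cost per token is bounded by `beam_width` times the number of labels rather than by the number of candidate sub-spans.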
Also, as per conversation with @wonjininfo, we need to make some changes to use the new multi-label tinyber2 classifier:
def get_softmax_predictions(self, loader: DataLoader) -> Tensor:
    """
    Run BERT over the given dataloader and return the raw logits
    (previously: a namedtuple_values_indices of confidence and labels).

    :param loader: dataloader to run prediction on
    :return: the concatenated logits of all batches
    """
    results = torch.cat(
        [
            x.logits
            for x in self.trainer.predict(
                model=self.model, dataloaders=loader, return_predictions=True
            )
        ]
    )
    # return logits here
    # softmax = self.softmax(results)
    # get confidence scores and label ints
    # confidence_and_labels_tensor = torch.max(softmax, dim=-1)
    return results
set smartspan threshold to 0.0 (negative logits are negative results)
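A small sketch of why 0.0 is a natural threshold once the method returns logits. It assumes the multi-label classifier scores each label independently with a sigmoid, which the issue does not state explicitly. The label names, the helper `positive_labels` and the strict `>` comparison are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def positive_labels(
    logits: List[float], labels: List[str], threshold: float = 0.0
) -> List[str]:
    """Keep every label whose logit is above the threshold (multi-label)."""
    return [label for logit, label in zip(logits, labels) if logit > threshold]


# Hypothetical logits for one token and four hypothetical classes.
labels = ["gene", "disease", "drug", "species"]
logits = [2.3, -0.7, 0.1, -4.0]

by_logit = positive_labels(logits, labels, threshold=0.0)
# Same decision expressed as a probability cut-off of 0.5.
by_probability = [l for x, l in zip(logits, labels) if sigmoid(x) > 0.5]
```

Under that assumption, thresholding the logit at 0.0 is the same decision as thresholding the sigmoid probability at 0.5, so no softmax or sigmoid call is needed at inference time, and several labels can be positive for the same token.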
Example article (PMC4589645, see the 'figure 1' caption): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4589645/
fix smart span processor to detect spans properly