AstraZeneca / KAZU

Fast, world class biomedical NER
https://AstraZeneca.github.io/KAZU/
Apache License 2.0

Smart span matching #26

Open RichJackson opened 4 months ago

RichJackson commented 4 months ago

Sometimes the smart span matching matches a very long string along with all (or many) of its available substrings, which uses a lot of resources for mostly bad matches.

E.g. the list of genes in the 'figure 1' caption in this article:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4589645/

Can we switch to using beam search on the predicted BIO labels, or some other approach, especially since most of these sub-spans are already picked up by the ExplosionNERStep when they're actually relevant?
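
For illustration, a beam search over per-token BIO label scores could look roughly like the sketch below. The function name `beam_search_bio`, the label set and the tensor shapes are all made up for the example; this is not KAZU's actual decoding code:

    import torch

    def beam_search_bio(logits: torch.Tensor, labels: list[str], beam_width: int = 3) -> list[str]:
        """Decode a BIO label sequence, keeping only the top-k partial sequences per token."""
        log_probs = torch.log_softmax(logits, dim=-1)
        # each beam is (cumulative log prob, label index sequence)
        beams: list[tuple[float, list[int]]] = [(0.0, [])]
        for token_scores in log_probs:
            candidates = []
            for score, seq in beams:
                prev = labels[seq[-1]] if seq else "O"
                for idx, label_score in enumerate(token_scores.tolist()):
                    # skip I- tags that don't continue a span of the same class
                    if labels[idx].startswith("I-") and prev[2:] != labels[idx][2:]:
                        continue
                    candidates.append((score + label_score, seq + [idx]))
            beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        return [labels[i] for i in beams[0][1]]

    labels = ["O", "B-gene", "I-gene"]
    logits = torch.randn(6, len(labels))  # fake scores for a 6-token sentence
    print(beam_search_bio(logits, labels))

Constraining the search to valid BIO transitions (no dangling I- tags) would already rule out most of the spurious substring matches.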

Also:

As per a conversation with @wonjininfo, we need to make some changes to use the new multi-label TinyBERN2 classifier:

    def get_softmax_predictions(self, loader: DataLoader) -> Tensor:
        """Run the model over a dataloader (i.e. run bert) and return the raw logits.

        :param loader: dataloader to run predictions over
        :return: concatenated logits tensor
        """
        results = torch.cat(
            [x.logits for x in self.trainer.predict(model=self.model, dataloaders=loader, return_predictions=True)]
        )
        # return raw logits here; for the multi-label model we no longer apply softmax
        # softmax = self.softmax(results)
        # get confidence scores and label ints
        # confidence_and_labels_tensor = torch.max(softmax, dim=-1)
        return results

Set the smartspan threshold to 0.0 (negative logits are negative results).
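
For reference, a threshold of 0.0 on the raw logits is equivalent to a threshold of 0.5 on the sigmoid probabilities, the usual multi-label cut-off. A tiny illustration with made-up values (not the actual smartspan API):

    import torch

    logits = torch.tensor([[2.3, -0.7, 0.1],
                           [-1.2, 0.4, -0.3]])   # (tokens, labels), made-up values
    positive = logits > 0.0                      # negative logits -> negative results
    assert torch.equal(positive, torch.sigmoid(logits) > 0.5)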

Fix the smart span processor to detect spans properly.
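
As a rough sketch of what that span detection could look like, grouping B-/I- tags into contiguous spans might be done as below. The helper `bio_to_spans` and its label strings are purely illustrative, not the actual smart span processor code:

    def bio_to_spans(tags: list[str]) -> list[tuple[int, int, str]]:
        """Group BIO tags into (start, end_exclusive, entity_class) spans."""
        spans = []
        start, cls = None, None
        for i, tag in enumerate(tags):
            if tag.startswith("B-"):
                if start is not None:
                    spans.append((start, i, cls))
                start, cls = i, tag[2:]
            elif tag.startswith("I-") and cls == tag[2:]:
                continue  # extend the current span
            else:
                if start is not None:
                    spans.append((start, i, cls))
                start, cls = None, None
        if start is not None:
            spans.append((start, len(tags), cls))
        return spans

    print(bio_to_spans(["O", "B-gene", "I-gene", "O", "B-gene"]))
    # [(1, 3, 'gene'), (4, 5, 'gene')]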