Open tskir opened 10 months ago
hi @tskir
Thank you for documenting this issue. The reasons for this problem are twofold:
We are aware of this issue and implemented a workaround for the spans in our Open Targets submission at https://github.com/ML4LitS/otar-maintenance/blob/main/OTAR_new_pipeline_fulltext_bioformer_cluster_all.py
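    # Excerpt: pred, batch_sentences and count are defined by the surrounding pipeline code.
    for all_ent in pred:
        my_sentence = batch_sentences[count]
        if all_ent:
            x_list_ = []
            for ent in all_ent:
                # Force any COVID-19 fragment to the disease label, whatever the model predicted.
                if my_sentence[ent['start']:ent['end']] in ['19', 'COVID', 'COVID-19']:
                    ent['entity_group'] = 'DS'
                x_list_.append([ent['start'], ent['end'], ent['entity_group'], my_sentence[ent['start']:ent['end']]])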
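For anyone who wants to try the relabelling outside the full pipeline, here is a minimal self-contained sketch of the same idea. The function name `relabel_covid_spans`, the constant `COVID_FRAGMENTS`, the sample sentence and the `GP` label in the sample input are assumptions made for illustration; only the three surface forms and the `DS` label come from the excerpt above. Unlike the excerpt, it does not modify the entity dicts in place.

```python
from typing import Dict, List

# Surface forms taken from the workaround above; the constant name is hypothetical.
COVID_FRAGMENTS = {"19", "COVID", "COVID-19"}


def relabel_covid_spans(sentence: str, entities: List[Dict]) -> List[list]:
    """Return [start, end, label, text] rows, forcing COVID fragments to 'DS'."""
    rows = []
    for ent in entities:
        text = sentence[ent["start"]:ent["end"]]
        # Override the predicted label only when the span text is a known fragment.
        label = "DS" if text in COVID_FRAGMENTS else ent["entity_group"]
        rows.append([ent["start"], ent["end"], label, text])
    return rows


# Made-up input imitating a split prediction: "COVID" and "19" as separate spans.
sentence = "Patients with COVID-19 were treated."
entities = [
    {"start": 14, "end": 19, "entity_group": "GP"},
    {"start": 20, "end": 22, "entity_group": "DS"},
]
rows = relabel_covid_spans(sentence, entities)
print(rows)
```

Note that this only corrects the label: the two fragments stay separate spans, and any entity whose text is exactly "19" gets relabelled even when it has nothing to do with COVID-19.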
Fixed in the newer version: I have added new data with COVID-19 annotations and retrained the model recently. Hopefully, this should be sorted.
As I was experimenting with the repository, I stumbled upon some interesting behaviour which I thought might be useful to document.
I was running the model as described in the README (with a minor fix in https://github.com/ML4LitS/annotation_models/pull/4) and tested it on a few real abstracts. Almost all seem to work great, but here are three examples where things go awry. It looks like in some cases the "COVID-19" token is getting split into two, with the first part sometimes being omitted (example 1).
Unfortunately I couldn't find an easy fix (I experimented a bit with tokeniser init parameters, but to no avail), so for now this is just a report without a proposed solution. It may also be a known problem already — in this case my apologies!
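To make the behaviour easier to spot across many abstracts, a small check along these lines might help. It is only a sketch: the helper name `find_split_mentions` is hypothetical, and it assumes each predicted entity is a dict with character offsets under `start` and `end`, as in the workaround excerpt in this thread. It flags every occurrence of "COVID-19" that is not fully covered by one entity span and lists the fragments that were predicted instead.

```python
import re
from typing import Dict, List


def find_split_mentions(text: str, entities: List[Dict], term: str = "COVID-19") -> List[dict]:
    """Report occurrences of `term` that no single entity span fully covers."""
    problems = []
    for match in re.finditer(re.escape(term), text):
        start, end = match.span()
        # Entities that overlap this occurrence at all.
        overlapping = [e for e in entities if e["start"] < end and e["end"] > start]
        # The occurrence is fine if one entity covers it from start to end.
        covered = any(e["start"] <= start and e["end"] >= end for e in overlapping)
        if not covered:
            problems.append({
                "start": start,
                "end": end,
                "fragments": [text[e["start"]:e["end"]] for e in overlapping],
            })
    return problems


# Made-up input: only the "19" part was predicted, as described for example 1.
text = "Patients with COVID-19 were treated."
partial = [{"start": 20, "end": 22, "entity_group": "DS"}]
print(find_split_mentions(text, partial))
```

An empty `fragments` list means the mention was missed entirely, one fragment means part of it was dropped, and two or more mean it was split.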
Example 1
Source: https://www.scielo.br/j/ibju/a/ymkTGVgBVd3ZhQLdVgMVQsQ/?format=html
Text:
Output (emphasis >>> mine):
Example 2
Source: https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2774707
Text:
Output:
Example 3
Source: https://www.sciencedirect.com/science/article/pii/S0163445320307179
Text:
Output: