ML4LitS / annotation_models

Creative Commons Zero v1.0 Universal

Inconsistent token splitting/non-detection issue with "COVID-19" #5

Open tskir opened 5 months ago

tskir commented 5 months ago

As I was experimenting with the repository, I stumbled upon some interesting behaviour which I thought might be useful to document.

I was running the model as described in the README (with a minor fix in https://github.com/ML4LitS/annotation_models/pull/4) and tested it on a few real abstracts. Almost all of them work great, but here are three examples where things go awry. In some cases the "COVID-19" token is split into two parts ("COVID" and "-19"), and the first part is sometimes omitted altogether (example 1).

Unfortunately, I couldn't find an easy fix (I experimented a bit with tokeniser init parameters, but to no avail), so for now this is just a report without a proposed solution. It may also be a known problem already; in that case, my apologies!

Example 1

Source: https://www.scielo.br/j/ibju/a/ymkTGVgBVd3ZhQLdVgMVQsQ/?format=html

Text:

The SARS-CoV-2, a newly identified β-coronavirus, is the causative agent of the third large-scale pandemic from the last two decades. The outbreak started in December 2019 in Wuhan City, Hubei province in China. The patients presented clinical symptoms of dry cough, fever, dyspnea, and bilateral lung infiltrates on imaging. By February 2020, The World Health Organization (WHO) named the disease as Coronavirus Disease 2019 (COVID-19). The Coronavirus Study Group (CSG) of the International Committee on Taxonomy of Viruses (ICTV) recognized and designated this virus as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The SARS-CoV-2 uses the same host receptor, angiotensin-converting enzyme 2 (ACE2), used by SARS-CoV to infect humans. One hypothesis of SARSCoV-2 origin indicates that it is likely that bats serve as reservoir hosts for SARSCoV-2, being the intermediate host not yet determined. The predominant route of transmission of SARS-CoV-2 is from human to human. As of May 10th 2020, the number of worldwide confirmed COVID-19 cases is over 4 million, while the number of global deaths is around 279.000 people. The United States of America (USA) has the highest number of COVID-19 cases with over 1.3 million cases followed by Spain, Italy, United Kingdom, Russia, France and Germany with over 223.000, 218.000, 215.000, 209.000, 176.000, and 171.000 cases, respectively.

Output (emphasis >>> mine):

[4, 14, 'SARS-CoV-2', 'OG', 0.97980404]
[35, 48, 'β-coronavirus', 'OG', 0.9183075]
[401, 425, 'Coronavirus Disease 2019', 'DS', 0.8462568]
>>> [432, 435, '-19', 'DS', 0.8777046]
[442, 453, 'Coronavirus', 'OG', 0.9838477]
[518, 525, 'Viruses', 'OG', 0.79415864]
[564, 569, 'virus', 'OG', 0.99294454]
[573, 620, 'severe acute respiratory syndrome coronavirus 2', 'OG', 0.93976814]
[622, 632, 'SARS-CoV-2', 'OG', 0.9186972]
[639, 649, 'SARS-CoV-2', 'OG', 0.976299]
[679, 710, 'angiotensin-converting enzyme 2', 'GP', 0.98774415]
[712, 716, 'ACE2', 'GP', 0.99562484]
[727, 735, 'SARS-CoV', 'OG', 0.95477873]
[746, 752, 'humans', 'OG', 0.9985006]
[772, 781, 'SARSCoV-2', 'OG', 0.96963006]
[822, 826, 'bats', 'OG', 0.9986842]
[856, 865, 'SARSCoV-2', 'OG', 0.9669266]
[956, 966, 'SARS-CoV-2', 'OG', 0.9631585]
[975, 980, 'human', 'OG', 0.99873465]
[984, 989, 'human', 'OG', 0.99881846]
[1046, 1054, 'COVID-19', 'DS', 0.9811101]
[1201, 1209, 'COVID-19', 'DS', 0.98579264]

Example 2

Source: https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2774707

Text:

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the etiology of coronavirus disease 2019 (COVID-19), is readily transmitted person to person. Optimal control of COVID-19 depends on directing resources and health messaging to mitigation efforts that are most likely to prevent transmission, but the relative importance of such measures has been disputed.

Output:

[0, 47, 'Severe acute respiratory syndrome coronavirus 2', 'OG', 0.94914675]
[49, 59, 'SARS-CoV-2', 'OG', 0.9634031]
[78, 102, 'coronavirus disease 2019', 'DS', 0.88417387]
>>> [104, 109, 'COVID', 'OG', 0.49266842]
>>> [109, 112, '-19', 'DS', 0.96979624]
[175, 183, 'COVID-19', 'DS', 0.94552827]

Example 3

Source: https://www.sciencedirect.com/science/article/pii/S0163445320307179

Text:

A significant number of reported COVID-19 cases can be traced back to superspreader events (SSEs), where a disproportionally large number of secondary cases relative to the standard reproductive rate, R0, are initiated. Although a superspreader is an individual who undergoes more viral shedding and transmission than others, it appears likely that environmental factors have a substantial role in SSEs. We categorise SSEs into two distinct groups: ‘societal’ and ‘isolated’ SSEs.

Output:

>>> [33, 38, 'COVID', 'OG', 0.70883656]
>>> [38, 41, '-19', 'DS', 0.72856104]

tsantosh7 commented 5 months ago

Hi @tskir,

Thank you for documenting this issue. The reasons for this problem are twofold:

  1. COVID-19 was not in the training set
  2. Non-overlapping spans in the training set

We are aware of this issue and made a workaround for the spans in our Open Targets submission at https://github.com/ML4LitS/otar-maintenance/blob/main/OTAR_new_pipeline_fulltext_bioformer_cluster_all.py:

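# Excerpt: `pred`, `batch_sentences` and `count` are defined elsewhere in the linked script.
for all_ent in pred:
    my_sentence = batch_sentences[count]
    if all_ent:
        x_list_ = []
        for ent in all_ent:
            # Force the disease label ('DS') on any COVID-19 fragment.
            if my_sentence[ent['start']:ent['end']] in ['19', 'COVID', 'COVID-19']:
                ent['entity_group'] = 'DS'
            x_list_.append([ent['start'], ent['end'], ent['entity_group'], my_sentence[ent['start']:ent['end']]])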
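
For anyone on the older model who also wants the split spans joined back together, here is a minimal, self-contained sketch of the same idea. It is not the code from the linked script. The function name `merge_split_covid`, the `COVID_FRAGMENTS` set, and the assumption that each entity is a dict with `start`, `end`, `entity_group` and `score` keys are all illustrative. It relabels COVID-19 fragments as 'DS', merges fragments that touch (examples 2 and 3), and extends a lone '-19' leftwards when the text right before it is "COVID" (example 1).

```python
from typing import Dict, List

# Hypothetical set of fragments to treat as parts of "COVID-19".
COVID_FRAGMENTS = {"COVID", "-19", "19", "COVID-19"}


def merge_split_covid(text: str, entities: List[Dict]) -> List[Dict]:
    """Relabel COVID-19 fragments as 'DS' and join fragments that touch."""
    merged: List[Dict] = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        ent = dict(ent)  # work on a copy, leave the caller's data untouched
        span = text[ent["start"]:ent["end"]]
        if span in COVID_FRAGMENTS:
            ent["entity_group"] = "DS"
            prev = merged[-1] if merged else None
            # Case 1: the previous entity is a fragment ending where this one starts.
            if (
                prev is not None
                and prev["end"] == ent["start"]
                and text[prev["start"]:prev["end"]] in COVID_FRAGMENTS
            ):
                prev["end"] = ent["end"]
                # Keep the lower of the two scores as a conservative estimate.
                prev["score"] = min(prev["score"], ent["score"])
                continue
            # Case 2: "COVID" was not detected at all, only the trailing "-19".
            if span == "-19" and text[:ent["start"]].endswith("COVID"):
                ent["start"] -= len("COVID")
        merged.append(ent)
    return merged


# Usage with the two spans reported in example 3.
text = "A significant number of reported COVID-19 cases"
entities = [
    {"start": 33, "end": 38, "entity_group": "OG", "score": 0.70883656},
    {"start": 38, "end": 41, "entity_group": "DS", "score": 0.72856104},
]
result = merge_split_covid(text, entities)
print([(e["start"], e["end"], e["entity_group"], text[e["start"]:e["end"]]) for e in result])
# prints [(33, 41, 'DS', 'COVID-19')]
```

Taking the minimum of the two scores is just one choice; averaging them would work as well. A bare '19' separated from 'COVID' by an unlabelled hyphen is relabelled but not merged in this sketch.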

Fixed in the newer version: I have added new data with COVID-19 annotations and retrained the model recently. Hopefully, this should be sorted.