bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

bc7_litcovid returns None instances #713

Closed jason-fries closed 3 weeks ago

jason-fries commented 2 years ago

Describe the bug

bc7_litcovid generates instances whose text field is None, e.g., {'id': '34', 'document_id': '34219343', 'text': None, 'labels': ['Prevention']}

Steps to reproduce the bug

Iterate through the dataset as normal.

Expected results

text should have a value, since the instance has labels

Actual results

text is None in several cases
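
A minimal sketch of a check that surfaces the affected instances, assuming the dataset can be iterated as dictionaries shaped like the example above. The helper name `find_missing_text` and the `sample` list are hypothetical; in practice the iterable would be the loaded bc7_litcovid split.

```python
from typing import Any, Dict, Iterable, List


def find_missing_text(instances: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the instances whose 'text' field is None or blank."""
    missing = []
    for instance in instances:
        text = instance.get("text")
        if text is None or not str(text).strip():
            missing.append(instance)
    return missing


# Hypothetical stand-in for the loaded split. The first record is the one
# reported in this issue; the second is a made-up placeholder.
sample = [
    {"id": "34", "document_id": "34219343", "text": None, "labels": ["Prevention"]},
    {"id": "35", "document_id": "00000000", "text": "Placeholder abstract.", "labels": ["Treatment"]},
]

for instance in find_missing_text(sample):
    print(instance["document_id"], instance["labels"])
```

Treating blank strings the same as None is a design choice of this sketch, so that whitespace-only values are also reported.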

shamikbose commented 2 years ago

self-assign

shamikbose commented 2 years ago

@jason-fries There's missing data in the actual file. Should I just log it as a warning and exclude it (and others like it) from the dataset?

| PMID | Journal | Title | Abstract | Keywords | Publication Type | Authors | DOI | Label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 34219343 | Liver Int | Management of liver disease in Italy after one year of the SARS-CoV-2 pandemic: A web-based survey. | | covid19;hcc;sars-cov-2;cirrhosis;liver transplant | Journal Article | Ponziani, Francesca Romana;Aghemo, Alessio;Cabibbo, Giuseppe;Masarone, Mario;Montagnese, Sara;Petta, Salvatore;Russo, Francesco Paolo;Lai, Quirino | 10.1111/liv.14998 | Prevention |

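
The "log a warning and exclude" option could look roughly like the sketch below. It is an illustration only, not the change made in the repository: it assumes the source file is delimited text with the column names shown in the header row above, and that the text field is built from the Abstract column. The function name `iter_valid_rows`, the tab delimiter, and the in-memory sample (reduced to four columns, with a made-up second row) are all hypothetical.

```python
import csv
import io
import logging
from typing import Dict, Iterator, TextIO

logger = logging.getLogger(__name__)


def iter_valid_rows(handle: TextIO, delimiter: str = "\t") -> Iterator[Dict[str, str]]:
    """Yield rows with a non-empty Abstract; warn about and skip the rest."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        abstract = (row.get("Abstract") or "").strip()
        if not abstract:
            # Record which PMID was dropped so the exclusion is traceable.
            logger.warning("Skipping PMID %s: empty Abstract", row.get("PMID"))
            continue
        yield row


# Hypothetical in-memory sample. The first data row mirrors the record above
# (empty Abstract); the second is a made-up placeholder.
sample = (
    "PMID\tTitle\tAbstract\tLabel\n"
    "34219343\tManagement of liver disease in Italy after one year of the "
    "SARS-CoV-2 pandemic: A web-based survey.\t\tPrevention\n"
    "00000000\tPlaceholder title\tPlaceholder abstract.\tTreatment\n"
)

kept = list(iter_valid_rows(io.StringIO(sample)))
print([row["PMID"] for row in kept])
```

Logging the PMID rather than dropping rows silently keeps a record of which documents were excluded.
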
shamikbose commented 2 years ago

@jason-fries This is completed in #727