bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling


JNLPBA implementation issues -- missing passages / entity only implementation #714

Open jason-fries opened 2 years ago

jason-fries commented 2 years ago

Currently JNLPBA is set up such that every token and its tag is treated as a single entity. This is not the correct setup for this task/schema -- we need to create passages with entity spans.
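For illustration, here is a minimal sketch of what "passages with entity spans" could mean for one sentence: consecutive `B-`/`I-` tags are merged into a single entity with character offsets into the passage text. The function name `bio_to_entities` and the output keys are hypothetical, not the actual bigbio kb schema, and the passage text is assumed to be the tokens joined by single spaces.

```python
def bio_to_entities(tokens, tags):
    """Merge BIO-tagged tokens into entity spans with character offsets.

    Assumption: the passage text is the tokens joined by single spaces,
    so offsets are relative to that reconstructed string.
    """
    text = " ".join(tokens)
    spans = []        # each item: [entity_type, start, end]
    open_type = None  # type of the entity currently being extended
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        offset = end + 1  # skip the joining space
        if tag == "O":
            open_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and open_type == entity_type:
            spans[-1][2] = end  # extend the open entity
        else:
            # "B-" tag, or a stray "I-" tag with no matching open entity
            spans.append([entity_type, start, end])
        open_type = entity_type
    entities = [
        {"type": t, "text": text[s:e], "offsets": (s, e)} for t, s, e in spans
    ]
    return text, entities


tokens = ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation"]
tags = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein", "O"]
text, entities = bio_to_entities(tokens, tags)
for entity in entities:
    print(entity)
```

With this approach "IL-2 gene" becomes one DNA entity rather than two single-token entities.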

shamikbose commented 2 years ago

self-assign

shamikbose commented 2 years ago

@jason-fries Am I reading this wrong or is this dataset loader using itself?

jason-fries commented 2 years ago

Looks like it. I would look at the datasets hub implementation and copy over any code that supports reading the source schema directly vs. calling datasets like this.

shamikbose commented 2 years ago

@jason-fries Here's an example datapoint from the corpus:

###MEDLINE:95369245

IL-2    B-DNA
gene    I-DNA
expression  O
and O
NF-kappa    B-protein
B   I-protein
activation  O
through O
CD28    B-protein
requires    O
reactive    O
oxygen  O
production  O
by  O
5-lipoxygenase  B-protein
.   O

What should be in passages and entities in the kb schema? Looking through some training examples and comparing them with Medline, it seems that the ###MEDLINE line gives the Medline ID, the first sentence is the title, and the remaining sentences make up the abstract. Should the title and abstract be reconstructed to stay consistent with other PubMed datasets? As an aside, the datasets implementation of this also seems to be wrong: there are two files in the corpus with the same information, but the datasets implementation reads both of them and inserts them into the dataset, so the information is duplicated.
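A rough sketch of how the raw file could be grouped into documents under the reading described above. It assumes each `###MEDLINE:` line starts a new document, blank lines separate sentences, and each remaining line is a whitespace-separated token/tag pair. Treating the first sentence as the title is the hypothesis from this comment, not something confirmed by the corpus documentation. The names `read_documents` and `split_title_abstract` are hypothetical.

```python
def read_documents(lines):
    """Group raw lines into documents, each holding a list of sentences.

    Each sentence is a list of (token, tag) pairs.
    """
    docs, sentence = [], []

    def close_sentence():
        # Tokens seen before any ###MEDLINE header are dropped.
        if sentence and docs:
            docs[-1]["sentences"].append(list(sentence))
        sentence.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("###MEDLINE:"):
            close_sentence()
            docs.append({"id": line.split(":", 1)[1], "sentences": []})
        elif not line:
            close_sentence()
        else:
            fields = line.split()
            if len(fields) == 2:  # skip malformed lines
                sentence.append((fields[0], fields[1]))
    close_sentence()
    return docs


def split_title_abstract(doc):
    """Assumption: first sentence is the title, the rest is the abstract."""
    return doc["sentences"][:1], doc["sentences"][1:]


sample = """###MEDLINE:95369245

IL-2\tB-DNA
gene\tI-DNA
expression\tO

CD28\tB-protein
requires\tO
""".splitlines()

docs = read_documents(sample)
title, abstract = split_title_abstract(docs[0])
print(docs[0]["id"], len(title), len(abstract))
```

Keeping the sentences grouped per document makes it possible to emit a title passage and an abstract passage later, if that layout is the one chosen for the kb schema.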

galtay commented 2 years ago

this is an example of a bigbio dataset loader that attempted to start with the existing huggingface datasets implementation and then modify it. there was a full discussion in the PR ... let me see if I can track it down ... [EDIT] it's here: https://github.com/bigscience-workshop/biomedical/pull/589

initially we were attempting to leverage the existing implementation by using it directly ... now I think it would be cleaner (as jason said) to use the fundamentals of the code from the HF datasets implementation but not directly "load with HF datasets and then modify"

shamikbose commented 2 years ago

@galtay Thank you for that link. That's really helpful! I'm going to build it out like the one you outlined in this comment.

jason-fries commented 2 years ago

@shamikbose following the guide @galtay outlined will work great. One request -- make certain you are loading the raw JNLPBA annotated data available on the GENIA BioNLP / JNLPBA Shared Task website (http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004) and not wrapping the datasets dataloader.

shamikbose commented 2 years ago

Yeah, I’m reusing the code used in the datasets dataloader to download the raw data from the website.
