Open jason-fries opened 2 years ago
@jason-fries Am I reading this wrong or is this dataset loader using itself?
Looks like it. I would look at the datasets hub implementation and copy over any code that supports reading the source schema directly vs. calling datasets like this.
@jason-fries Here's an example datapoint from the corpus:
###MEDLINE:95369245
IL-2 B-DNA
gene I-DNA
expression O
and O
NF-kappa B-protein
B I-protein
activation O
through O
CD28 B-protein
requires O
reactive O
oxygen O
production O
by O
5-lipoxygenase B-protein
. O
What should be in passages and entities in the kb schema?
Looking through some training examples and comparing with Medline, it seems that the ###MEDLINE is the medline id, the first sentence is the title and the rest of the sentences make up the abstract. Should the title and the abstracts be recreated to keep in tune with other PUBMED datasets?
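For illustration, here is a minimal sketch of reading the raw format shown above into per-abstract documents. It assumes that each `###MEDLINE:` line starts a new document, that a blank line ends a sentence, and that the tag is the last whitespace-separated field on a line; the function name and the dict layout are hypothetical, not taken from any existing loader.

```python
from typing import Dict, Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB tag) pairs


def read_documents(lines: Iterable[str]) -> List[Dict]:
    """Group token/tag lines into documents keyed by their ###MEDLINE id."""
    docs: List[Dict] = []
    sentence: Sentence = []

    def flush() -> None:
        # Close the sentence in progress and attach it to the current document.
        # Tokens seen before any ###MEDLINE header are dropped.
        if sentence and docs:
            docs[-1]["sentences"].append(list(sentence))
        sentence.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("###MEDLINE:"):
            flush()
            docs.append({"id": line.split(":", 1)[1], "sentences": []})
        elif not line:
            flush()  # assumption: a blank line ends a sentence
        else:
            # Assumption: the tag is the last whitespace-separated field.
            token, tag = line.rsplit(None, 1)
            sentence.append((token, tag))
    flush()
    return docs
```

If the first-sentence-is-the-title reading holds, `doc["sentences"][0]` would be the title and `doc["sentences"][1:]` the abstract.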
As an aside, the datasets implementation of this also seems to be wrong. There are two files in the corpus with the same information, but the datasets implementation reads both of them and inserts them into the dataset, meaning there's repeated information.
This is an example of a bigbio dataset loader that attempted to start with the existing huggingface datasets implementation and then modify it. There was a full discussion in the PR ... let me see if I can track it down ... [EDIT] it's here: https://github.com/bigscience-workshop/biomedical/pull/589
Initially we were attempting to leverage the existing implementation by using it directly ... now I think it would be cleaner (as jason said) to use the fundamentals of the code from the HF datasets implementation but not directly "load with HF datasets and then modify".
@galtay Thank you for that link. That's really helpful! I'm going to build it out like the one you outlined in this comment.
@shamikbose following the guide @galtay outlined will work great. One request -- make certain you are loading the raw JNLPBA annotated data available on the GENIA BioNLP / JNLPBA Shared Task website (http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004) and not wrapping the datasets dataloader.
Yeah, I'm reusing the code used in the datasets dataloader to download the raw data from the website.
Currently JNLPBA is set up such that every token and tag is a single entity. This is not the correct setup for this task/schema -- we need to create passages with entity spans.
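A rough sketch of what "entity spans" could mean here: merge each B-/I- run into one entity and record character offsets into the sentence text. Joining tokens with single spaces is an assumption (the tokenized file does not preserve the original whitespace), and the function name and dict keys are illustrative only, not the kb schema's actual field names.

```python
from typing import Dict, List, Optional, Tuple


def sentence_to_spans(tagged: List[Tuple[str, str]]) -> Tuple[str, List[Dict]]:
    """Join tokens with single spaces and merge B-/I- runs into entity spans."""
    tokens: List[str] = []
    entities: List[Dict] = []
    current: Optional[Dict] = None  # entity being built, or None outside one
    cursor = 0  # character offset of the next token in the joined text

    for token, tag in tagged:
        start, end = cursor, cursor + len(token)
        prefix, _, label = tag.partition("-")  # "B-DNA" -> ("B", "-", "DNA")
        continues = (
            prefix == "I" and current is not None and current["type"] == label
        )
        if continues:
            current["end"] = end  # extend the open entity
        else:
            if current is not None:
                entities.append(current)  # close the previous entity
                current = None
            if prefix in ("B", "I"):
                # A stray I- tag with no matching open entity is treated like B-.
                current = {"type": label, "start": start, "end": end}
        tokens.append(token)
        cursor = end + 1  # +1 for the joining space

    if current is not None:
        entities.append(current)

    text = " ".join(tokens)
    for entity in entities:
        entity["text"] = text[entity["start"]:entity["end"]]
    return text, entities
```

On the example datapoint earlier in this thread, this would give a single DNA entity "IL-2 gene" rather than two separate one-token entities, and each sentence text could then be stored as a passage.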