bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create a dataset loader for CRAFT #60

Open hakunanatasha opened 2 years ago

hakunanatasha commented 2 years ago

Colorado Richly Annotated Full-Text (CRAFT) Corpus

https://github.com/UCDenver-ccp/CRAFT

uzaymacar commented 2 years ago

self-assign

hakunanatasha commented 2 years ago

Hi @uzaymacar, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8. No worries if you are not finished but intend to keep working on this. Please either ping me here at @hakunanatasha or ping the discord admins (with @admins).

uzaymacar commented 2 years ago

Hey @hakunanatasha, yes I am still working on this! I am planning to follow up with a PR by mid-next week.

hakunanatasha commented 2 years ago

@uzaymacar awesome! Feel free to ping me here, via your PR, or on the discord for help! I'm looking forward to your submission :cherry_blossom:

davidkartchner commented 2 years ago

self-assign

shamikbose commented 2 years ago

self-assign

shamikbose commented 2 years ago

@jason-fries There are multiple versions of this. I'm using 5.0.0, which is the latest one.

jason-fries commented 2 years ago

SGTM -- just make certain the versioning is reflected in the data loader metadata.

shamikbose commented 2 years ago

Hi @jason-fries @galtay @ruisi-su I think I'm starting to understand the CRAFT dataset. I have a few questions:

  1. From what I can understand, this dataset supports Tasks.COREF and Tasks.NER. Please let me know if there are other tasks it supports.

  2. Corefs are somewhat tricky. There are multiple annotations of the same thing. How should that be handled? Here's an example:

        <annotation annotator="Annotator" id="1" type="identity">
            <class id="IDENTITY chain" label="IDENTITY chain"/>
            <span end="71" id="11532192-2" start="65">strain</span>
        </annotation>
        <annotation annotator="CCP Colorado Computational Pharmacology, UC Denver" id="11532192SHM_Instance_150000" type="identity">
            <class id="Noun Phrase" label="Noun Phrase"/>
            <span end="71" id="11532192-3" start="65">strain</span>
        </annotation>
  3. The NER seems to be pretty straightforward, but just to clarify, the covered types are as follows:

    • CHEBI
    • CL
    • GO_BP
    • GO_CC
    • GO_MF
    • MONDO
    • MOP
    • NCBITaxon
    • PR
    • SO
    • UBERON
  4. There are also structural annotations, but I'm not sure which task they would map to in the bigbio schema. Do they need to be implemented?
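
To make question 2 concrete, here is a minimal sketch of one possible way to look at the overlap: group the annotations by character span and collect the class labels that land on each span. This is only an illustration, not the loader's actual handling. The `<annotations>` root element and the `group_by_span` helper are hypothetical names introduced for this example, and the sample only contains the two annotations quoted above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two annotations quoted in question 2, wrapped in a hypothetical
# <annotations> root so they parse as a single XML document.
SAMPLE = """<annotations>
  <annotation annotator="Annotator" id="1" type="identity">
    <class id="IDENTITY chain" label="IDENTITY chain"/>
    <span end="71" id="11532192-2" start="65">strain</span>
  </annotation>
  <annotation annotator="CCP Colorado Computational Pharmacology, UC Denver" id="11532192SHM_Instance_150000" type="identity">
    <class id="Noun Phrase" label="Noun Phrase"/>
    <span end="71" id="11532192-3" start="65">strain</span>
  </annotation>
</annotations>"""


def group_by_span(xml_text):
    """Map (start, end, text) to the list of class labels annotated on that span."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for annotation in root.iter("annotation"):
        label = annotation.find("class").get("label")
        for span in annotation.iter("span"):
            key = (int(span.get("start")), int(span.get("end")), span.text)
            grouped[key].append(label)
    return dict(grouped)


groups = group_by_span(SAMPLE)
for (start, end, text), labels in groups.items():
    print(start, end, text, labels)
```

Grouping this way shows that both annotations cover the same offsets (65 to 71), so whichever handling is chosen, the span itself would only need to be emitted once.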

shamikbose commented 2 years ago

@ruisi-su This is implemented as a local dataset in #681, since download_and_extract() doesn't seem to work properly with the archive containing the dataset.

mariosaenger commented 1 year ago

@shamikbose Are you still working on that?

shamikbose commented 1 year ago

@mariosaenger This is already implemented as a local dataset in #681. It's awaiting review.