bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #907 #906

Closed GullyBurns closed 7 months ago

GullyBurns commented 7 months ago

Name: CZI Disease Research State Model

Description: Research article document classification dataset based on aspects of disease research. Currently, the dataset consists of three subsets: (A) classifies titles/abstracts of papers into the most popular subtypes of clinical, basic, and translational papers (~20k papers); (B) identifies whether the title/abstract of a paper describes substantive research into Quality of Life (~10k papers); (C) identifies whether a paper is a natural history study (~10k papers). These classifications are particularly relevant in rare disease research, a field that is generally understudied.

Task: Document classification for types of research experiments

Paper: In preparation

Data: https://github.com/chanzuckerberg/DRSM-corpus/

License: CC0

Motivation: (1) These are medium/large human-curated corpora (>10K); (2) they address an understudied, high-value subfield (rare disease); (3) this forms the basis of a new collaboration between NCATS and CZI, and is likely to be an expanding set as more work is done.