bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create a dataset loader for CBLUE (Chinese Biomedical Language Understanding Evaluation Benchmark) #70

Open hakunanatasha opened 2 years ago

hakunanatasha commented 2 years ago

From https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us

jason-fries commented 2 years ago

I believe the data is also available at https://github.com/CBLUEbenchmark/CBLUE if this helps! The dataset license (per the homepage above) is:

The CBLUE datasets is distributed under CC BY-NC-SA 4.0.

gagan3012 commented 2 years ago

self-assign

hakunanatasha commented 2 years ago

@gagan3012 Let us know if you need help with this repo! We're available on Discord or via chat here.

gagan3012 commented 2 years ago

Hello, I am waiting to get access to the dataset.

jason-fries commented 2 years ago

Hi @gagan3012, can you confirm that https://github.com/CBLUEbenchmark/CBLUE is not the dataset?

gagan3012 commented 2 years ago

The dataset is not available in the GitHub repository; we need to download it from https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414

rosalinesway commented 2 years ago

@gagan3012 Have you tried the "Apply for Dataset" button in the title section of the page at that URL? The only prerequisite for applying is probably registering an account. Let me know if you run into problems in the process. Below are the details provided at the download link:

Click "apply" button and fill in necessary information. After confirming "Terms of Use", your application will be reviewed automatically within 7 days. If agreed, you can download data directly. If rejected, you can apply again according to feedback information.

jason-fries commented 2 years ago

Hi @gagan3012 Just a ping on the status of this dataset. Please let us know if you are still working on it and when you plan to submit a PR. Thanks!!