bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create dataset loader for PubMedQA #25

Closed hakunanatasha closed 1 year ago

hakunanatasha commented 2 years ago

From https://github.com/pubmedqa/pubmedqa

nomisto commented 2 years ago

Found it here: https://huggingface.co/datasets/pubmed_qa, though without the official splits.

SamuelCahyawijaya commented 2 years ago

self-assign

hakunanatasha commented 2 years ago

@nomisto good catch - I think we'll implement it with the official splits. There is a small number of datasets that currently overlap with the original library.

@jason-fries @galtay @leonweber thoughts?

hakunanatasha commented 2 years ago

@SamuelCahyawijaya can you let us know if you still intend to work on this? We'd like to update our project board. Please let us know by Friday, April 8, so we can plan accordingly. You can ping me in a comment via @hakunanatasha or on Discord with @admins

SamuelCahyawijaya commented 2 years ago

@hakunanatasha : Yes, I am actually working on this one right now, and I've found a problem with PQA-L(abeled): the official split on the GitHub link above is actually a 10-fold CV over the training & dev sets. Should I provide only a single split (combining both train & dev), or separate splits for each fold?

hakunanatasha commented 2 years ago

Hi @SamuelCahyawijaya

For multiple splits, I see two default approaches:

(1) Create a source/bigbio config for the combined splits (so 1 train/dev set, I think)

(2) Create a source/bigbio config for each fold individually

BioASQ may be a useful example: https://github.com/bigscience-workshop/biomedical/blob/0e35df219519fea9b14c58d26b6e26c81415160f/examples/bioasq.py#L494 (a rough sketch of option (2) follows below).
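
For option (2), something along these lines could work. This is a minimal sketch using plain datasets.BuilderConfig, not the repo's actual template (the BigBio loaders define their own config class with extra fields), so the names and fields here are illustrative only:

import datasets

# One source and one bigbio_qa config per labeled fold
# (PQA-L is distributed as 10-fold cross-validation).
_FOLDS = [f"fold{i}" for i in range(10)]

BUILDER_CONFIGS = []
for fold in _FOLDS:
    for schema in ("source", "bigbio_qa"):
        BUILDER_CONFIGS.append(
            datasets.BuilderConfig(
                name=f"pubmed_qa_labeled_{fold}_{schema}",
                version=datasets.Version("1.0.0"),
                description=f"PubMedQA labeled ({fold}), {schema} schema",
            )
        )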

hakunanatasha commented 2 years ago

Also - I noticed you have a PR open for this; would you mind updating it with the splits? I'll change the reqs at some point too.

SamuelCahyawijaya commented 2 years ago

@hakunanatasha : I see, noted, let me add the 10-fold split then 👍🏻
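
One possible shape for the fold-aware split generation, as a sketch only - the per-fold file layout (pqal_fold{i}/train_set.json and dev_set.json) and the _URLS key are assumptions about the PubMedQA preprocessing output, not the final loader:

import re
import datasets

# Sketch: method of the GeneratorBasedBuilder subclass.
def _split_generators(self, dl_manager):
    # Pull the fold index out of config names like "pubmed_qa_labeled_fold3_source".
    match = re.search(r"fold(\d+)", self.config.name)
    fold = match.group(1) if match else "0"
    data_dir = dl_manager.download_and_extract(_URLS["pqal"])  # hypothetical URL key
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": f"{data_dir}/pqal_fold{fold}/train_set.json"},
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={"filepath": f"{data_dir}/pqal_fold{fold}/dev_set.json"},
        ),
    ]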

sunnnymskang commented 2 years ago

@SamuelCahyawijaya @galtay @hakunanatasha the unit tests are failing on the bigbio_schema; please advise whether this is unexpected behavior. The unit test output is pasted below:

INFO:__main__:args: Namespace(dataloader_path='biodatasets/pubmed_qa/pubmed_qa.py', data_dir=None, config_name=None)
INFO:__main__:all_config_names: ['pubmed_qa_artificial_source', 'pubmed_qa_unlabeled_source', 'pubmed_qa_artificial_bigbio_qa', 'pubmed_qa_unlabeled_bigbio_qa', 'pubmed_qa_labeled_fold0_source', 'pubmed_qa_labeled_fold1_source', 'pubmed_qa_labeled_fold2_source', 'pubmed_qa_labeled_fold3_source', 'pubmed_qa_labeled_fold4_source', 'pubmed_qa_labeled_fold5_source', 'pubmed_qa_labeled_fold6_source', 'pubmed_qa_labeled_fold7_source', 'pubmed_qa_labeled_fold8_source', 'pubmed_qa_labeled_fold9_source', 'pubmed_qa_labeled_fold0_bigbio_qa', 'pubmed_qa_labeled_fold1_bigbio_qa', 'pubmed_qa_labeled_fold2_bigbio_qa', 'pubmed_qa_labeled_fold3_bigbio_qa', 'pubmed_qa_labeled_fold4_bigbio_qa', 'pubmed_qa_labeled_fold5_bigbio_qa', 'pubmed_qa_labeled_fold6_bigbio_qa', 'pubmed_qa_labeled_fold7_bigbio_qa', 'pubmed_qa_labeled_fold8_bigbio_qa', 'pubmed_qa_labeled_fold9_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_artificial_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_artificial_source
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_artificial_source to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_artificial_source/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
Downloading: 2.21kB [00:00, 715kB/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/6d6d623774b8015a704724c6ab74b78515b2a3a5376e6caee3ef9525dfb60eee/pqaa_train_set.json'

----------------------------------------------------------------------
Ran 1 test in 3.657s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_unlabeled_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_unlabeled_source
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_unlabeled_source to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_unlabeled_source/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
Downloading: 2.21kB [00:00, 204kB/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/c7dce2d93387f8e9dd420b3e238dff21faeddafcf933f4f02b8b619a9d73b242/ori_pqau.json'

----------------------------------------------------------------------
Ran 1 test in 0.532s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_artificial_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_artificial_bigbio_qa
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_artificial_bigbio_qa to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_artificial_bigbio_qa/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/6d6d623774b8015a704724c6ab74b78515b2a3a5376e6caee3ef9525dfb60eee/pqaa_train_set.json'

----------------------------------------------------------------------
Ran 1 test in 0.136s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_unlabeled_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_unlabeled_bigbio_qa
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_unlabeled_bigbio_qa to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_unlabeled_bigbio_qa/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/c7dce2d93387f8e9dd420b3e238dff21faeddafcf933f4f02b8b619a9d73b242/ori_pqau.json'

----------------------------------------------------------------------
Ran 1 test in 0.140s

FAILED (errors=1)
SamuelCahyawijaya commented 2 years ago

Hi @sunnnymskang, just wondering whether this is caused by the datasets package version. We had a similar issue before with the pqaa and pqau data splits, which we discussed here.

We need to upgrade the dependency to datasets>=2.0.0, since older versions of the datasets package have a bug with Google Drive link downloads, as mentioned here. Could you confirm that you tested with datasets>=2.0.0? (A quick version check is sketched below.)

If the problem persists even with the correct datasets version, I can investigate this further on Wednesday.
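
For reference, a quick way to confirm the installed datasets version meets the >=2.0.0 requirement before re-running the tests (just a sanity-check snippet, not part of the test suite):

import datasets
from packaging import version

# Fail loudly if the environment still has a pre-2.0.0 datasets install.
assert version.parse(datasets.__version__) >= version.parse("2.0.0"), (
    f"datasets=={datasets.__version__} is too old; "
    "run: pip install -U 'datasets>=2.0.0'"
)
print(f"datasets {datasets.__version__} looks fine")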

hakunanatasha commented 1 year ago

PubMedQA is now in the dataloaders.