Closed by hakunanatasha 1 year ago
Found here: https://huggingface.co/datasets/pubmed_qa, but without the official splits.
@nomisto good catch - I think we'll implement it with the official splits. A small number of datasets currently overlap with the original library.
@jason-fries @galtay @leonweber thoughts?
@SamuelCahyawijaya can you let us know if you still intend to work on this? We'd like to update our project board. Please let us know by Friday, April 8, so we can plan accordingly. You can ping me in a comment via @hakunanatasha or on Discord with @admins
@hakunanatasha : Yes, I am actually working on this one right now, and I found a problem with PQA-L(abelled): the official split at the GitHub link above is actually a 10-fold CV over the training & dev sets. Should I use only a single split (combining both train & dev), or should I provide different splits for each fold?
Hi @SamuelCahyawijaya
For multiple splits, the default approach I see is:
(1) Create a source/bigbio for the combined splits (so 1 train/dev set, I think)
(2) Create source/bigbio for each split individually
Bioasq may be a useful example https://github.com/bigscience-workshop/biomedical/blob/0e35df219519fea9b14c58d26b6e26c81415160f/examples/bioasq.py#L494
Also - I noticed you have a PR open for this, would you mind updating with the splits? I'll change the reqs at some point too.
@hakunanatasha : I see, noted, let me add the 10-fold split then 👍🏻
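For reference, the per-fold layout discussed above could be enumerated roughly as follows. This is only a sketch: the helper name `pubmed_qa_config_names` is hypothetical, and the naming pattern is inferred from the config names printed in the unit test output in this thread, not taken from the dataloader itself.

```python
from typing import List


def pubmed_qa_config_names(n_folds: int = 10) -> List[str]:
    """Enumerate config names: one per subset and schema, with one labeled subset per CV fold.

    Hypothetical helper; the pattern mirrors names such as
    'pubmed_qa_labeled_fold0_source' seen in the test log.
    """
    # The artificial and unlabeled subsets have no folds; the labeled subset has n_folds.
    subsets = ["artificial", "unlabeled"] + [f"labeled_fold{i}" for i in range(n_folds)]
    schemas = ["source", "bigbio_qa"]
    return [f"pubmed_qa_{subset}_{schema}" for schema in schemas for subset in subsets]


names = pubmed_qa_config_names()
print(len(names))
```

With 10 folds this yields 24 names (12 subsets times 2 schemas); the ordering here is arbitrary and need not match the order the dataloader reports.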
@SamuelCahyawijaya @galtay @hakunanatasha the unit tests are failing on bigbio_schema; please advise whether this is unexpected behavior. The unit test output is pasted below.
INFO:__main__:args: Namespace(dataloader_path='biodatasets/pubmed_qa/pubmed_qa.py', data_dir=None, config_name=None)
INFO:__main__:all_config_names: ['pubmed_qa_artificial_source', 'pubmed_qa_unlabeled_source', 'pubmed_qa_artificial_bigbio_qa', 'pubmed_qa_unlabeled_bigbio_qa', 'pubmed_qa_labeled_fold0_source', 'pubmed_qa_labeled_fold1_source', 'pubmed_qa_labeled_fold2_source', 'pubmed_qa_labeled_fold3_source', 'pubmed_qa_labeled_fold4_source', 'pubmed_qa_labeled_fold5_source', 'pubmed_qa_labeled_fold6_source', 'pubmed_qa_labeled_fold7_source', 'pubmed_qa_labeled_fold8_source', 'pubmed_qa_labeled_fold9_source', 'pubmed_qa_labeled_fold0_bigbio_qa', 'pubmed_qa_labeled_fold1_bigbio_qa', 'pubmed_qa_labeled_fold2_bigbio_qa', 'pubmed_qa_labeled_fold3_bigbio_qa', 'pubmed_qa_labeled_fold4_bigbio_qa', 'pubmed_qa_labeled_fold5_bigbio_qa', 'pubmed_qa_labeled_fold6_bigbio_qa', 'pubmed_qa_labeled_fold7_bigbio_qa', 'pubmed_qa_labeled_fold8_bigbio_qa', 'pubmed_qa_labeled_fold9_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_artificial_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_artificial_source
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_artificial_source to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_artificial_source/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
Downloading: 2.21kB [00:00, 715kB/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/6d6d623774b8015a704724c6ab74b78515b2a3a5376e6caee3ef9525dfb60eee/pqaa_train_set.json'
----------------------------------------------------------------------
Ran 1 test in 3.657s
FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_unlabeled_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_unlabeled_source
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_unlabeled_source to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_unlabeled_source/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
Downloading: 2.21kB [00:00, 204kB/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/c7dce2d93387f8e9dd420b3e238dff21faeddafcf933f4f02b8b619a9d73b242/ori_pqau.json'
----------------------------------------------------------------------
Ran 1 test in 0.532s
FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_artificial_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_artificial_bigbio_qa
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_artificial_bigbio_qa to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_artificial_bigbio_qa/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/6d6d623774b8015a704724c6ab74b78515b2a3a5376e6caee3ef9525dfb60eee/pqaa_train_set.json'
----------------------------------------------------------------------
Ran 1 test in 0.136s
FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_unlabeled_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_unlabeled_bigbio_qa
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_unlabeled_bigbio_qa to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_unlabeled_bigbio_qa/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/c7dce2d93387f8e9dd420b3e238dff21faeddafcf933f4f02b8b619a9d73b242/ori_pqau.json'
----------------------------------------------------------------------
Ran 1 test in 0.140s
FAILED (errors=1)
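A note on reading the log above: `[Errno 20] Not a directory` is what a POSIX system reports when a path component that should be a directory is a regular file. One plausible reading, offered as an assumption rather than a confirmed diagnosis, is that the hashed entry under `downloads/` is a single downloaded file and not an extracted directory. The sketch below reproduces that failure mode with temporary files; the file names are placeholders and `parent_is_regular_file` is a hypothetical helper.

```python
import errno
import os
import tempfile


def parent_is_regular_file(path: str) -> bool:
    """Return True if the parent of `path` exists but is a regular file, not a directory."""
    return os.path.isfile(os.path.dirname(path))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a cache entry that was downloaded as a plain file (placeholder name).
    blob = os.path.join(tmp, "downloaded_blob")
    with open(blob, "wb") as fh:
        fh.write(b"not an extracted archive")

    # Treating the blob as a directory and looking for a JSON file inside it.
    target = os.path.join(blob, "pqaa_train_set.json")
    try:
        open(target).close()
        reproduced = False
    except OSError as err:
        # On POSIX systems this is ENOTDIR, i.e. errno 20 on Linux and macOS.
        reproduced = err.errno == errno.ENOTDIR

    parent_is_file = parent_is_regular_file(target)

print(reproduced, parent_is_file)
```

Running `parent_is_regular_file` on the path from the traceback would show whether the cache entry is a file; if it is, the download was likely not extracted as the dataloader expected.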
Hi @sunnnymskang, I'm wondering whether this is caused by the datasets package version. We previously had a similar issue with the pqaa and pqau data splits, which we discussed here.
We need to upgrade the dependency to datasets>=2.0.0, since there is a bug in the datasets package with the Google Drive link download, as mentioned here. Could you confirm that you tested using datasets>=2.0.0?
If the problem remains even with the correct datasets version, I can investigate this issue further this Wednesday.
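A quick way to confirm the installed version before rerunning the tests might look like the sketch below. It uses only the standard library; `parse_release` and `meets_minimum` are hypothetical helpers, and the parsing is deliberately simplified (it ignores pre-release suffixes such as `rc1`).

```python
from importlib import metadata
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components of a version string, e.g. '2.1.0.dev0' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "2.0.0") -> bool:
    """Return True if `installed` is at least `minimum`, comparing numeric components only."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = metadata.version("datasets")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("datasets is not installed in this environment")
else:
    print(f"datasets {installed}: meets >=2.0.0? {meets_minimum(installed)}")
```

If this reports a 1.x version, upgrading with `pip install "datasets>=2.0.0"` and clearing the cached download would be the first thing to try.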
PubmedQA in dataloaders.
From https://github.com/pubmedqa/pubmedqa