bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #863 #866

nachollorca closed 1 year ago

nachollorca commented 1 year ago

Closes #863


nachollorca commented 1 year ago

The output of python -m tests.test_bigbio_hub ggponc2 --data_dir /path/to/local/data --test_local:

INFO:__main__:args: Namespace(dataset_name='ggponc2', data_dir='C:\\Users\\admin\\ggponc_annotation\\data', config_name=None, bypass_splits=[], bypass_keys=[], bypass_split_key_pairs=[], test_local=True)
INFO:__main__:Running (Local) Unit Test
INFO:__main__:all_config_names: ['ggponc2_fine_long_source', 'ggponc2_fine_short_source', 'ggponc2_coarse_long_source', 'ggponc2_coarse_short_source', 'ggponc2_fine_long_bigbio_kb', 'ggponc2_fine_short_bigbio_kb', 'ggponc2_coarse_long_bigbio_kb', 'ggponc2_coarse_short_bigbio_kb']
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_fine_long_source
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_fine_long_source
WARNING:datasets.builder:Using custom data configuration ggponc2_fine_long_source-2efdcaf275cdedd5
Dataset =  ggponc2
DatasetModule(module_path='datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2', hash='9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e', builder_kwargs={'hash': '9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e', 'base_path': 'bigbio\\hub\\hub_repos\\ggponc2'})
Downloading and preparing dataset ggponc2/ggponc2_fine_long_source to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_long_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:02,  2.25s/ examples]
Generating train split: 32 examples [00:02, 18.90 examples/s]
...
Generating train split: 7128 examples [00:25, 266.00 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:01,  1.99s/ examples]
Generating test split: 13 examples [00:02,  8.41 examples/s]
...
Generating test split: 1528 examples [00:21, 75.19 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:02,  2.14s/ examples]
Generating validation split: 10 examples [00:02,  5.98 examples/s]
...
Generating validation split: 1528 examples [00:21, 68.51 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_long_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 45.30it/s]
INFO:__main__:schema = source
.
----------------------------------------------------------------------
Ran 1 test in 70.049s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_fine_short_source
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_fine_short_source
WARNING:datasets.builder:Using custom data configuration ggponc2_fine_short_source-2efdcaf275cdedd5
Downloading and preparing dataset ggponc2/ggponc2_fine_short_source to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_short_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:02,  2.37s/ examples]
Generating train split: 34 examples [00:02, 19.07 examples/s]
...
Generating train split: 7135 examples [00:26, 193.68 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:02,  2.25s/ examples]
Generating test split: 13 examples [00:02,  7.47 examples/s]
...
Generating test split: 1528 examples [00:21, 74.01 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:02,  2.26s/ examples]
Generating validation split: 10 examples [00:02,  5.66 examples/s]
...
Generating validation split: 1528 examples [00:21, 69.56 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_short_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 43.47it/s]
INFO:__main__:schema = source
.
----------------------------------------------------------------------
Ran 1 test in 71.140s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_coarse_long_source
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_coarse_long_source
WARNING:datasets.builder:Using custom data configuration ggponc2_coarse_long_source-2efdcaf275cdedd5
Downloading and preparing dataset ggponc2/ggponc2_coarse_long_source to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_long_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:02,  2.04s/ examples]
Generating train split: 33 examples [00:02, 21.33 examples/s]
...
Generating train split: 7105 examples [00:26, 234.58 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:02,  2.13s/ examples]
Generating test split: 13 examples [00:02,  7.90 examples/s]
...
Generating test split: 1528 examples [00:21, 73.29 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:02,  2.11s/ examples]
Generating validation split: 10 examples [00:02,  6.05 examples/s]
...
Generating validation split: 1528 examples [00:21, 69.39 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_long_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 44.77it/s]
INFO:__main__:schema = source
.
----------------------------------------------------------------------
Ran 1 test in 70.791s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_coarse_short_source
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_coarse_short_source
WARNING:datasets.builder:Using custom data configuration ggponc2_coarse_short_source-2efdcaf275cdedd5
Downloading and preparing dataset ggponc2/ggponc2_coarse_short_source to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_short_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:02,  2.29s/ examples]
Generating train split: 32 examples [00:02, 18.58 examples/s]
...
Generating train split: 7119 examples [00:26, 189.08 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:02,  2.53s/ examples]
Generating test split: 10 examples [00:02,  5.17 examples/s]
...
Generating test split: 1527 examples [00:24, 69.32 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:02,  2.25s/ examples]
Generating validation split: 10 examples [00:02,  5.62 examples/s]
...
Generating validation split: 1528 examples [00:24, 62.84 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_short_source-2efdcaf275cdedd5/2.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 46.58it/s]
INFO:__main__:schema = source
.
----------------------------------------------------------------------
Ran 1 test in 77.060s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_fine_long_bigbio_kb
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_fine_long_bigbio_kb
WARNING:datasets.builder:Using custom data configuration ggponc2_fine_long_bigbio_kb-2efdcaf275cdedd5
Downloading and preparing dataset ggponc2/ggponc2_fine_long_bigbio_kb to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_long_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:02,  2.07s/ examples]
Generating train split: 31 examples [00:02, 19.67 examples/s]
...
Generating train split: 7132 examples [00:27, 245.34 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:02,  2.18s/ examples]
Generating test split: 12 examples [00:02,  7.18 examples/s]
...
Generating test split: 1523 examples [00:22, 72.00 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:01,  1.99s/ examples]
Generating validation split: 10 examples [00:02,  6.33 examples/s]
...
Generating validation split: 1528 examples [00:22, 70.25 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_long_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 45.89it/s]
INFO:__main__:schema = KB
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 47341 unique IDs
INFO:__main__:Gathering dataset statistics
INFO:__main__:Testing schema for: train
INFO:__main__:Testing schema for: test
INFO:__main__:Testing schema for: validation
INFO:__main__:Checking if referenced IDs are properly mapped
INFO:__main__:KB ONLY: Checking passage offsets
INFO:__main__:KB ONLY: Checking entity offsets
INFO:__main__:KB ONLY: multi-label `db_id`
INFO:__main__:KB ONLY: Checking event offsets
INFO:__main__:KB ONLY: Checking coref offsets
INFO:__main__:KB ONLY: multi-label `type` fields
.
----------------------------------------------------------------------
Ran 1 test in 332.240s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_fine_short_bigbio_kb
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_fine_short_bigbio_kb
WARNING:datasets.builder:Using custom data configuration ggponc2_fine_short_bigbio_kb-2efdcaf275cdedd5
train
==========
id: 7135
document_id: 7135
passages: 59515
entities: 153747
normalized: 0
events: 0
coreferences: 0
relations: 0

test
==========
id: 1529
document_id: 1529
passages: 13714
entities: 34414
normalized: 0
events: 0
coreferences: 0
relations: 0

validation
==========
id: 1529
document_id: 1529
passages: 12770
entities: 33042
normalized: 0
events: 0
coreferences: 0
relations: 0

Downloading and preparing dataset ggponc2/ggponc2_fine_short_bigbio_kb to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_short_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:01,  1.31s/ examples]
Generating train split: 31 examples [00:01, 30.03 examples/s]
...
Generating train split: 7135 examples [00:26, 194.71 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:02,  2.49s/ examples]
Generating test split: 12 examples [00:02,  6.33 examples/s]
...
Generating test split: 1527 examples [00:23, 74.14 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:02,  2.38s/ examples]
Generating validation split: 10 examples [00:02,  5.39 examples/s]
...
Generating validation split: 1523 examples [00:22, 68.60 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_fine_short_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 46.14it/s]
INFO:__main__:schema = KB
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 51122 unique IDs
INFO:__main__:Gathering dataset statistics
INFO:__main__:Testing schema for: train
INFO:__main__:Testing schema for: test
INFO:__main__:Testing schema for: validation
INFO:__main__:Checking if referenced IDs are properly mapped
INFO:__main__:KB ONLY: Checking passage offsets
INFO:__main__:KB ONLY: Checking entity offsets
INFO:__main__:KB ONLY: multi-label `db_id`
INFO:__main__:KB ONLY: Checking event offsets
INFO:__main__:KB ONLY: Checking coref offsets
INFO:__main__:KB ONLY: multi-label `type` fields
.
----------------------------------------------------------------------
Ran 1 test in 347.006s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_coarse_long_bigbio_kb
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_coarse_long_bigbio_kb
WARNING:datasets.builder:Using custom data configuration ggponc2_coarse_long_bigbio_kb-2efdcaf275cdedd5
train
==========
id: 7135
document_id: 7135
passages: 59515
entities: 171358
normalized: 0
events: 0
coreferences: 0
relations: 0

test
==========
id: 1529
document_id: 1529
passages: 13714
entities: 38309
normalized: 0
events: 0
coreferences: 0
relations: 0

validation
==========
id: 1529
document_id: 1529
passages: 12770
entities: 36823
normalized: 0
events: 0
coreferences: 0
relations: 0

Downloading and preparing dataset ggponc2/ggponc2_coarse_long_bigbio_kb to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_long_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:01,  1.22s/ examples]
Generating train split: 32 examples [00:01, 33.04 examples/s]
...
Generating train split: 7129 examples [00:25, 254.17 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:02,  2.27s/ examples]
Generating test split: 11 examples [00:02,  6.33 examples/s]
...
Generating test split: 1526 examples [00:22, 71.13 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:02,  2.23s/ examples]
Generating validation split: 10 examples [00:02,  5.69 examples/s]
...
Generating validation split: 1523 examples [00:22, 62.60 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_long_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 46.15it/s]
INFO:__main__:schema = KB
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 47120 unique IDs
INFO:__main__:Gathering dataset statistics
INFO:__main__:Testing schema for: train
INFO:__main__:Testing schema for: test
INFO:__main__:Testing schema for: validation
INFO:__main__:Checking if referenced IDs are properly mapped
INFO:__main__:KB ONLY: Checking passage offsets
INFO:__main__:KB ONLY: Checking entity offsets
INFO:__main__:KB ONLY: multi-label `db_id`
INFO:__main__:KB ONLY: Checking event offsets
INFO:__main__:KB ONLY: Checking coref offsets
INFO:__main__:KB ONLY: multi-label `type` fields
.
----------------------------------------------------------------------
Ran 1 test in 329.313s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/ggponc2/ggponc2.py
INFO:__main__:self.CONFIG_NAME: ggponc2_coarse_short_bigbio_kb
INFO:__main__:self.DATA_DIR: C:\Users\admin\ggponc_annotation\data
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.ggponc2.9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e.ggponc2' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\ggponc2\\9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e\\ggponc2.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name ggponc2_coarse_short_bigbio_kb
WARNING:datasets.builder:Using custom data configuration ggponc2_coarse_short_bigbio_kb-2efdcaf275cdedd5
train
==========
id: 7135
document_id: 7135
passages: 59515
entities: 152702
normalized: 0
events: 0
coreferences: 0
relations: 0

test
==========
id: 1529
document_id: 1529
passages: 13714
entities: 34188
normalized: 0
events: 0
coreferences: 0
relations: 0

validation
==========
id: 1529
document_id: 1529
passages: 12770
entities: 32821
normalized: 0
events: 0
coreferences: 0
relations: 0

Downloading and preparing dataset ggponc2/ggponc2_coarse_short_bigbio_kb to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_short_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:01,  1.32s/ examples]
Generating train split: 32 examples [00:01, 30.74 examples/s]
...
Generating train split: 7118 examples [00:26, 237.60 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
Generating test split: 1 examples [00:02,  2.51s/ examples]
Generating test split: 13 examples [00:02,  6.74 examples/s]
...
Generating test split: 1527 examples [00:22, 76.71 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Generating validation split: 1 examples [00:02,  2.34s/ examples]
Generating validation split: 10 examples [00:02,  5.48 examples/s]
...
Generating validation split: 1523 examples [00:22, 69.63 examples/s]

Dataset ggponc2 downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/ggponc2/ggponc2_coarse_short_bigbio_kb-2efdcaf275cdedd5/1.0.0/9c40f95f41a5efa136a8360a5d8e4dc31d6e9d6601246122ea6dad3d4e916a4e. Subsequent calls will reuse this data.

  0%|          | 0/3 [00:00<?, ?it/s]
100%|##########| 3/3 [00:00<00:00, 41.65it/s]
INFO:__main__:schema = KB
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 51122 unique IDs
INFO:__main__:Gathering dataset statistics
INFO:__main__:Testing schema for: train
INFO:__main__:Testing schema for: test
INFO:__main__:Testing schema for: validation
INFO:__main__:Checking if referenced IDs are properly mapped
INFO:__main__:KB ONLY: Checking passage offsets
INFO:__main__:KB ONLY: Checking entity offsets
INFO:__main__:KB ONLY: multi-label `db_id`
INFO:__main__:KB ONLY: Checking event offsets
INFO:__main__:KB ONLY: Checking coref offsets
INFO:__main__:KB ONLY: multi-label `type` fields
.
----------------------------------------------------------------------
Ran 1 test in 346.169s

OK
train
==========
id: 7135
document_id: 7135
passages: 59515
entities: 171358
normalized: 0
events: 0
coreferences: 0
relations: 0

test
==========
id: 1529
document_id: 1529
passages: 13714
entities: 38309
normalized: 0
events: 0
coreferences: 0
relations: 0

validation
==========
id: 1529
document_id: 1529
passages: 12770
entities: 36823
normalized: 0
events: 0
coreferences: 0
relations: 0
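
For anyone skimming the log above, here is a minimal sketch of how the per-config results could be pulled out of such output. The regular expressions and the `summarize_log` helper are assumptions based only on the lines shown in this log; they are not part of the bigbio test suite.

```python
import re

# Patterns are assumptions derived from the log lines shown above.
CONFIG_RE = re.compile(r"self\.CONFIG_NAME: (\S+)")
IDS_RE = re.compile(r"Found (\d+) unique IDs")
RAN_RE = re.compile(r"Ran (\d+) tests? in ([\d.]+)s")


def summarize_log(text):
    """Return one dict per config: name, unique ID count (if logged), runtime, status."""
    results = []
    current = None
    for line in text.splitlines():
        match = CONFIG_RE.search(line)
        if match:
            # A new CONFIG_NAME line starts the section for the next config.
            current = {
                "config": match.group(1),
                "unique_ids": None,
                "seconds": None,
                "ok": False,
            }
            results.append(current)
            continue
        if current is None:
            continue
        match = IDS_RE.search(line)
        if match:
            current["unique_ids"] = int(match.group(1))
            continue
        match = RAN_RE.search(line)
        if match:
            current["seconds"] = float(match.group(2))
            continue
        # unittest prints a bare "OK" after the "Ran ..." line on success.
        if line.strip() == "OK" and current["seconds"] is not None:
            current["ok"] = True
    return results


# A short excerpt in the same shape as the log above.
sample = """\
INFO:__main__:self.CONFIG_NAME: ggponc2_fine_long_source
INFO:__main__:schema = source
Ran 1 test in 70.049s

OK
INFO:__main__:self.CONFIG_NAME: ggponc2_fine_long_bigbio_kb
INFO:__main__:Found 47341 unique IDs
Ran 1 test in 332.240s

OK
"""

for row in summarize_log(sample):
    print(row)
```

Because the dataset statistics blocks reach the log out of step with the INFO lines, this sketch deliberately ignores them and relies only on the INFO lines and the unittest summary.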

nachollorca commented 1 year ago

Is there anything else left to do here, @galtay?