bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #865 #870

Closed: nachollorca closed this pull request 1 year ago

nachollorca commented 1 year ago

Closes #865. This PR implements a loader for the BRONCO150 dataset. It touches on two of the matters currently under discussion in the Discord group: unary relations and K-fold train splits.
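On the K-fold point, one way a datasets loader can expose pre-defined folds is as individually named train splits, roughly as in the sketch below (a minimal sketch, not the actual bronco.py code; the fold count and file names are placeholders):

    import datasets

    class BroncoSketch(datasets.GeneratorBasedBuilder):
        # hypothetical builder illustrating only the split logic;
        # BUILDER_CONFIGS, _info and _generate_examples omitted for brevity
        def _split_generators(self, dl_manager):
            # one named train split per pre-defined fold, so users can
            # assemble any K-fold train/dev combination downstream
            return [
                datasets.SplitGenerator(
                    name=datasets.NamedSplit(f"train_fold_{fold}"),
                    gen_kwargs={"filepath": f"{self.config.data_dir}/fold_{fold}.xml"},
                )
                for fold in range(1, 6)
            ]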

Local unit test output:

INFO:__main__:args: Namespace(dataset_name='bronco', data_dir='C:\\Users\\admin\\Desktop\\BRONCO150', config_name=None, bypass_splits=[], bypass_keys=[], bypass_split_key_pairs=[], test_local=True)
INFO:__main__:Running (Local) Unit Test
INFO:__main__:all_config_names: ['bronco_source', 'bronco_bigbio_kb']
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/bronco/bronco.py
INFO:__main__:self.CONFIG_NAME: bronco_source
INFO:__main__:self.DATA_DIR: C:\Users\admin\Desktop\BRONCO150
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.bronco.3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7.bronco' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\bronco\\3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7\\bronco.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION', 'NAMED_ENTITY_DISAMBIGUATION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name bronco_source
WARNING:datasets.builder:Using custom data configuration bronco_source-7a1d288ef98b5779
Dataset =  bronco
DatasetModule(module_path='datasets_modules.datasets.bronco.3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7.bronco', hash='3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7', builder_kwargs={'hash': '3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7', 'base_path': 'bigbio\\hub\\hub_repos\\bronco'})
Downloading and preparing dataset bronco/bronco_source to C:/Users/admin/.cache/huggingface/datasets/bronco/bronco_source-7a1d288ef98b5779/1.0.0/3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:01,  1.19s/ examples]
Generating train split: 5 examples [00:01,  4.97 examples/s]

Dataset bronco downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/bronco/bronco_source-7a1d288ef98b5779/1.0.0/3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7. Subsequent calls will reuse this data.

  0%|          | 0/1 [00:00<?, ?it/s]
100%|##########| 1/1 [00:00<00:00, 51.78it/s]
INFO:__main__:schema = source
.
----------------------------------------------------------------------
Ran 1 test in 1.912s

OK
INFO:__main__:self.DATASET_NAME: bigbio/hub/hub_repos/bronco/bronco.py
INFO:__main__:self.CONFIG_NAME: bronco_bigbio_kb
INFO:__main__:self.DATA_DIR: C:\Users\admin\Desktop\BRONCO150
INFO:__main__:importing module .... 
INFO:__main__:imported module <module 'datasets_modules.datasets.bronco.3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7.bronco' from 'C:\\Users\\admin\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\bronco\\3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7\\bronco.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=['NAMED_ENTITY_RECOGNITION', 'NAMED_ENTITY_DISAMBIGUATION']
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB'}
INFO:__main__:Checking load_dataset with config name bronco_bigbio_kb
WARNING:datasets.builder:Using custom data configuration bronco_bigbio_kb-7a1d288ef98b5779
Downloading and preparing dataset bronco/bronco_bigbio_kb to C:/Users/admin/.cache/huggingface/datasets/bronco/bronco_bigbio_kb-7a1d288ef98b5779/1.0.0/3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7...

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:00,  1.02 examples/s]
Generating train split: 3 examples [00:01,  3.38 examples/s]

Dataset bronco downloaded and prepared to C:/Users/admin/.cache/huggingface/datasets/bronco/bronco_bigbio_kb-7a1d288ef98b5779/1.0.0/3be6d3970545324c95c2b776563ec9a0aa3f88c5469d78d3cf4bc4d3f51577e7. Subsequent calls will reuse this data.

  0%|          | 0/1 [00:00<?, ?it/s]
100%|##########| 1/1 [00:00<00:00, 45.75it/s]
INFO:__main__:schema = KB
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 17746 unique IDs
INFO:__main__:Gathering dataset statistics
INFO:__main__:Testing schema for: train
INFO:__main__:Checking if referenced IDs are properly mapped
INFO:__main__:KB ONLY: Checking passage offsets
INFO:__main__:KB ONLY: Checking entity offsets
INFO:__main__:KB ONLY: multi-label `db_id`
INFO:__main__:KB ONLY: Checking event offsets
INFO:__main__:KB ONLY: Checking coref offsets
INFO:__main__:KB ONLY: multi-label `type` fields
.
----------------------------------------------------------------------
Ran 1 test in 17.758s

OK
train
==========
id: 5
document_id: 5
passages: 8981
entities: 8760
normalized: 8760
events: 0
coreferences: 0
relations: 0
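For anyone reproducing the two runs above, both configs can also be sanity-checked with a plain load_dataset call (a minimal sketch mirroring what the unit test exercises; the script path and data_dir are the local ones from this run):

    # minimal sketch: load both configs directly, mirroring the unit test above;
    # the script path and data_dir are the local ones from this run
    from datasets import load_dataset

    for config in ["bronco_source", "bronco_bigbio_kb"]:
        ds = load_dataset(
            "bigbio/hub/hub_repos/bronco/bronco.py",
            name=config,
            data_dir=r"C:\Users\admin\Desktop\BRONCO150",
        )
        print(config, ds["train"].num_rows)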
mariosaenger commented 1 year ago

Hey @nachollorca!

Thanks for implementing the dataset. Generally, the code looks fine to me. However, the source schema is relatively technical (essentially an accurate representation of the BioC format). You could have opted for a more abstract schema here, but this is still perfectly fine, I think.

I just have one minor issue. Please run the code formatting once more: make check_file=bigbio/hub/hub_repos/bronco/bronco.py

Best, Mario

nachollorca commented 1 year ago

@mariosaenger you are welcome!

Yes, I tried to keep the source schema identical to the provided .xml, which indeed follows the BioC format. If you have any specific ideas in mind, I can give them a shot.
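For reference, those files can be walked with the standard library alone; below is a minimal sketch following the generic BioC layout (the file name and the infon keys BRONCO actually uses are assumptions here):

    # minimal sketch of iterating a BioC XML file with the standard library;
    # tag names follow the generic BioC DTD, while the file name and the
    # "type" infon key are assumptions about BRONCO's files
    import xml.etree.ElementTree as ET

    root = ET.parse("BRONCO150.xml").getroot()  # <collection> element
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            for annotation in passage.iter("annotation"):
                infons = {i.get("key"): i.text for i in annotation.iter("infon")}
                mention = annotation.findtext("text")  # covered entity text
                location = annotation.find("location")
                print(doc_id, infons.get("type"), mention,
                      location.get("offset"), location.get("length"))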

I applied the formatting changes already.

All the best!

mariosaenger commented 1 year ago

For many of the NER corpora we rely on a source schema quite analogous to the kb schema and (possibly) extend it with dataset-specific information. For BRONCO this could be something like:

        "entities": [
            {
                "id": datasets.Value("string"),
                "type": datasets.Value("string"),
                "text": datasets.Sequence(datasets.Value("string")),
                "offsets": datasets.Sequence([datasets.Value("int32")]),
                "levelOfTruth": datasets.Value("string"),
                "localisation": datasets.Value("string"),
                "normalized": [
                    {
                        "db_name": datasets.Value("string"),
                        "db_id": datasets.Value("string"),
                    }
                ],
            }
        ],
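An entity record under such a schema would then look roughly like this (all field values below are invented purely for illustration):

    # illustrative example record; every value here is made up
    {
        "id": "doc5-e1",
        "type": "MEDICATION",
        "text": ["Tamoxifen"],
        "offsets": [[112, 121]],
        "levelOfTruth": "positive",
        "localisation": "R",
        "normalized": [{"db_name": "ATC", "db_id": "L02BA01"}],
    }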
hakunanatasha commented 1 year ago

@mariosaenger is this ok to merge?

mariosaenger commented 1 year ago

As far as I am concerned, it can be merged, unless @nachollorca wants to adapt the source schema?

nachollorca commented 1 year ago

Hey @mariosaenger, @hakunanatasha. The changes are not difficult, and I would definitely make them; I am just on a tight schedule at the moment. You can either merge now and I will submit another PR once I touch things up, or wait until I have the time.