Implement CoNECo dataset

oguzserbetci commented 3 weeks ago

Name: CoNECo
Description: Complex Named Entity Corpus (CoNECo) is an annotated corpus for NER and NEN of protein-containing complexes. CoNECo comprises 1,621 documents divided into train, val and test splits with 2,052 entities, 1,976 of which are normalized to Gene Ontology.
Paper: https://www.biorxiv.org/content/early/2024/05/29/2024.05.18.594800
Data: https://zenodo.org/records/11263147

Checkbox

- [ ] ~~Confirm that this PR is linked to the dataset issue.~~
- [x] Create the dataloader script `hub/hub_repos/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- [x] Confirm dataloader script works with `datasets.load_dataset` function.
- [x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local`.
- [ ] ~~If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.~~

oguzserbetci commented 2 weeks ago

> Hi @oguzserbetci, thank you for the PR!
>
> Apart from the comments, I wondered if you have any idea why the entity counts differ so much from the paper. For instance, the paper reports 1360 mentions in the training set, but here we have 1589 in the training split.
>
> Are there any annotations (e.g., duplicates) that are removed in the paper?
>
> train
> ==========
> id: 983
> document_id: 983
> passages: 983
> entities: 1589
> events: 0
> coreferences: 0
> relations: 0
> normalized: 1310

@phlobo I have been looking into this but have not found an explanation. I counted the entities in the brat files and in the CoNLL files using grep; both counts are close to 1589, although they do not agree exactly with each other. I couldn't find anything in the paper that explains the difference. The initial version of the dataset on Zenodo has the same number of entities.
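
A minimal sketch of the brat count described above, as an alternative to grep. It assumes the standard brat standoff layout, where text-bound annotations are the lines whose ID starts with `T` and the type is the first token of the second tab-separated field. The function name, the example annotation text, and the type names `Complex` and `OutOfScope` are illustrative assumptions, not taken from the corpus.

```python
from collections import Counter


def count_brat_entities(ann_text: str) -> Counter:
    """Count text-bound annotations (IDs starting with 'T') per entity type."""
    counts = Counter()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip normalizations (N), attributes (A), notes (#), etc.
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed line, ignore
        # Second field looks like "TYPE START END"; keep only the type
        entity_type = fields[1].split(" ", 1)[0]
        counts[entity_type] += 1
    return counts


# Hypothetical .ann content; the type names are illustrative only
example_ann = (
    "T1\tComplex 0 9\tribosomes\n"
    "N1\tReference T1 GO:0005840\tribosome\n"
    "T2\tOutOfScope 14 20\tcasein\n"
)
counts = count_brat_entities(example_ann)
total = sum(counts.values())
```

Counting per type rather than in total makes it easier to see whether one label accounts for a gap between two counts.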

oguzserbetci commented 2 weeks ago

I checked again, there are no overflowing or duplicate offset entities. Also I was able to locate entity mentions in the text. Here is how I tested:

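>>> import numpy as np
>>> # `data` is the dataset returned by datasets.load_dataset. Check 1: no entity offset extends past the end of the passages.
>>> all([np.max([np.max(ent['offsets']) for ent in sample['entities']]) <= np.max([np.max(p['offsets']) for p in sample['passages']]) for sample in data['train'] if sample['entities']])
True
>>> # Check 2: every entity's text equals the passage text sliced at its offsets.
>>> all([sample['passages'][0]['text'][0][ent_begin:ent_end] == ent_text for sample in data['train'] for ent in sample['entities'] for (ent_begin, ent_end), ent_text in zip(ent['offsets'], ent['text'])])
True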
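
The same checks can be sketched as a small function that also covers the duplicate-offset case, which the one-liners above do not show. This is only an illustration: it assumes each sample has a single passage whose text starts at offset 0 (as the one-liners do), and the function name, the problem labels, and the toy sample are made up for the example.

```python
def check_sample(sample: dict) -> list:
    """Return a list of (problem, entity_id) tuples; an empty list means the sample passed."""
    problems = []
    # Assumption: one passage per document, with text starting at offset 0
    text = sample["passages"][0]["text"][0]
    passage_end = max(end for p in sample["passages"] for _, end in p["offsets"])
    seen = set()
    for ent in sample["entities"]:
        spans = tuple(tuple(o) for o in ent["offsets"])
        if any(end > passage_end for _, end in spans):
            problems.append(("overflow", ent["id"]))
        if spans in seen:
            problems.append(("duplicate", ent["id"]))
        seen.add(spans)
        for (begin, end), ent_text in zip(spans, ent["text"]):
            if text[begin:end] != ent_text:
                problems.append(("text_mismatch", ent["id"]))
    return problems


# Toy sample with a deliberate duplicate span
toy_text = "The ribosome binds mRNA."
toy_sample = {
    "passages": [{"text": [toy_text], "offsets": [[0, len(toy_text)]]}],
    "entities": [
        {"id": "e1", "offsets": [[4, 12]], "text": ["ribosome"]},
        {"id": "e2", "offsets": [[4, 12]], "text": ["ribosome"]},
    ],
}
toy_problems = check_sample(toy_sample)
```

Returning the offending entity IDs instead of a single boolean makes it possible to inspect the specific annotations if a check ever fails.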

The number of normalized entity mentions does match the paper. It is curious indeed. Should I contact the paper authors?

phlobo commented 2 weeks ago

> The number of normalized entity mentions does match the paper. It is curious indeed. Should I contact the paper authors?

@oguzserbetci: that would be amazing! :)

oguzserbetci commented 1 week ago

@phlobo I contacted one of the authors and got the issue resolved. Some entities are labeled as out-of-scope. I now filter these out, and the resulting entity counts match the paper.
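
A rough sketch of what such a filter could look like. The label name `OutOfScope`, the helper name, and the example entities are hypothetical; the actual label used in the corpus and the way the dataloader applies the filter are not shown in this thread.

```python
# Hypothetical label name; replace with the label actually used in the corpus
OUT_OF_SCOPE_TYPES = frozenset({"OutOfScope"})


def drop_out_of_scope(entities, excluded_types=OUT_OF_SCOPE_TYPES):
    """Keep only entities whose type is not in the excluded set."""
    return [ent for ent in entities if ent["type"] not in excluded_types]


# Illustrative entities, not real corpus data
example_entities = [
    {"id": "e1", "type": "Complex", "text": ["ribosome"]},
    {"id": "e2", "type": "OutOfScope", "text": ["casein"]},
]
kept = drop_out_of_scope(example_entities)
```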

train
==========
id: 983
document_id: 983
passages: 983
entities: 1360
normalized: 1310
events: 0
coreferences: 0
relations: 0

test
==========
id: 318
document_id: 318
passages: 318
entities: 283
events: 0
coreferences: 0
relations: 0
normalized: 279

validation
==========
id: 320
document_id: 320
passages: 320
entities: 409
normalized: 387
events: 0
coreferences: 0
relations: 0
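
As a quick consistency check, the per-split counts above can be summed and compared with the totals in the dataset description (1,621 documents, 2,052 entities, 1,976 normalized). The variable names are only for illustration.

```python
# Per-split counts copied from the statistics above
split_counts = {
    "train": {"documents": 983, "entities": 1360, "normalized": 1310},
    "test": {"documents": 318, "entities": 283, "normalized": 279},
    "validation": {"documents": 320, "entities": 409, "normalized": 387},
}

totals = {
    key: sum(split[key] for split in split_counts.values())
    for key in ("documents", "entities", "normalized")
}
print(totals)
# {'documents': 1621, 'entities': 2052, 'normalized': 1976}
```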

bigscience-workshop / biomedical

Implement CoNECo dataset #931