bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Implement CoNECo dataset #931

Closed oguzserbetci closed 1 week ago

oguzserbetci commented 3 weeks ago


oguzserbetci commented 2 weeks ago

Hi @oguzserbetci, thank you for the PR!

Apart from the comments, I wondered if you have any idea why the entity counts differ so much from the paper. For instance, the paper reports 1360 mentions in the training set, but here we have 1589 in the training split.

Are there any annotations (e.g., duplicates) that were removed in the paper?

train
==========
id: 983
document_id: 983
passages: 983
entities: 1589
events: 0
coreferences: 0
relations: 0
normalized: 1310

@phlobo I have been looking into this, but without finding a solution. I counted the entities in the brat files and also in the conll files using grep, and the counts are close to 1589, albeit not consistent with each other... I couldn't find anything about this in the paper. The initial version of the dataset on Zenodo has the same number of entities.
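For reference, the grep-style count over the brat files could be reproduced roughly as below. This is only a sketch: it relies on the brat standoff convention that text-bound annotations have IDs starting with `T`, and the example annotation content, including the type names `Complex` and `OOS`, is made up for illustration and not taken from the dataset.

```python
from collections import Counter


def count_brat_entities(ann_text: str) -> Counter:
    """Count text-bound annotations (IDs starting with 'T') per entity type."""
    counts = Counter()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip normalizations (N), relations (R), notes (#), etc.
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed line, nothing to count
        # Second field looks like "<type> <start> <end>"; keep only the type.
        entity_type = fields[1].split(" ", 1)[0]
        counts[entity_type] += 1
    return counts


# Hypothetical .ann content, for illustration only.
example = (
    "T1\tComplex 0 4\tAP-1\n"
    "N1\tReference T1 DB:0000\tAP-1\n"
    "T2\tOOS 10 14\tNFkB\n"
)
counts = count_brat_entities(example)
total = sum(counts.values())
```

Breaking the count down per type makes it easier to spot a label that contributes to the total but is not reported in the paper.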

oguzserbetci commented 2 weeks ago

I checked again, there are no overflowing or duplicate offset entities. Also I was able to locate entity mentions in the text. Here is how I tested:

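>>> import numpy as np
>>> # `data` is the loaded dataset. Check 1: no entity offset exceeds the largest passage offset.
>>> all(np.max([np.max(ent['offsets']) for ent in sample['entities']]) <= np.max([np.max(p['offsets']) for p in sample['passages']]) for sample in data['train'] if sample['entities'])
True
>>> # Check 2: each entity text equals the passage text sliced at its offsets (one passage per sample).
>>> all(sample['passages'][0]['text'][0][ent_begin:ent_end] == ent_text for sample in data['train'] for ent in sample['entities'] for (ent_begin, ent_end), ent_text in zip(ent['offsets'], ent['text']))
True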
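The two checks above cover overflowing offsets and text alignment. A duplicate-offset check could look roughly like the following sketch. It assumes the same sample layout (an `entities` list whose items carry an `offsets` list of `[start, end]` pairs); the helper name `duplicate_offset_entities` and the toy sample are hypothetical.

```python
from collections import Counter


def duplicate_offset_entities(sample: dict) -> list:
    """Return the offset tuples shared by more than one entity in a sample."""
    seen = Counter(
        # Lists are not hashable, so convert each entity's spans to tuples.
        tuple(tuple(span) for span in ent["offsets"])
        for ent in sample["entities"]
    )
    return [offsets for offsets, n in seen.items() if n > 1]


# Toy sample, for illustration only.
sample = {
    "entities": [
        {"offsets": [[0, 4]], "text": ["AP-1"]},
        {"offsets": [[0, 4]], "text": ["AP-1"]},
        {"offsets": [[10, 14]], "text": ["NFkB"]},
    ]
}
dups = duplicate_offset_entities(sample)
```

An empty result for every sample in a split would mean that no two entities share exactly the same spans.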

The number of normalized entity mentions does match the paper. It is curious indeed. Should I contact the paper authors?

phlobo commented 2 weeks ago

The number of normalized entity mentions does match the paper. It is curious indeed. Should I contact the paper authors?

@oguzserbetci: that would be amazing! :)

oguzserbetci commented 1 week ago

@phlobo I contacted one of the authors and got the issue resolved. Some entities were labeled as out-of-scope; I now filter these out, and the resulting entity counts match the paper.

train
==========
id: 983
document_id: 983
passages: 983
entities: 1360
normalized: 1310
events: 0
coreferences: 0
relations: 0

test
==========
id: 318
document_id: 318
passages: 318
entities: 283
events: 0
coreferences: 0
relations: 0
normalized: 279

validation
==========
id: 320
document_id: 320
passages: 320
entities: 409
normalized: 387
events: 0
coreferences: 0
relations: 0
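
The out-of-scope filtering mentioned above can be sketched as follows. This is not the code from the PR: the label value `"OOS"`, the `type` field on each entity, and the helper name `drop_out_of_scope` are assumptions made for illustration, so the actual label should be taken from the dataset's annotation files.

```python
OUT_OF_SCOPE_TYPE = "OOS"  # assumed label, not confirmed by the thread


def drop_out_of_scope(entities: list, oos_type: str = OUT_OF_SCOPE_TYPE) -> list:
    """Keep only the entities whose type is not the out-of-scope label."""
    return [ent for ent in entities if ent["type"] != oos_type]


# Toy entities, for illustration only.
entities = [
    {"id": "T1", "type": "Complex", "offsets": [[0, 4]], "text": ["AP-1"]},
    {"id": "T2", "type": "OOS", "offsets": [[10, 14]], "text": ["NFkB"]},
    {"id": "T3", "type": "Complex", "offsets": [[20, 24]], "text": ["TFIID"[:4]]},
]
kept = drop_out_of_scope(entities)
kept_ids = [ent["id"] for ent in kept]
```

Applying such a filter before building the entity list would lower only the `entities` count and leave `normalized` unchanged, provided the out-of-scope mentions carry no normalizations.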