Closed: oguzserbetci closed this pull request 1 week ago
Hi @oguzserbetci, thank you for the PR!
Apart from the comments, I wondered if you have any idea why the entity counts are so different from the paper — for instance, the paper reports 1360 mentions in the training set, but here we have 1589 in the training split.
Are there any annotations (e.g., duplicates) that were removed in the paper?
train
==========
id: 983
document_id: 983
passages: 983
entities: 1589
events: 0
coreferences: 0
relations: 0
normalized: 1310
@phlobo I have been looking into this but without any solution. I counted entities in the brat files and also in the CoNLL files using grep, and the counts are close to 1589, albeit not fully consistent... I couldn't find anything in the paper that would explain the difference. The initial version of the dataset on Zenodo has the same number of entities.
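For reference, the grep-style count over the brat files can also be done in Python. A minimal sketch (the directory path is a placeholder; it relies on the brat standoff convention that textbound/entity annotation IDs start with `T`):

```python
from pathlib import Path


def count_brat_entities(ann_dir: str) -> int:
    """Count textbound (entity) annotations across all .ann files in a directory."""
    total = 0
    for ann_file in Path(ann_dir).glob("*.ann"):
        for line in ann_file.read_text(encoding="utf-8").splitlines():
            # Textbound annotations look like: "T1\tDisease 0 4\tcold";
            # attribute/relation lines start with other letters (A, R, ...).
            if line.startswith("T"):
                total += 1
    return total
```

This mirrors `grep -c '^T' *.ann` but sums across files in one call.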
I checked again: there are no entities with out-of-bounds or duplicate offsets, and I was able to locate all entity mentions in the text. Here is how I tested:
>>> all([np.max([np.max(ent['offsets']) for ent in sample['entities']]) <= np.max([np.max(p['offsets']) for p in sample['passages']]) for sample in data['train'] if sample['entities']])
True
>>> all([sample['passages'][0]['text'][0][ent_begin:ent_end] == ent_text for sample in data['train'] for ent in sample['entities'] for (ent_begin, ent_end), ent_text in zip(ent['offsets'], ent['text'])])
True
The number of normalized entity mentions does match the paper. It is curious indeed. Should I contact the paper authors?
@oguzserbetci: that would be amazing! :)
@phlobo I contacted an author and got the issue resolved. There were entities labeled as out-of-scope, which I now filter out, and the resulting entity counts match the paper.
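The filtering step can be sketched as follows. Note that the exact out-of-scope label string (`"OOS"` below) is an assumption; the dataset may use a different marker:

```python
# Hypothetical label(s) marking out-of-scope mentions -- an assumption,
# not confirmed by the thread.
OUT_OF_SCOPE_TYPES = {"OOS"}


def filter_in_scope(entities):
    """Drop entities whose type marks them as out-of-scope."""
    return [ent for ent in entities if ent["type"] not in OUT_OF_SCOPE_TYPES]
```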
train
==========
id: 983
document_id: 983
passages: 983
entities: 1360
normalized: 1310
events: 0
coreferences: 0
relations: 0
test
==========
id: 318
document_id: 318
passages: 318
entities: 283
normalized: 279
events: 0
coreferences: 0
relations: 0
validation
==========
id: 320
document_id: 320
passages: 320
entities: 409
normalized: 387
events: 0
coreferences: 0
relations: 0
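A sketch of how per-split counts like the above can be reproduced. The field names follow the bigbio schema used in the snippets earlier in the thread; in practice the splits would come from `datasets.load_dataset`, which is omitted here:

```python
def split_stats(split):
    """Compute document, entity, and normalized-entity counts for one split."""
    return {
        "documents": len(split),
        "entities": sum(len(sample["entities"]) for sample in split),
        # An entity counts as normalized if its "normalized" list is non-empty.
        "normalized": sum(
            1
            for sample in split
            for ent in sample["entities"]
            if ent.get("normalized")
        ),
    }
```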
Checklist
- Confirm that this PR is linked to the dataset issue.
- Create the dataloader script `hub/hub_repos/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- Confirm that the dataloader script works with the `datasets.load_dataset` function.
- Confirm that the dataloader passes the test suite run with `python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local`.
- If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.