allenai / scibert

A BERT model for scientific text.
https://arxiv.org/abs/1903.10676
Apache License 2.0

JNLPBA dataset #26

Open stefan-it opened 5 years ago

stefan-it commented 5 years ago

Hi,

thanks for releasing the SciBERT model and datasets :heart:

I'm currently adding an import method for the NER data to the flair library.

I checked the number of imported sentences across all dataset splits and found 2,404 fewer sentences than the total reported in Table 2 of the paper (24,806). Looking at the JNLPBA dataset, it seems that all -DOCSTART- O lines were also counted as sentences, which I think is a bit redundant.
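One way to check this kind of count is sketched below. It assumes a CoNLL-style file in which sentences are separated by blank lines and each -DOCSTART- marker sits in its own blank-line-delimited block; the function name `count_sentences` is hypothetical and is not part of flair or the SciBERT repository.

```python
from typing import Iterable, Tuple


def count_sentences(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentence_count, docstart_count) for blank-line-separated data.

    Assumption: a block whose first token is -DOCSTART- is a document
    marker, not a real sentence.
    """
    sentences = 0
    docstarts = 0
    block = []
    # A trailing empty string flushes the final block even when the
    # input does not end with a blank line.
    for raw in list(lines) + [""]:
        line = raw.strip()
        if line:
            block.append(line)
            continue
        if block:
            if block[0].split()[0] == "-DOCSTART-":
                docstarts += 1
            else:
                sentences += 1
            block = []
    return sentences, docstarts


sample = ["-DOCSTART- O", "", "IL-2 B-DNA", "gene I-DNA", "", "NF-kappa B-protein", ""]
print(count_sentences(sample))
```

Passing an open file handle (for example `open("train.txt", encoding="utf-8")`) instead of a list should work the same way, since the function only iterates over lines.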

The numbers of training and development sentences also differ from those reported in the BioBERT paper. BioBERT uses a 14,690 / 3,856 / 3,856 split, whereas the data provided in this repository uses a 16,807 / 1,739 / 3,856 split. Could you confirm this?
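The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers from this comment; the variable names are made up for illustration.

```python
# Sentence counts as quoted in this issue (train / dev / test).
scibert_split = {"train": 16807, "dev": 1739, "test": 3856}
biobert_split = {"train": 14690, "dev": 3856, "test": 3856}
paper_total = 24806  # total reported in Table 2 of the SciBERT paper

scibert_total = sum(scibert_split.values())
biobert_total = sum(biobert_split.values())
scibert_train_dev = scibert_split["train"] + scibert_split["dev"]
biobert_train_dev = biobert_split["train"] + biobert_split["dev"]

print(scibert_total, biobert_total)          # 22402 22402
print(paper_total - scibert_total)           # 2404
print(scibert_train_dev, biobert_train_dev)  # 18546 18546
```

Both splits sum to the same total and the same train + dev count, so the difference appears to lie only in where the train/dev boundary is drawn, while the gap to the paper's total equals the 2,404 mismatch mentioned above.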

Thanks + regards,

Stefan

kyleclo commented 5 years ago

Hey Stefan, Thanks for your interest in the project. I'll look into the line counting issue, and update the reported numbers. As for the dataset splits in JNLPBA, it might be an issue of using different sources of the dataset files.
The JNLPBA dataset we used was pulled from https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/JNLPBA, which results in files with the following line counts:

   49600 dev.txt
  105703 test.txt
  465497 train.txt

shreyashub commented 5 years ago

@kyleclo, the JNLPBA dataset you used was pulled from https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/JNLPBA. Then why do the tags in your processed data show B-Entity instead of, say, B-Disease?

ibeltagy commented 5 years ago

@shreyashub, I think you are talking about bc5cdr, not JNLPBA, because JNLPBA doesn't have a Disease category. For bc5cdr, we used a version that we had in s2 that dropped the entity types and combined bc5cdr-disease and bc5cdr-chem into one. I agree it would have been better to use the original dataset.
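
To illustrate what dropping the entity types means for the tags, here is a minimal sketch. It is not the actual s2 preprocessing, which is not shown in this thread; the helper `collapse_entity_type` and the example tag names are hypothetical, and the sketch assumes BIO-style tags of the form B-Type / I-Type / O.

```python
def collapse_entity_type(tag: str, generic: str = "Entity") -> str:
    """Map a typed BIO tag (e.g. B-Disease) to a generic one (B-Entity).

    Tags without a "-" separator, such as O, are returned unchanged.
    """
    if "-" not in tag:
        return tag
    prefix, _entity_type = tag.split("-", 1)
    return f"{prefix}-{generic}"


typed_tags = ["O", "B-Disease", "I-Disease", "O", "B-Chemical"]
print([collapse_entity_type(t) for t in typed_tags])
# ['O', 'B-Entity', 'I-Entity', 'O', 'B-Entity']
```

After this mapping, disease and chemical mentions can no longer be told apart, which is why the processed data shows only B-Entity and I-Entity.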