dmis-lab / biobert-pytorch

PyTorch Implementation of BioBERT
http://doi.org/10.1093/bioinformatics/btz682

Empty dev.tsv files in the GAD dataset after download #41

Open MaxwellWibert opened 1 year ago

MaxwellWibert commented 1 year ago

I cloned the repo and ran download.sh, and found a dev.tsv file in each of the numbered folders. However, each of those files was completely empty. Is there some other preprocessing script that is responsible for populating these files?

wonjininfo commented 1 year ago

Hi Maxwell, for the GAD dataset, we chose to evaluate our model using 10-fold cross-validation because it is a very small dataset. Therefore, there is no fixed Train-Dev-Test split, and we do not have a dev.tsv.

Unfortunately, GAD may not be an ideal resource for evaluating LMs. However, in the five years since the BioBERT paper was published, there have been significant efforts to create resources for relation extraction in NLP. As a result, other, larger resources for BioRE are now available (to name a few: DrugProt, BioRED).

MaxwellWibert commented 1 year ago

Thank you for your response! We agree that GAD is not ideal as an LM evaluation dataset. However, both BioBERT and its derivatives have become common benchmarks in the field, so we often have to recreate your original datasets. I'm afraid my institution's decision to use GAD as a benchmarking set is above my pay grade.

Maybe this is a silly question, but did you generate the 10-fold cross-validation by just looping over the 10 train.tsv files and setting the current file to be the validation set?

In other words, is the k-fold structured as follows?

First iteration: 1/train.tsv for validation; 2/train.tsv through 10/train.tsv for training.
...
i-th iteration: i/train.tsv for validation; j/train.tsv for every j != i for training.
...
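
To make the question concrete, here is a minimal Python sketch of the scheme I am describing. It is only my guess at the procedure, not something confirmed by the authors. The function names (`numbered_folds`, `leave_one_fold_out`) and the root folder name `GAD` are made up for illustration; the only things taken from the download are the numbered folders 1 through 10 and the train.tsv file in each.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def numbered_folds(root: Path, k: int = 10) -> List[Path]:
    """Return the fold directories root/1 ... root/k (hypothetical layout helper)."""
    return [root / str(n) for n in range(1, k + 1)]


def leave_one_fold_out(
    fold_dirs: List[Path], filename: str = "train.tsv"
) -> Iterator[Tuple[Path, List[Path]]]:
    """Yield (validation_file, training_files), holding out one fold per iteration."""
    files = [d / filename for d in fold_dirs]
    for i, val_file in enumerate(files):
        # Every file except the i-th one is used for training in this iteration.
        train_files = files[:i] + files[i + 1:]
        yield val_file, train_files


if __name__ == "__main__":
    # "GAD" is an assumed root folder name; adjust to wherever download.sh put the data.
    for val_file, train_files in leave_one_fold_out(numbered_folds(Path("GAD"))):
        print(f"validate on {val_file}, train on {len(train_files)} files")
```

This only builds the file lists and does not read any data, so it runs even if the folders are missing. If the intended procedure is different (for example, if each numbered folder is already a self-contained split), then this sketch does not apply.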