bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #873 #874

Status: Closed (WangXII closed this 1 year ago)

WangXII commented 1 year ago

Closes #873 - Wrong entity offsets in the tmvar_v3 datasets

The wrong offsets for PMID 21904390 are already present in the upstream source file: https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3Corpus.txt

Solution: Manually corrected the offsets for PMID 21904390, because the errors do not appear to follow any pattern that could be fixed programmatically.
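For reviewers who want to check offsets like these, here is a minimal sketch of a sanity check. It is not the code in this PR. It assumes the corpus uses a PubTator-style layout (`PMID|t|title`, `PMID|a|abstract`, then tab-separated annotation lines with start, end and mention in columns 2 to 4) and that title and abstract are joined by a single separator character. The names `Annotation`, `parse_pubtator_document`, `find_offset_mismatches` and the `SAMPLE` document are hypothetical and invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    start: int
    end: int
    mention: str


def parse_pubtator_document(block: str) -> Tuple[str, List[Annotation]]:
    """Parse one PubTator-style document block (assumed layout, see lead-in)."""
    title, abstract = "", ""
    annotations: List[Annotation] = []
    for line in block.strip().splitlines():
        if "\t" not in line and "|t|" in line:
            title = line.split("|t|", 1)[1]
        elif "\t" not in line and "|a|" in line:
            abstract = line.split("|a|", 1)[1]
        else:
            fields = line.split("\t")
            if len(fields) >= 4:
                annotations.append(
                    Annotation(int(fields[1]), int(fields[2]), fields[3])
                )
    # Assumption: offsets count one separator character between title and abstract.
    return title + " " + abstract, annotations


def find_offset_mismatches(text: str, annotations: List[Annotation]) -> List[Annotation]:
    """Return annotations whose [start, end) slice does not equal the mention."""
    return [a for a in annotations if text[a.start:a.end] != a.mention]


# Synthetic document (not real corpus data); the last annotation is off by one.
SAMPLE = "\n".join(
    [
        "1|t|A c.123A>G variant",
        "1|a|It alters p.V600E.",
        "\t".join(["1", "2", "10", "c.123A>G", "DNAMutation", "x"]),
        "\t".join(["1", "29", "36", "p.V600E", "ProteinMutation", "x"]),
        "\t".join(["1", "30", "37", "p.V600E", "ProteinMutation", "x"]),
    ]
)

text, annotations = parse_pubtator_document(SAMPLE)
for bad in find_offset_mismatches(text, annotations):
    print(f"offset mismatch: {bad.start}-{bad.end} expected {bad.mention!r}, "
          f"found {text[bad.start:bad.end]!r}")
```

Running a check like this per PMID lists the annotations whose spans disagree with their mention text, which are the ones that would need correcting by hand.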
