Open WangXII opened 1 year ago
Thanks for the review, @leonweber. I looked into the merge conflict, and the branch is already up to date with main. It looks like a simple change that the automatic merge strategy fails to resolve. The branch version adds an if-clause so that the fixed annotations are shown in the source version only if the user explicitly requests them ("_source_fixed" vs. "_source"):
<<<<<<< tmvar_v3_fix
if "_fixed" in self.config.name:
    document["entities"] = self._correct_wrong_offsets(
        document["entities"], doc.pmid
    )
=======
document["entities"] = self._correct_wrong_offsets(
    document["entities"], doc.pmid
)
>>>>>>> main
Closes https://github.com/bigscience-workshop/biomedical/issues/873 and revises https://github.com/bigscience-workshop/biomedical/pull/874 - Wrong entity offsets in the tmvar_v3 datasets
Problem: The wrong offsets in PMID 21904390 are already present in the source file, https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3Corpus.txt
Solution: Manually corrected the wrong offsets in PMID 21904390, since they do not seem to follow any pattern.
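For illustration, a minimal sketch of how such offsets could be detected and patched from a hand-written table. Everything here is an assumption made for the example: the simplified entity shape (a single `(start, end)` tuple and one `text` string), the helper names `find_mismatches` and `apply_manual_offsets`, the placeholder PMID "0000", and the offsets in the table. It is not the code or the correction values used in this pull request.

```python
from typing import Any, Dict, List, Tuple

Entity = Dict[str, Any]

# Hypothetical correction table: (pmid, entity id) -> corrected (start, end).
# The entry below is made up for this example.
MANUAL_OFFSETS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("0000", "e1"): (4, 9),
}


def find_mismatches(text: str, entities: List[Entity]) -> List[Entity]:
    """Return the entities whose offsets do not slice out their own text."""
    bad = []
    for entity in entities:
        start, end = entity["offsets"]
        if text[start:end] != entity["text"]:
            bad.append(entity)
    return bad


def apply_manual_offsets(entities: List[Entity], pmid: str) -> List[Entity]:
    """Return copies of the entities with table-listed offsets replaced."""
    fixed = []
    for entity in entities:
        key = (pmid, entity["id"])
        if key in MANUAL_OFFSETS:
            # Copy instead of mutating so the original annotations survive.
            entity = {**entity, "offsets": MANUAL_OFFSETS[key]}
        fixed.append(entity)
    return fixed


passage = "The c.35G variant"
entities = [{"id": "e1", "text": "c.35G", "offsets": (5, 10)}]  # off by one
corrected = apply_manual_offsets(entities, "0000")
```

A lookup table fits the case described above, where the errors follow no pattern and so cannot be repaired by a single shift rule. The `find_mismatches` check can be run before and after to confirm that the patched offsets actually match the entity text.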
Compared to https://github.com/bigscience-workshop/biomedical/pull/874, this pull request reverts the offsets in the standard 'source' dataset to the original (but wrong) offsets provided by the source file and adds a new 'source_fixed' dataset with the corrected offsets.
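The split between the two datasets can be sketched as a check on the config name, mirroring the `"_fixed" in self.config.name` test from the branch. The function `select_entities`, the stand-in correction `shift_left`, and the full config names `tmvar_v3_source` and `tmvar_v3_source_fixed` are assumptions for this example. The real dataloader calls its own `_correct_wrong_offsets` method instead.

```python
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]


def select_entities(
    config_name: str,
    entities: List[Entity],
    pmid: str,
    correct: Callable[[List[Entity], str], List[Entity]],
) -> List[Entity]:
    """Apply the correction only for configs whose name contains '_fixed'."""
    if "_fixed" in config_name:
        return correct(entities, pmid)
    # Default configs keep the offsets exactly as given in the source file.
    return entities


def shift_left(entities: List[Entity], pmid: str) -> List[Entity]:
    """Stand-in correction for the example: move every span one char left."""
    return [
        {**e, "offsets": (e["offsets"][0] - 1, e["offsets"][1] - 1)}
        for e in entities
    ]


raw = [{"id": "e1", "offsets": (5, 10)}]
plain = select_entities("tmvar_v3_source", raw, "0000", shift_left)
fixed = select_entities("tmvar_v3_source_fixed", raw, "0000", shift_left)
```

With this gating, users who load the plain source config get data identical to the upstream file, and the corrections are opt-in.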
Checkbox
The BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
The dataloader script works with the datasets.load_dataset function.
The dataloader script passes the test suite run with python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local.