bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #873 and revises #874 #876

Open WangXII opened 1 year ago

WangXII commented 1 year ago

Closes https://github.com/bigscience-workshop/biomedical/issues/873 and revises https://github.com/bigscience-workshop/biomedical/pull/874: wrong entity offsets in the tmvar_v3 datasets.

Wrong offsets in PMID 21904390 are already present in the source file https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3Corpus.txt

Solution: Manually corrected the offsets in PMID 21904390, since the errors do not seem to follow any pattern.
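One quick way to spot offsets like these is to check that each annotated span actually reproduces the entity text. The following is only a minimal sketch: it assumes entities are dicts with a `text` key and an `offsets` key holding a `[start, end]` pair, and the passage and entities are made up for illustration, not taken from tmVar3Corpus.txt. `find_misaligned_entities` is a hypothetical helper, not part of the data loader.

```python
def find_misaligned_entities(passage, entities):
    """Return the entities whose [start, end) span does not match their text."""
    misaligned = []
    for entity in entities:
        start, end = entity["offsets"]
        if passage[start:end] != entity["text"]:
            misaligned.append(entity)
    return misaligned


# Illustrative data only (assumption), not taken from the corpus.
passage = "The V600E mutation in BRAF is common."
entities = [
    {"text": "V600E", "offsets": [4, 9]},   # span matches the text
    {"text": "BRAF", "offsets": [21, 25]},  # off by one: should be [22, 26]
]

bad = find_misaligned_entities(passage, entities)
for entity in bad:
    start, end = entity["offsets"]
    print(f"{entity['text']!r} does not match span {passage[start:end]!r}")
```

A check like this only flags misaligned spans. It cannot say what the right offsets are, which is why the corrections for this PMID were made by hand.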

Compared to https://github.com/bigscience-workshop/biomedical/pull/874, this pull request reverts the offsets in the standard 'source' dataset to the original (but wrong) offsets provided by the source corpus and adds a new 'source_fixed' dataset with the corrected offsets.


WangXII commented 1 year ago

Thanks for the review, @leonweber. I looked into the merge conflict: the branch is already up to date with main. It seems to be a simple change that the automatic merge strategies fail to resolve. The branch version adds an if-clause so that the fixed annotations appear in the source version only if the user explicitly requests them ("_source_fixed" vs "_source").

<<<<<<< tmvar_v3_fix
                if "_fixed" in self.config.name:
                    document["entities"] = self._correct_wrong_offsets(
                        document["entities"], doc.pmid
                    )
=======
                document["entities"] = self._correct_wrong_offsets(
                    document["entities"], doc.pmid
                )
>>>>>>> main
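
Resolving the conflict in favour of the branch side keeps the guard, so only configs whose name contains "_fixed" receive corrected offsets. Below is a minimal standalone sketch of that behaviour. `build_entities` and the `shift_by_one` callback are hypothetical stand-ins for the loader code and `_correct_wrong_offsets`, and the full config names are assumed for illustration.

```python
def build_entities(config_name, entities, pmid, correct):
    """Apply the offset correction only when a '_fixed' config is requested."""
    if "_fixed" in config_name:
        return correct(entities, pmid)
    return entities


def shift_by_one(entities, pmid):
    """Toy correction (assumption): shift every span one character right."""
    return [
        {**entity, "offsets": [entity["offsets"][0] + 1, entity["offsets"][1] + 1]}
        for entity in entities
    ]


# Illustrative entity, not taken from the corpus.
raw = [{"text": "BRAF", "offsets": [21, 25]}]

# Config names below are assumed for illustration.
unchanged = build_entities("tmvar_v3_source", raw, "21904390", shift_by_one)
fixed = build_entities("tmvar_v3_source_fixed", raw, "21904390", shift_by_one)
```

With this guard the plain source config keeps the offsets as distributed, while the fixed config returns the corrected ones.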