CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
Apache License 2.0

Temporarily remove token processing that generates erroneous results #32

Closed xuhdev closed 3 years ago

xuhdev commented 3 years ago

Introduced in commit e740f09b0549151fce19d13f5f00e6e58d52ccb7

Relevant comment: https://github.com/CODAIT/text-extensions-for-pandas/issues/148#issuecomment-730224771

BryanCutler commented 3 years ago

@xuhdev Is this the cause of https://github.com/CODAIT/text-extensions-for-pandas/issues/148#issuecomment-730224771?

0
B-LOC
B-MISC
B-ORG
I-LOC
I-LOC.
I-LOCMinn
I-MISC
I-MISC.
I-MISC12
I-MISCBAY
I-MISCCUP
I-MISCdiplomats
I-MISCFOOTBALL-RANDALL
...

xuhdev commented 3 years ago

@BryanCutler Yes
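
For anyone who wants to check a generated corpus for this symptom, here is a minimal sketch that flags any tag that is not an exact CoNLL-2003 IOB label (such as `I-LOCMinn` or `I-MISC12` in the listing above). The function name `find_malformed_tags` and the sample list are hypothetical and not part of this repository; the sketch assumes the usual four entity types (LOC, MISC, ORG, PER) plus `O`.

```python
import re

# Assumption: valid tags are "O" or a B-/I- prefix on one of the four
# CoNLL-2003 entity types. Anything else is treated as malformed.
VALID_TAG = re.compile(r"O|[BI]-(?:LOC|MISC|ORG|PER)")


def find_malformed_tags(tags):
    """Return tags that are not exact CoNLL-2003 labels, in first-seen order."""
    seen = set()
    malformed = []
    for tag in tags:
        # fullmatch rejects trailing characters such as "." or a fused token.
        if tag not in seen and not VALID_TAG.fullmatch(tag):
            malformed.append(tag)
        seen.add(tag)
    return malformed


# Hypothetical sample mixing valid tags with a few from the listing above.
sample = ["O", "B-LOC", "I-LOC.", "I-LOCMinn", "I-MISC", "I-MISC12", "B-PER"]
print(find_malformed_tags(sample))  # ['I-LOC.', 'I-LOCMinn', 'I-MISC12']
```

A check like this only detects the problem; it does not say where in the token correction code the label and token text get fused.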

frreiss commented 3 years ago

Maybe I'm missing something here. If this change goes through, won't the download_and_correct_corpus.py script generate a version of the corrected corpus with zero token corrections applied?

xuhdev commented 3 years ago

@frreiss We had token corrections manually applied, I believe. @kmh4321

Eventually we have to figure out what went wrong in the token corrections code.

frreiss commented 3 years ago

@xuhdev If we were to switch to applying token corrections manually, we would need to give users detailed instructions for doing so. I don't think users would appreciate that.

We need to fix the bug in the automated correction code.

xuhdev commented 3 years ago

Closed in favor of the real fix #33