CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Add detailed description text to "Identifying Incorrect Labels" tutorial #148

Closed frreiss closed 3 years ago

frreiss commented 3 years ago

We've checked in the experiment code from our CoNLL-2020 paper under tutorials/corpus, as a collection of four notebooks. Our intent is to turn these notebooks into a detailed tutorial on analyzing model outputs and corpus labels using Text Extensions for Pandas.

To complete the tutorial, we need to add explanatory text to the notebooks by inserting Markdown cells between the existing code cells.

This issue covers the task of adding this explanatory text. We anticipate that there will be several pull requests associated with this issue, as there is quite a bit of code to document.

stefan-it commented 3 years ago

Hi @frreiss,

Would it be possible to get some kind of early access to the corrected error lists and scripts? :thinking:

I would really like to run experiments with Flair on the corrected version :hugs:

Thanks in advance :heart:

Stefan

frreiss commented 3 years ago

Thanks for your interest, @stefan-it ! The repository with the list of corrections should go live tomorrow. It will be at https://github.com/CODAIT/Identifying-Incorrect-Labels-In-CoNLL-2003

frreiss commented 3 years ago

@stefan-it the repository at https://github.com/CODAIT/Identifying-Incorrect-Labels-In-CoNLL-2003 is now live.

Note that we are working on some additional cleanup and will tag a second release soon, so you may want to wait a day or two.

stefan-it commented 3 years ago

:+1: thanks for that :hugs:

I currently see some label mismatches:

0
B-LOC
B-MISC
B-ORG
I-LOC
I-LOC.
I-LOCMinn
I-MISC
I-MISC.
I-MISC12
I-MISCBAY
I-MISCCUP
I-MISCdiplomats
I-MISCFOOTBALL-RANDALL
I-MISCleader
I-MISCLouis-based
I-MISCMAKE
I-MISCopen
I-MISCPILOTS
I-MISCquits
I-MISCRETIRES
I-MISCRULES-AFL
I-MISCSEES
I-MISCspokesman
I-MISCSTATE
I-MISCstill
I-MISCTrade
I-MISCWINS
I-ORG
I-ORGAthens
I-ORGFe
I-ORGgiven
I-ORGv
I-PER
I-PER.
I-PP
O
O)
Orebels-Interfax

So I'm really excited for the release :heart:
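A listing like the one above can be produced with a short script. The following is only a sketch: it assumes a whitespace-separated, CoNLL-2003-style file with the entity tag in the last column, and the `EXPECTED` tag set and the `unexpected_tags` helper are illustrative names, not part of the corrections repository.

```python
from collections import Counter

# Assumption: the standard CoNLL-2003 entity tags are the only valid labels.
EXPECTED = {"O"} | {f"{p}-{t}" for p in ("B", "I")
                    for t in ("LOC", "MISC", "ORG", "PER")}

def unexpected_tags(lines, column=-1):
    """Count every value in the tag column that is not a known tag."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip sentence separators (blank lines) and document markers.
        if not fields or fields[0] == "-DOCSTART-":
            continue
        tag = fields[column]
        if tag not in EXPECTED:
            counts[tag] += 1
    return counts

# Hypothetical sample lines, made up for illustration only.
sample = [
    "-DOCSTART- -X- O O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "Minnesota NNP I-NP I-LOCMinn",
]
bad = unexpected_tags(sample)
```

If the tag sits in a different column in the file being checked, pass that index as `column`.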

frreiss commented 3 years ago

@stefan-it thanks for finding that regression. We are tracking down the cause.

frreiss commented 3 years ago

BTW, the problem is that the output file is missing some line breaks, so a label and the token that follows it end up fused together. As a short-term workaround, it looks like you should be able to just add a newline after the label in each of the garbled entries. For example, I-LOCMinn becomes I-LOC\nMinn.
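A minimal sketch of that workaround in Python, assuming the only valid labels are the standard CoNLL-2003 entity tags. The `VALID_TAGS` list and the `unfuse` helper are hypothetical names used for illustration, not code from the repository.

```python
import re

# Assumption: the standard CoNLL-2003 entity tags are the only valid labels.
VALID_TAGS = ["O"] + [f"{p}-{t}" for p in ("B", "I")
                      for t in ("LOC", "MISC", "ORG", "PER")]

# Longest tags first, so the regex prefers the longest valid prefix.
_ALTERNATIVES = "|".join(
    re.escape(tag) for tag in sorted(VALID_TAGS, key=len, reverse=True)
)
_FUSED = re.compile(rf"^({_ALTERNATIVES})(.+)$", re.DOTALL)

def unfuse(label: str) -> str:
    """Re-insert the missing newline between a valid tag and a fused token."""
    if label in VALID_TAGS:
        return label  # already clean
    match = _FUSED.match(label)
    if match is None:
        return label  # not a recognised tag prefix; leave it untouched
    return f"{match.group(1)}\n{match.group(2)}"

repaired = unfuse("I-LOCMinn")
```

This should only be applied to values from the label column. Run on ordinary tokens, it would also split words that merely start with "O". Entries that do not begin with a known tag, such as `0` or `I-PP` in the list above, are left unchanged and need manual inspection.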

frreiss commented 3 years ago

I think we've added enough descriptive text to the tutorial to be able to close this issue.