rhdunn opened 1 year ago
Thanks for pointing these out—my sense is that email/forum data is just going to be messy sometimes, and the notion of a paragraph or sentence unit isn't always clear. I wasn't involved in the original data preprocessing, but if you have access to the LDC release that might be informative regarding these cases.
No, I don't have access to the LDC release.
Just to point out that during our work on https://universalpropositions.github.io/, I also came across many of these sentences, not only in the EWT corpus but also in OntoNotes. It's not clear to me what value keeping these sentences in the treebanks has.
I see your point that these don't add much, but if it's just a couple dozen sentences their effect will be minuscule. If we're going to go down the road of weeding out sentences that aren't really sentences, EWT has a ton that are just URLs, email signatures, or filenames. For a sample see https://universal.grew.fr/?custom=652b01f3387f2
Still, it would make the data more maintainable.
Part of the problem with these (and other partial sentences) is that they can cause sentence splitters (those using statistical, neural network, or dependency-parse-based models) to incorrectly split new sentences in some cases. -- It's on my list of things to do to detect incorrectly split sentences in model output, as I've seen this happen quite frequently, for example:
```
# text = Lawyer
# text = Bell was there and made one 'bout eight months 'fore he died.
```
My initial thinking for cases such as the URLs, email signatures, and so forth is to ensure that the next sentence is the start of a new paragraph. Though I've not tested this yet, nor written the validation rules/logic.
The URL sentence is valid, as are many of the others. The tricky case is when they combine with other sentences in the generated test data, resulting in the splitter making the wrong inferences -- this is where I suspect that having `newpar` markup on these will help.
The following sentences all have `# text = ?`:

- email-enronsent09_02-0034
- email-enronsent09_02-0036
- email-enronsent22_01-0067
- email-enronsent26_02-0021
- email-enronsent26_02-0023
- email-enronsent26_02-0025
- email-enronsent26_02-0028 -- this one looks like it may be due to invalid encoding processing of a U+00A0 (NBSP) from the preceding sentence
- email-enronsent32_01-p0009
- email-enronsent33_01-0130
- email-enronsent33_01-0132
- email-enronsent33_01-0145
There are also several other sentences where the text is just a single character. Many of these don't have `newpar` annotations or similar to ensure that they don't combine and interfere with the surrounding sentences:

- answers-20111108044917AALAHtc_ans-0005 -- `m`
- answers-20111108084416AAoPgBv_ans-0010 -- `%`
- answers-20111108090913AAf83Jh_ans-p0004 -- `1` (list separator)
- answers-20111108090913AAf83Jh_ans-0011 -- `2` (list separator)
- answers-20111108090913AAf83Jh_ans-0017 -- `3` (list separator)
- answers-20111108090913AAf83Jh_ans-0021 -- `4` (list separator)
- email-enronsent27_01-0049 -- `m`
- email-enronsent27_01-0064 -- `m`
- newsgroup-groups.google.com_alt.animals_1054ad831ec01b4c_ENG_20031204_144900-0002 -- `s`
The following look OK as they have `newpar` annotations on the sentence and the following sentence:

- email-enronsent09_02-p0006 -- `D`
- email-enronsent09_02-p0015 -- `D`
- email-enronsent21_01-p0007 -- `M`
- newsgroup-groups.google.com_HumorUniversity_00dd93cc9545deb3_ENG_20051130_122700-p0001 -- `*`
I don't know the best way to handle/mark up these so that they can be consistently processed (e.g. by tools that convert the CoNLL-U files to text), as there seem to be different cases here.