UniversalDependencies / UD_English-EWT


Several sentences just have a '?' or other single character. #415

Open rhdunn opened 1 year ago

rhdunn commented 1 year ago

The following sentences all have text = ?:

  1. email-enronsent09_02-0034
  2. email-enronsent09_02-0036
  3. email-enronsent22_01-0067
  4. email-enronsent26_02-0021
  5. email-enronsent26_02-0023
  6. email-enronsent26_02-0025
  7. email-enronsent26_02-0028 -- This one looks like it may be due to incorrect handling of a U+00A0 (no-break space) from the preceding sentence.
  8. email-enronsent32_01-p0009
  9. email-enronsent33_01-0130
  10. email-enronsent33_01-0132
  11. email-enronsent33_01-0145

There are also several other sentences where the text is just a single character. Many of these don't have newpar annotations (or similar markup) to ensure that they don't get combined with and interfere with the surrounding sentences:

  1. answers-20111108044917AALAHtc_ans-0005 -- m
  2. answers-20111108084416AAoPgBv_ans-0010 -- %
  3. answers-20111108090913AAf83Jh_ans-p0004 -- 1 (list separator)
  4. answers-20111108090913AAf83Jh_ans-0011 -- 2 (list separator)
  5. answers-20111108090913AAf83Jh_ans-0017 -- 3 (list separator)
  6. answers-20111108090913AAf83Jh_ans-0021 -- 4 (list separator)
  7. email-enronsent27_01-0049 -- m
  8. email-enronsent27_01-0064 -- m
  9. newsgroup-groups.google.com_alt.animals_1054ad831ec01b4c_ENG_20031204_144900-0002 -- s

The following look OK, as they have newpar annotations on both the sentence itself and the following sentence (a sketch of that markup is after the list):

  1. email-enronsent09_02-p0006 -- D
  2. email-enronsent09_02-p0015 -- D
  3. email-enronsent21_01-p0007 -- M
  4. newsgroup-groups.google.com_HumorUniversity_00dd93cc9545deb3_ENG_20051130_122700-p0001 -- *
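
For reference, in the CoNLL-U file those cases look roughly like this (the token annotation below is an illustrative placeholder, not copied from the treebank; fields are tab-separated):

```
# newpar
# sent_id = email-enronsent09_02-p0006
# text = D
1	D	D	PROPN	NNP	Number=Sing	0	root	0:root	_
```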

I don't know what the best way is to handle/mark up these so that they can be consistently processed (e.g. by tools that convert the CoNLL-U files to text), as there seem to be different cases here.
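
For concreteness, this is roughly the kind of conversion I have in mind -- a minimal sketch that relies only on the sentence-level comments (it assumes every sentence has a # text comment, treats # newpar as a paragraph break, and ignores SpaceAfter/SpacesAfter details):

```python
# Minimal sketch: rebuild plain text from a CoNLL-U file using only the
# comment lines. Sentences are joined with a single space; "# newpar"
# starts a new paragraph.
def conllu_to_text(path):
    paragraphs, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("# newpar"):
                if current:
                    paragraphs.append(" ".join(current))
                    current = []
            elif line.startswith("# text = "):
                current.append(line[len("# text = "):])
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

With that kind of conversion, a single-character sentence that isn't separated from its neighbours by newpar just gets glued into the surrounding paragraph, which is exactly the kind of behaviour I'd like to be able to rely on (or rule out) consistently.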

nschneid commented 1 year ago

Thanks for pointing these out—my sense is that email/forum data is just going to be messy sometimes, and the notion of a paragraph or sentence unit isn't always clear. I wasn't involved in the original data preprocessing, but if you have access to the LDC release that might be informative regarding these cases.

rhdunn commented 1 year ago

No, I don't have access to the LDC release.

arademaker commented 1 year ago

Just to point out that during our work on https://universalpropositions.github.io/, I also came across many of these sentences, not only in the EWT corpus but also in OntoNotes. The value of keeping these sentences in the treebanks is not clear to me.

nschneid commented 1 year ago

I see your point that these don't add much, but if it's just a couple dozen sentences their effect will be minuscule. If we're going to go down the road of weeding out sentences that aren't really sentences, EWT has a ton that are just URLs, email signatures, or filenames. For a sample see https://universal.grew.fr/?custom=652b01f3387f2

arademaker commented 1 year ago

Still, it would make the data more maintainable.

rhdunn commented 1 year ago

Part of the problem with these (and other partial sentences) is that they can cause sentence splitters (those using statistical, neural network, or dependency-parse-based models) to incorrectly split sentences in some cases. Detecting incorrectly split sentences in model output is on my list of things to do, as I've seen this happen quite frequently, for example:

# text = Lawyer
# text = Bell was there and made one 'bout eight months 'fore he died.
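
The check I have in mind is roughly the following (just a sketch, not existing code): compare gold and predicted sentence boundaries as character spans over the same raw text, and report any predicted sentence that ends strictly inside a gold sentence.

```python
# Rough sketch: report predicted sentences whose end falls strictly inside a
# gold sentence, i.e. spurious splits introduced by the splitter.
def spurious_splits(gold_spans, pred_spans):
    """gold_spans / pred_spans: lists of (start, end) character offsets."""
    bad = []
    for p_start, p_end in pred_spans:
        for g_start, g_end in gold_spans:
            if g_start <= p_start and p_end < g_end:
                bad.append((p_start, p_end))
                break
    return bad

# In the example above, the gold span covers the whole "Lawyer Bell was
# there ..." sentence, while the model predicts "Lawyer" as its own sentence;
# that fragment is what gets reported.
```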

My initial thinking for cases such as the URLs, email signatures, and so forth is to ensure that the next sentence is the start of a new paragraph. I've not tested this yet, though, nor written the validation rules/logic.
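
Something like the following is what I mean by the validation logic (the fragment pattern and field names are placeholders, not an existing UD validator rule):

```python
import re

# Placeholder pattern for "fragment" sentences: a single non-space character
# or a bare URL. A real rule would need tuning (signatures, filenames, etc.).
FRAGMENT = re.compile(r"^(?:\S|https?://\S+)$")

def check_fragments(sentences):
    """sentences: list of dicts with 'sent_id', 'text', and a boolean 'newpar'
    flag saying whether the sentence starts a new paragraph."""
    problems = []
    for cur, nxt in zip(sentences, sentences[1:]):
        # Flag fragments whose following sentence does not start a new
        # paragraph (cf. the cases above that look OK, which have newpar
        # on both the fragment and the next sentence).
        if FRAGMENT.match(cur["text"]) and not nxt["newpar"]:
            problems.append(cur["sent_id"])
    return problems
```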

The URL sentences are valid, as are many of the others. The tricky case is when they combine with other sentences in the generated test data, resulting in the splitter making the wrong inferences -- this is where I suspect that having newpar markup on these will help.