UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0
Allow representing all space characters of the original text in the CoNLL-U format. #332

Closed foxik closed 7 years ago

foxik commented 8 years ago

CoNLL-U format currently does not specify how to represent all space characters of the original plain text. It only specifies the SpaceAfter=No feature denoting that the current token is not followed by a space in the original text.

However, the ability to represent all space characters would be useful both for:

Therefore I propose to add the following two features to the MISC column:

The new features have names very similar to the existing SpaceAfter, but as they interact with each other, I think it is fine.

The new features can be present in any token (but not word, similarly to SpaceAfter) and have the following semantics:

The content of the SpacesBefore/SpacesAfter features has to be escaped in such a way that it never contains SPACE, TAB, CR, LF, PIPE characters, and it should encode SPACEs efficiently. Many possibilities come to mind:

Personally I vote for the last possibility -- C-like escaping with \s and \p. It is true that an existing escaping system would help with decoding, but 1) the existing systems seem to handle spaces quite inefficiently, 2) an existing system does not help with encoding, as it most likely would not escape a space, 3) decoding the proposed system is a trivial sequence of replaces, and 4) the proposed system has a unique representation of the original text (which is not true for many escaping systems, but I think it is a useful feature in this case).
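A minimal sketch of how such C-like escaping could be encoded and decoded. Only \s and \p are named in the proposal; the rest of the table (\t, \r, \n, \\) and the function names are assumptions made for illustration, not part of any agreed specification.

```python
import re

# Assumed escape table: only \s (space) and \p (pipe) are named in the
# proposal; the remaining entries follow the usual C conventions.
_ESCAPES = {" ": "\\s", "\t": "\\t", "\r": "\\r", "\n": "\\n", "|": "\\p", "\\": "\\\\"}
# Map the character after the backslash back to the raw character.
_UNESCAPES = {escaped[1]: raw for raw, escaped in _ESCAPES.items()}


def escape_spaces(raw: str) -> str:
    """Encode raw text so it contains no SPACE, TAB, CR, LF or PIPE."""
    return "".join(_ESCAPES.get(ch, ch) for ch in raw)


def unescape_spaces(encoded: str) -> str:
    """Invert escape_spaces in a single left-to-right pass."""
    return re.sub(
        r"\\(.)",
        # Unknown escapes are left untouched rather than rejected.
        lambda m: _UNESCAPES.get(m.group(1), m.group(0)),
        encoded,
    )
```

The decoder here scans once instead of chaining replaces, so that an escaped backslash followed by a literal "s" is not misread as an escaped space. Because every raw character has exactly one encoding, the representation stays unique.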

If a consensus is reached, I will document the new features on the http://universaldependencies.org/format.html site.

foxik commented 8 years ago

Also if a consensus is reached, it will be implemented in UDPipe.

dan-zeman commented 8 years ago

Why would you use \p? "|" is not a whitespace character, thus it should be a normal token and not a SpacesAfter attribute. On the other hand, people may need to encode TAB, CR and LF.

foxik commented 8 years ago

I can imagine that in some circumstances the SpacesAfter/SpacesBefore might contain non-space characters. Although this is not useful from the point of view of UD treebanks, it is useful when using CoNLL-U as a pipeline format. For example, if the input is some kind of XML file with elements encoding some meta information, it would make sense to consider the XML elements "non-tokens" and save them to SpacesAfter and SpacesBefore. I.e., <h1>Hello</h1> could be encoded as Hello\t...\tSpacesAfter=</h1>|SpacesBefore=<h1>

The names of the features can be misleading in this use case and maybe should be different, something like TextAfter/Before, NontokenAfter/Before or IgnoredAfter/Before. But I did not particularly like any of those, and SpacesAfter corresponds to SpaceAfter with which it interacts, so I proposed SpacesAfter/Before, while assuming it could contain non-space characters.

I am definitely open to different feature names, if you have any suggestions :-)

dan-zeman commented 8 years ago

Fine with me. Then it actually is a "space" from the point of view of treebank annotation. But this should be briefly mentioned in the documentation, too.

foxik commented 8 years ago

Noted, I will mention it if the proposal gets in.

Do you have any thoughts about the escaping scheme?

dan-zeman commented 8 years ago

Not sure whether we also want/have to escape the other space characters in the higher Unicode space. Otherwise, the \s etc. sound reasonable to me. First, it is short => readable. Second, it is what I often use elsewhere (but this is quite a subjective argument of course).

Note that the MISC attributes are not required to appear in alphabetical order (unlike FEATS) and I think it is OK so. Thus we may also use the logical order here, i.e. SpacesBefore=...|SpacesAfter=....

As this is something that does not directly interfere with existing UD guidelines, nor with released data, I think we do not even have to wait for v2 guidelines with it. But let's wait to see whether the other UD'ers have something to say here.

liljao commented 8 years ago

I definitely second this proposal! Based on our recent effort to align the Norwegian treebank with the corresponding raw texts we felt there was important information lost regarding spacing.

Just a minor detail: should there be an order of preference for spacesBefore=...|spacesAfter=...? E.g. in sentence1 \n sentence2, should we use spacesAfter on the last token of sentence1 or spacesBefore on the first token in sentence2?

foxik commented 8 years ago

Regarding the preference -- the current proposal allows SpacesBefore on the first token of every sentence. Therefore, both possibilities in the above example are valid.

In theory it would be possible to allow SpacesBefore on the first token of the first sentence only (i.e., in the above example, SpacesAfter on last token of sentence1 would have to be used). However, such a rule is inconvenient in situations, when multiple "documents" (from a logical point of view, marked for example using sent_id) are encoded in one CoNLL-U file, because in this situation it makes sense to use SpacesBefore on beginning of every (logical) document.

Another example is an NLP pipeline, which processes data "on the fly" -- in this situation, every paragraph of the plain text can be processed individually (sentences cannot go over paragraph boundary). If a paragraph starts with spaces, it is more convenient to store the spaces using SpacesBefore on the first token of this paragraph, than in SpacesAfter on the last token of the previous sentence (because that would mean the previous sentence could not be processed until a following non-space character is found).

So I would say that SpacesAfter are preferred if possible, but SpacesBefore are allowed on first token of any sentence.

sebschu commented 8 years ago

Yes, I also think that would be a great thing to have (we already have something similar in CoreNLP so that people can reconstruct sentences after tokenization) and I'm also in favor of C-like escaping.

Instead of having the SpacesBefore attribute as a token property, we could also consider having an optional sentence-level comment before each sentence. Then we wouldn't have this token-level property that is only allowed to be used on the first token, which might be confusing to some people. Or we could just allow it on every token.

dan-zeman commented 8 years ago

I would not make it a sentence-level comment when the remaining pieces are token-level attributes. Allowing SpacesBefore on every token sounds OK to me, in the sense that the validator would not complain, and processing software should be able to deal with it. Generating it on non-first tokens would not be recommended in the normal cases, but e.g. when it is used to hide XML markup, as @foxik suggests above, it may be sometimes even more intuitive, so I would not ban it.

foxik commented 8 years ago

Allowing SpacesBefore everywhere sounds reasonable (actually, the same argument I used for allowing SpacesBefore in every sentence can be used for allowing it on every token, as Dan does above).

I will prepare the changes as a pull request, so that others can discuss them. (But probably at the beginning of the next week.)

andre-martins commented 8 years ago

Is SpacesBefore/SpacesAfter covering all cases of interest? Some tokenizers do some small reorderings sometimes (e.g. switching double quotes and punctuation, depending on the language) and normalize characters (e.g. punctuation and some unicode characters). Do we want to support this? The standard approach here is to have the full sentences somewhere (maybe in a separate file) and to encode the character-level start/end positions of each word token.

fginter commented 8 years ago

Hi

Sorry I'm a bit late to this discussion. I think @andre-martins follows the right lead here, with the character offsets. I myself would much rather see that, than a complex mixture of SpacesBefore/SpacesAfter and escaping. A single attribute charoff=3:7 saying that characters 3,4,5,6 of the original sentence form this token, would imho cover all situations.
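A small sketch of how a charoff=3:7 attribute could be read back against the original sentence. The end-exclusive range follows from "characters 3,4,5,6"; zero-based counting and the function name are assumptions for illustration.

```python
def token_from_charoff(sentence: str, charoff: str) -> str:
    """Return the slice of the sentence named by a 'start:end' offset pair.

    Offsets are assumed to be zero-based with an exclusive end, so that
    '3:7' selects characters 3, 4, 5 and 6.
    """
    start, end = (int(part) for part in charoff.split(":"))
    return sentence[start:end]
```

With the sentence "My blue car", the attribute charoff=3:7 would select the token "blue" under these assumptions.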

Filip

dan-zeman commented 8 years ago

SpacesBefore/After would probably be a rare feature while CharOff would have to be present at every single token in the data (and the full sentence would have to be there in addition, of course).

But I agree that if replacements and reorderings have to be supported then this is the way to go. Not sure whether they have to, though.

foxik commented 8 years ago

I think both use cases are useful at times. Sometimes you want to reconstruct the data at some point (for example when you started with a XML) and you would like all the data to be in the document -- then SpacesBefore and SpacesAfter seem the straightforward way to do it.

Note that if we want to use something like TokenRange=3:5, we should also specify how to encode the original sentence -- the easiest would probably be to encode the sentence in a comment (only the \r and \n would have to be escaped).

When I started with the proposal, I was thinking about suggesting both alternatives, but then I decided to go with just the SpacesBefore/After. But you are right, the SpacesBefore/After do not support reorderings and replaces.

BTW, if we want the TokenRange to be more effective (and also playing together with SpaceAfter), we could define a reasonable default -- the length would be the length of the form, and the offset would be 0 for the first token, and either last_token_end+1 or last_token_end (depending on whether previous token has SpaceAfter=No). This way, the TokenRange would be needed only when 1) the original token has different length than the form, or 2) there is more than 1 "space" character between the tokens.
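The default described above could be computed as in the following sketch. It assumes zero-based offsets with an exclusive end and represents each token as a hypothetical (form, space_after) pair, where space_after is False when the token carries SpaceAfter=No.

```python
from typing import List, Tuple


def default_token_ranges(tokens: List[Tuple[str, bool]]) -> List[Tuple[int, int]]:
    """Compute the default (start, end) range of every token.

    Each token is a (form, space_after) pair. The first token starts at
    offset 0; every later token starts at the previous end, plus one if the
    previous token was followed by a space.
    """
    ranges = []
    prev_end = 0
    prev_space_after = False  # nothing precedes the first token
    for form, space_after in tokens:
        start = prev_end + (1 if prev_space_after else 0)
        end = start + len(form)  # default length is the length of the form
        ranges.append((start, end))
        prev_end, prev_space_after = end, space_after
    return ranges
```

An explicit TokenRange would then be written only for tokens whose real range differs from this default, i.e. when the original token has a different length than the form or when more than one "space" character separates two tokens.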

So if we want a more general correspondence with the original text, I would:

Thoughts?

fginter commented 8 years ago

I think the defaults are a good idea, so we do not clutter the file. Orig sentence as a comment is I think the way to go (we already do this for Finnish). I know that some treebanks will not be able to provide the orig sentence, but will be able to provide the SpaceAfter=No (legal reasons), so that usecase is also supported.

foxik commented 8 years ago

(BTW, even if you could not provide the original sentence, but could provide the spaces only, it would still be possible to do it -- the sentence would consist of spaces only and the TokenRanges would have zero length :-)

Note that "standardizing" the sentence in the comment interacts with #273 -- especially the format. The last suggestion in #273 is ^# ([a-z_]+) *= *(.*)$, but that has a serious disadvantage -- the spaces at the beginning of the sentence cannot be represented (because all spaces after the = are "eaten up"). Also #273 suggests sentence (and some other variants) as the name of the attribute containing the sentence text.

In order to deal with the spacing issue, I suggest we use ^# [a-z_0-9]+=(.*)$ in #273, and also define escaping rules (\r, \n and \\). Then I also suggest to use text instead of sentence to store the text of the sentence (because sentence suggests also other meanings like sentence id, while text does not).
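The difference between the two patterns can be checked with a short sketch. Both regular expressions are copied from the comment above; the variable names and sample lines are illustrative only.

```python
import re

# Pattern from #273: optional spaces around "=" are consumed by the regex.
LOOSE = re.compile(r"^# ([a-z_]+) *= *(.*)$")
# Pattern suggested here: no spaces around "=", so the value is kept verbatim.
STRICT = re.compile(r"^# [a-z_0-9]+=(.*)$")

# Leading spaces of the value are swallowed by " *" after the "=".
loose_value = LOOSE.match("# text =   Hello").group(2)

# Leading spaces survive because everything after "=" belongs to the value.
strict_value = STRICT.match("# text=  Hello").group(1)

# A line with a space before "=" does not match the stricter pattern at all.
strict_rejects_spaced = STRICT.match("# text =  Hello") is None
```

Here loose_value loses the three leading spaces while strict_value keeps its two, which is the property needed to represent spaces at the beginning of a sentence.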

foxik commented 8 years ago

BTW, it seems to me that the SpacesBefore/SpacesAfter are simpler than the sentence_text+ranges+defaults [because they are quite straightforward and do not require sentence-level comments], and slightly more efficient when no reorderings and replacements are used [as they do not need to store the original tokens]; however, they do not support reorderings and replacements. One possibility would be to handle the reorderings/replacements differently (replacements by using for example OriginalToken=... and reorderings like ReorderOffset=).

On the other hand, storing the whole sentence in the comment seems to be a commonly asked-for feature.

dan-zeman commented 8 years ago

@foxik I second that.

I don't see why we should support reordering. To me it is something else than what UD is about, and people who want to do it in CoNLL-U should define the mechanism separately.

Supporting replacement sounds reasonable, it also interacts with handling of typos, see #330, but then something like OriginalToken=... is quite OK.

Full text of sentence (the # text comment) can still be provided; as long as you do not count inter-sentence whitespace as part of any sentence, it is much easier. That could be covered by the SpacesBefore/After attributes, which would be the main device for reconstructing the source text.

fginter commented 8 years ago

For the record: I think the current SpaceAfter=No was just fine as it was and I personally do not see CoNLL-U as a universal data communication format. If the original sentence text is given as a comment, the tokens can be found in it, if someone really cares about extra spaces. So my primary vote would go to "include the sentence as a comment, leave the rest as is".

But if this still needs to be redesigned... ...the SpacesBefore/After just seems quite opaque with all the escaping, ambiguities of where SpacesBefore can be used, etc. Character offsets support all situations I can think of.

martinpopel commented 8 years ago

I would like to know what are the use cases for

Of course, all these are necessary for reconstructing the original raw text, but my question is: for what real-world use cases do you need to reconstruct it so exactly?

Even with all the suggestions above, we may not be able to reconstruct the original fully: the original may not be in utf8 (not even in Unicode) or it may contain invalid Unicode encoding. Note also that for preserving the original (XML) markup SpacesAfter is not enough. There may be markup inside tokens.

As always there are pros and cons. By preserving these distinctions we complicate more or less the CoNLL-U specification. I can imagine spaces between sentences are important for training segmenters on noisy data (where such spaces are missing), but if this is the only application, we could sacrifice the possibility of using CoNLL-U for such training.

That said, I like @fginter 's suggestion "include the sentence as a comment, leave the rest as is". If someone needs character offsets for the explicit alignment, they can always use it, but I am not sure we should standardize it now.

amir-zeldes commented 8 years ago

I think the use case for preserving whitespace has to do with aligning the dependencies to a different format. Some multilayer corpora might have typographical or document-structural annotations that refer to stand-off offsets in an original plain text or HTML file, often with XML + x-pointer (examples include the PAULA XML format and GrAF). NLP tools might also need to return dependency information back to a representation that preserves whitespace (for example, the spacy toolchain keeps it throughout).

Even within a pure treebank, sometimes the reason for an automatic sentence split by a sentence tokenizer is based on multiple white space (think of a date and author line not separated by period, but by multiple white spaces), so that the whitespace lets you reconstruct why sentence splitting was done in a certain way.

Also @dan-zeman - I know you opposed delimiters for original sentence text in #273 but maybe it's worth considering it given some of the problems that SpacesBefore/After and offsets raise. Then offsets could be marked only if they are non-trivial with respect to the original sentence text, and normal spaces before and after could go unannotated.

dan-zeman commented 8 years ago

@amir-zeldes - I would personally prefer saying that characters before and after the last token of a sentence are not part of the sentence. If it is necessary to store such inter-sentence material, it should not go to the text attribute, and (on a second thought) I would not put it to individual tokens either. I would put it to sentence-level comment different from text. And I now lean towards not standardizing it here.

foxik commented 8 years ago

I agree that having a sentence comment relying on spaces is weird.

Note that there seemed to be people interested in the functionality (preserving multiple spaces). There are also people in companies asking us about this. I think it would be a shame not to deal with the multiple spaces between tokens just because it seems complicated and inelegant.

Personally I am in favor of the original proposal -- the SpacesAfter and SpacesBefore. The cons that have been brought up:

Pros from my point of view: except for the technical issue of escaping, SpacesBefore/SpacesAfter seem to be straightforward and obvious. If someone is happy with SpaceAfter, they can keep using it, but if they need more functionality, they can extend it easily.

I think it makes little sense to standardize the character offsets without allowing spaces in various places of the original sentence text (i.e., disallowing newlines inside and at the end, which is not possible without newlines) -- it cannot be used to preserve the spaces.

Just a motivation for preserving spaces at the end of a sentence -- in English, some sentence breaks seem to be forced by paragraph ends in the text. If we had the full spacing information, we would know it, and would not train the segmenter on these cases.

foxik commented 8 years ago

I am not sure what the current consensus is. To sum up:

Personally I am still interested in the proposal (the SpacesAfter/SpacesBefore with escaping), for the following reasons:

I understand that I can use the feature without standardizing it, but I believe it is useful to more people, so I think it would be better to standardize it (and not let multiple people reinvent it, each dealing with the tricky corners like escaping and inter-sentence spaces differently). Also note that if someone is not interested in more details than SpaceAfter=No provides, they are not affected at all.

Any thoughts, anyone?

fginter commented 8 years ago

Hi. My thoughts:

How about?

F

fginter commented 8 years ago

A text like:

I have
a dog.

The dog
is red.

would then look like

#text: I have\na dog.\n\n
1 I
2 have
...

#text: The dog\nis red.
1 The
2 dog
...

Note:

foxik commented 8 years ago

Thanks @fginter for writing. My thoughts:

BTW, note that the "sentence in the comments" approach has the same issues as SpacesBefore/SpacesAfter -- there will be some escaping, and it is ambiguous where to put inter-sentence spaces. Also both play nicely with SpaceAfter=No. So I believe the issues are generic for the problem we are trying to solve.

From my point of view, SpacesBefore/SpacesAfter are still the first choice -- the information is directly attached to the tokens (and not in the comments), so you do not need to reconstruct the information if you are interested in knowing "what spaces followed this token". In the "sentence in the comment" approach, even for the "what spaces follow the sentence" question you have to do something like finding the last token in the sentence text first, because it is nontrivial to say which characters at the end of the sentence text do not belong to the last token (some space-like characters are not in the Zs Unicode category -- if you find a character in the Cf category, which may really happen, does it belong to the token or is it a space between sentences?). Also, for common cases (single spaces between tokens), the SpacesBefore/SpacesAfter are more efficient (they would not be present, except probably for SpaceAfter on the last token; while the sentence text has to repeat all tokens and spaces).

PS:

If a tokenizer really finds it important to replace double quotes with something else, why not use the token-word distinction. I.e. the token is whatever is in the sentence, and the word is whatever the tokenizer wants to make it and the word is then part of the tree.

Yes, using token-word distinction for replacements is an elegant approach I was thinking about in the past. However, it is not obvious whether it is allowed by the current CoNLL-U format, because only "multi-word" tokens seem to be allowed. Specifically, I assume that single-word tokens would use a range like x-x, so something like:

1-1 ˝
1 "

If you think it is worth it (I do, so that would be two of us), we could create another issue discussing it.

fginter commented 8 years ago

I think

1-1 ˝
1 "

would be quite okay.

fginter commented 8 years ago

PS:

foxik commented 8 years ago

Thanks @fginter for your answers. I think we both know each other's positions (and that they are different :-)

Any other opinions and preferences, anyone?

spyysalo commented 8 years ago

# text: (or # sentence-text:) is already used and there appears to be general support for standardizing it in #273. As this proposal has (partially) overlapping goals and no clear consensus, I would suggest to postpone decisions until # text: is standardized (or removed, if so happens).

After we know how # text: gets defined, we can re-examine whether there are use-cases for spacesAfter and spacesBefore that it doesn't cover.

I would generally prefer to avoid redundant ways to specify the same information, as this adds unnecessary complexity and ways to have internally inconsistent data.

(FWIW, although I'm not deeply invested in this feature, I'm not thrilled with SpaceAfter and would prefer not to extend it; it just doesn't feel like the Right Thing™ to me.)

dan-zeman commented 8 years ago

OK, here is my position:

martinpopel commented 8 years ago

And here are my opinions:

In general (probably as everyone here), I would like to keep CoNLL-U simple, uncluttered and intuitive, especially in the typical use case. For this reason, I was originally against SpacesAfter and SpacesBefore as it complicates the specification and is redundant once we have # text =. However, if we want to standardize storing spaces between sentences, the original @foxik's proposal sounds good - there is no need to store \s at the end of each # text = because one space after each token is the default case. So SpacesAfter will be used only if there are multiple spaces between tokens or other whitespace than space. This is relatively rare, so a typical file will remain uncluttered. Thus my current position is neutral.

amir-zeldes commented 8 years ago

I also agree that:

dan-zeman commented 8 years ago

FWIW, I do not like abusing the multi-word token ranges for single-word replacements (1-1). These are meant for grammar-conditioned contractions. Mere presence of such lines says something about the language, while this would be just about the particular tokenization/normalization approach. Although it could be distinguished by i being equal to j, I strongly oppose it.

Any modifications relating to just one word should stay at the line dedicated to that word. As said earlier, this relates to my proposal in #330. A MISC attribute containing the modified text (if the original stays in FORM; or the original text, if FORM is modified) is all that is needed.

reckart commented 7 years ago

I might have missed it - this thread is quite long - but did you consider text between sentences, e.g.

#text: I have a dog.
1 I
2 have
...

#text: \n\n  

#text: The dog is red.
1 The
2 dog
...

So here would be a segment of text with two line breaks between the two sentences.

Cf. discussion on this over at WebAnno: https://github.com/webanno/webanno/issues/313#issuecomment-232486111

dan-zeman commented 7 years ago

We did not reach a consensus (I think) but we did consider it. If it is represented it will be at least technically part of either the preceding or the following sentence. Whether it is part of the #text comment (which I personally do not welcome) or some other type of comment (either sentence-level comment, or token-level MISC attribute), is mostly what this thread is about.

spyysalo commented 7 years ago

Closing as v2 is now being published.

For reference, comments used to specify the original sentence are in v2 with the format # text = (http://universaldependencies.org/format.html). These permit but do not standardize escape sequences (e.g. # text = ...\n\n). SpaceAfter is likewise in and its use is strongly encouraged:

[...] information on original word segmentation should be kept if available. Every token after which there was no space in the original text should contain SpaceAfter=No in its MISC field.

SpacesAfter and SpacesBefore are not part of the v2 standard, but as the MISC field is free-form excepting the minimal constraint that it "has to be formatted as a list that can be split on the bar character (|) without special escaping", particular applications are not prohibited from using these.
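As an illustration of what the v2 rules allow, the following sketch splits a MISC field on the bar character and rebuilds the surface text of one sentence from SpaceAfter=No alone. The function names and the (form, misc) pair representation are assumptions for this example; application-specific attributes such as SpacesAfter are simply carried along in the parsed dictionary and ignored.

```python
from typing import Dict, Iterable, Tuple


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a MISC field on '|' into a dict; '_' marks an empty field."""
    if misc == "_":
        return {}
    pairs = (item.partition("=") for item in misc.split("|"))
    return {key: value for key, _, value in pairs}


def detokenize(tokens: Iterable[Tuple[str, str]]) -> str:
    """Rebuild sentence text from (form, misc) pairs using SpaceAfter=No."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        # One space after each token is the default; SpaceAfter=No suppresses it.
        if parse_misc(misc).get("SpaceAfter") != "No":
            parts.append(" ")
    # Dropping the trailing space is a choice of this sketch, not of the format.
    return "".join(parts).rstrip(" ")
```

For the tokens I, have, a, dog (with SpaceAfter=No) and the final period, this yields "I have a dog."; multiple spaces or line breaks in the original text cannot be recovered this way, which is the gap the thread discusses.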